Marvell develops custom HBM memory solutions — interface shrinks and higher performance on the menu

(Image credit: Marvell)

At its Analyst Day 2024, Marvell announced a custom high-bandwidth memory (CHBM) solution for its custom XPUs designed for AI applications. Developed in partnership with leading memory makers, CHBM promises to optimize performance, power, memory capacity, die size, and cost for specific XPU designs. CHBM will be compatible with Marvell's custom XPUs and will not be part of a JEDEC-defined HBM standard, at least initially.

Marvell's custom HBM solution allows interfaces and stacks to be tailored for a particular application, though the company has not disclosed any details. One of Marvell's goals is to reduce the real estate that industry-standard HBM interfaces occupy inside processors, freeing up that area for compute and features. The company asserts that with its proprietary die-to-die I/O, it will not only be able to pack up to 25% more logic into its custom XPUs, but also potentially install up to 33% more CHBM memory packages next to the compute chiplets, increasing the amount of DRAM available to the processor. In addition, the company expects to cut memory interface power consumption by up to 70%.
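To put those percentages in perspective, here is a minimal back-of-the-envelope sketch in Python. The baseline logic area, stack count, per-stack capacity, and interface power below are purely hypothetical placeholders chosen for illustration; Marvell has not published absolute figures.

    # Rough math for Marvell's claims, using hypothetical baseline figures.
    baseline_logic_area_mm2 = 600.0    # assumed logic area of a large XPU compute die
    baseline_hbm_stacks = 6            # assumed HBM stacks around the compute chiplets
    stack_capacity_gb = 24             # assumed capacity per stack (e.g., 8-high HBM3E)
    baseline_interface_power_w = 30.0  # assumed power burned in the HBM interfaces

    logic_area = baseline_logic_area_mm2 * 1.25           # "up to 25% more logic"
    stacks = round(baseline_hbm_stacks * 1.33)            # "up to 33% more CHBM stacks"
    interface_power = baseline_interface_power_w * 0.30   # "up to 70% lower interface power"

    print(f"logic area:      {baseline_logic_area_mm2:.0f} -> {logic_area:.0f} mm^2")
    print(f"memory stacks:   {baseline_hbm_stacks} -> {stacks} "
          f"({baseline_hbm_stacks * stack_capacity_gb} -> {stacks * stack_capacity_gb} GB)")
    print(f"interface power: {baseline_interface_power_w:.0f} -> {interface_power:.0f} W")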

Because Marvell's CHBM does not rely on a JEDEC-specified standard, on the hardware side it will require a new controller, a customizable physical interface, new die-to-die interfaces, and overhauled HBM base dies. The new Marvell die-to-die HBM interface will have a bandwidth of 20 Tbps/mm (2.5 TB/s per mm), a significant increase over the 5 Tbps/mm (625 GB/s per mm) that HBM offers today, based on a slide from the company's Analyst Day published by ServeTheHome. Over time, Marvell envisions bufferless memory with a 50 Tbps/mm (6.25 TB/s per mm) interface.
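The Tbps-to-TB/s conversions quoted above are a straight divide-by-eight; a quick sketch to check the numbers and the relative uplift:

    # Convert the quoted per-millimeter bandwidth figures from Tbps to TB/s.
    def tbps_to_tb_per_s(tbps: float) -> float:
        return tbps / 8.0  # 8 bits per byte

    figures = [("standard HBM today", 5), ("Marvell die-to-die", 20), ("future bufferless", 50)]
    for label, tbps in figures:
        print(f"{label}: {tbps} Tbps/mm = {tbps_to_tb_per_s(tbps):.3f} TB/s per mm")

    print(f"uplift vs. today: {20 / 5:.0f}x now, {50 / 5:.0f}x for bufferless memory")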

Marvell does not specify how wide the CHBM interface will be, and it discloses few other details beyond saying that the technology 'enhances XPUs by serializing and speeding up the I/O interfaces between its internal AI compute accelerator silicon dies and the HBM base dies,' which somewhat implies a narrower interface than industry-standard HBM3E or HBM4 solutions. In any case, it looks like CHBM solutions will be customizable.
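Since aggregate bandwidth is simply interface width multiplied by per-pin data rate, 'serializing' the link would mean trading width for speed. The sketch below illustrates the idea; the 256-bit, 38.4 Gb/s configuration is a made-up example for comparison, not a disclosed CHBM specification.

    # Aggregate bandwidth = interface width (bits) x per-pin data rate.
    def bandwidth_tb_s(width_bits: int, gbps_per_pin: float) -> float:
        return width_bits * gbps_per_pin / 8 / 1000  # bits -> bytes, GB/s -> TB/s

    # Industry-standard HBM3E: 1024-bit interface at roughly 9.6 Gb/s per pin.
    print(f"HBM3E stack:                         {bandwidth_tb_s(1024, 9.6):.2f} TB/s")

    # Hypothetical serialized link: a quarter of the width at four times the
    # per-pin rate delivers the same aggregate bandwidth over far fewer pins.
    print(f"256-bit at 38.4 Gb/s (hypothetical): {bandwidth_tb_s(256, 38.4):.2f} TB/s")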

"Enhancing XPUs by tailoring HBM for specific performance, power, and total cost of ownership is the latest step in a new paradigm in the way AI accelerators are designed and delivered," said Will Chu, Senior Vice President and General Manager of the Custom, Compute and Storage Group at Marvell. "We are very grateful to work with leading memory designers to accelerate this revolution and, help cloud data center operators continue to scale their XPUs and infrastructure for the AI era."

Working with Micron, Samsung, and SK hynix is crucial for the successful implementation of Marvell's CHBM, as it sets the stage for relatively widespread availability of custom high-bandwidth memory.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • abufrejoval
    That's potentially a very smart money move, perhaps also technologically!

    It must have been irking both the HBM and the ASIC makers that only NVidia was making so much money, when the main cost driver and performance enabler was the HBM.

    So if they can make an AI ASIC at breakthrough cost via proprietary memory technology, they can stitch up buyers and keep the GPU competitors out.

    Of course they could still just wind up in a blind alley, but at least they have the balls.
    Reply
  • bit_user
    I sometimes wonder how much more performance is possible, if you ditch traditional DRAM interface protocols and use a lower-level interface. Perhaps that's the main advantage PIM (processing in memory) solutions are getting, but I wonder if it really needs to be limited to PIM. So, perhaps those are a lot of the gains they're reaping?
    Reply
  • abufrejoval
    bit_user said:
    I sometimes wonder how much more performance is possible, if you ditch traditional DRAM interface protocols and use a lower-level interface. Perhaps that's the main advantage PIM (processing in memory) solutions are getting, but I wonder if it really needs to be limited to PIM. So, perhaps those are a lot of the gains they're reaping?
    DRAM is very low-level; there aren't layers of abstraction you can eliminate.

    But in its original form it's one bit per chip per cycle, out of a matrix that got bigger and bigger.

    And all speedups since have been about getting groups of bits instead of single ones, or reusing active row buffers when the row doesn't change from one access to the next.

    HBM has been about stacking DRAM chips and going ultra-wide to get them all in parallel for bandwidth. PIM is about doing some processing on the HBM (or HMC!) base die, where signals from the various dies can still arrive in parallel and where that base die can have transistors etched into it basically for free. The width and signal lengths towards the host would then be much more constrained, because of physics and the nature of the RAM sockets being what they are.

    The main interest is that you can do some of this PIM without touching the CPUs, simply by turning address lines into op-codes and/or even downloading deterministic finite-state-machine configuration data to a reserved address space for more programmability.

    DRAM row buffers could become very wide ALUs if you use more than one and then have them perform operations between each other with full row parallelism, etc. (a toy sketch of that idea follows this reply).

    Other, more radical variants of PIM employ RAM that actually includes some degree of Boolean or finite-state logic. Micron once presented a really interesting variant of that some years ago to Bull, which included me at the time.

    But that stuff wasn't particularly suited to AI or LLMs, while there are lots of actors trying to do exactly that. Some names have popped up every now and then for more than a decade; none have successful products at scale so far.

    Coming back to your original point: the aim is mostly to do exactly the opposite, putting more abstraction into the RAM chips themselves and adding layers that enable you to extract results rather than raw data.
    Reply
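    A minimal, purely conceptual Python sketch of the 'addresses as op-codes, row buffers as wide ALUs' idea described above. The ToyPIMBank class, its two row buffers, and the string op-codes standing in for decoded addresses are illustrative assumptions and do not model any shipping PIM device.

        import numpy as np

        ROW_BITS = 1024  # one row buffer == one very wide operand

        class ToyPIMBank:
            def __init__(self, rows: int):
                # A bank of DRAM rows plus two row buffers that double as ALU operands.
                self.array = np.random.randint(0, 2, size=(rows, ROW_BITS), dtype=np.uint8)
                self.buf_a = np.zeros(ROW_BITS, dtype=np.uint8)
                self.buf_b = np.zeros(ROW_BITS, dtype=np.uint8)

            def activate(self, buf: str, row: int):
                # "Opening" a row copies it into one of the row buffers.
                setattr(self, f"buf_{buf}", self.array[row].copy())

            def op(self, opcode: str):
                # A reserved address decoded as an op-code operates on both buffers
                # at once, with full row parallelism (1024 bit-wise results here).
                if opcode == "AND":
                    self.buf_a &= self.buf_b
                elif opcode == "XOR":
                    self.buf_a ^= self.buf_b

        bank = ToyPIMBank(rows=16)
        bank.activate("a", 3)
        bank.activate("b", 7)
        bank.op("XOR")
        print("popcount of row 3 XOR row 7:", int(bank.buf_a.sum()))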
  • bit_user
    abufrejoval said:
    DRAM is very low-level; there aren't layers of abstraction you can eliminate.
    It still limits you to accessing one row at a time, for instance. Could there be multiple row buffers, to enable greater concurrency? Does only one bank have to be active at a time?

    abufrejoval said:
    Coming back to your original point: the aim is mostly to do exactly the opposite, putting more abstraction into the RAM chips themselves and adding layers that enable you to extract results rather than raw data.
    That's PIM, though. Not what Marvell is talking about.

    I think a better move is to stack DRAM on more general-purpose compute dies, giving you PIM-like efficiency with almost the same level of generality we enjoy today.
    Reply
  • abufrejoval
    bit_user said:
    It still limits you to accessing one row at a time, for instance. Could there be multiple row buffers, to enable greater concurrency? Does only one bank have to be active at a time?
    That's what I also imagined first and discussed a bit with colleagues from Bull and the CEA before the AI boom hit.
    And it's what I tried to describe with that super-wide ALU doing row-by-row computations.

    What you could do there is things like weight accumulations, but you have to go towards something much more fine-grained to be more generic. That's what Micron's Automata Processor wanted to do, which I believe I've mentioned before.

    Upmem has been at it also, perhaps for 20 years by now, but unless I look for them, I never see them topping headlines, even if it always looks as if they had a product ready to use last month.

    bit_user said:
    That's PIM, though. Not what Marvell is talking about.
    Yes, Marvell is talking but not saying anything yet, so it's hard to compare. For me it's really just the fact that they are trying to drill into that HBM fortress, and they are doing it from the perspective of a company that designs AI ASICs for Google and others.

    That's a lot like Nvidia starting on a new memory technology instead of trying to break into Windows laptops.

    Just the announcement in itself is remarkable because of who is involved, not what they say.

    And you have to remember that people could not buy Blackwells for nearly all of 2024, because they had all been allocated to OpenAI and a few others, who had already signed contracts with NVidia.

    Yet the bottleneck on numbers wasn't TSMC or the NVidia chips; it was the HBM memory chips, or the limited capacity for their assembly, which restricted the market and drove the numbers.

    So why was NVidia the only one making a killing if they weren't even producing the most precious resource?

    That's what Marvell is targeting, and it really only needs to be as good as HBM and more freely available to burst some bubbles. If it is better too, the market might shift significantly.

    bit_user said:
    I think a better move is to stack DRAM on more general-purpose compute dies, giving you PIM-like efficiency with almost the same level of generality we enjoy today.

    However you break the memory wall without breaking the bank is good.
    Reply