SK Hynix Samples 24GB HBM3 Modules: Up to 819 GB/s

(Image credit: SK Hynix)

SK Hynix has announced that it has developed the industry's first 12-layer 24GB HBM3 memory stacks, which combine high density with a bandwidth of 819 GB/s. The 12-Hi HBM3 products maintain the same height as the company's existing 8-layer HBM3 parts, which makes them easy to deploy.

SK Hynix's 24GB HBM3 known good stack die (KGSD) product places twelve 16Gb memory dies, connected using through-silicon vias (TSVs), on a base layer with a 1024-bit interface. The device features a 6400 MT/s data transfer rate, so the whole 24GB HBM3 module offers 819.2 GB/s of bandwidth.

Depending on the actual memory subsystem, such modules can enable 3.28 TB/s to 4.92 TB/s of bandwidth: 96GB of memory over a 4096-bit interface, or 144GB of memory over a 6144-bit interface, respectively. To put the numbers into context, Nvidia's H100 NVL, the most advanced HBM3 implementation to date, offers 94GB of memory with 3.9 TB/s of bandwidth for each of its two GH100 compute GPUs.
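
For reference, here is a rough sketch of the arithmetic behind those figures; the stack counts of four and six are assumptions inferred from the quoted bus widths:

```python
# Rough arithmetic behind the per-stack and aggregate figures above.
# The four- and six-stack configurations are assumptions inferred from
# the 4096-bit and 6144-bit bus widths.

BITS_PER_STACK = 1024      # HBM3 interface width per stack
DATA_RATE_MTS = 6400       # per-pin data rate in mega-transfers per second
STACK_CAPACITY_GB = 24     # 12 x 16Gb DRAM dies

per_stack_gbs = BITS_PER_STACK * DATA_RATE_MTS / 8 / 1000   # GB/s
print(f"per stack: {per_stack_gbs:.1f} GB/s")               # 819.2 GB/s

for stacks in (4, 6):
    bus = stacks * BITS_PER_STACK
    capacity = stacks * STACK_CAPACITY_GB
    bandwidth = stacks * per_stack_gbs / 1000                # TB/s
    print(f"{stacks} stacks: {bus}-bit bus, {capacity} GB, {bandwidth:.2f} TB/s")
    # 4 stacks: 4096-bit bus, 96 GB, 3.28 TB/s
    # 6 stacks: 6144-bit bus, 144 GB, 4.92 TB/s
```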

Stacking 12 layers of HBM DRAM on top of each other is challenging for several reasons. Firstly, it is hard to drill the 60,000 or more TSV holes needed to connect all twelve layers. Secondly, 12-Hi HBM DRAM packages cannot get physically taller than 8-Hi HBM KGSDs (typically 700 – 800 microns; 720 microns in the case of Samsung), so it would be extremely complicated, if possible at all, to install such HBM3 KGSDs next to a CPU or a GPU of fixed height. To that end, DRAM makers like SK Hynix need to either reduce the thickness of an individual DRAM layer without sacrificing yields or performance (which brings a slew of challenges of its own) or reduce the gap between layers as well as shrink the base layer.
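
A crude back-of-envelope, using the ~720 micron figure above and ignoring the base die, bumps and mold, shows why thinner dies and tighter gaps are needed:

```python
# Very rough height budget per DRAM die, assuming the ~720 micron stack
# height cited for Samsung is split evenly across the DRAM layers
# (ignores the base die, micro-bumps and mold for simplicity).

STACK_HEIGHT_UM = 720

for layers in (8, 12):
    print(f"{layers}-Hi: ~{STACK_HEIGHT_UM / layers:.0f} um per die, including the gap")
# 8-Hi:  ~90 um per die, including the gap
# 12-Hi: ~60 um per die, including the gap
```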

SK Hynix says that to build its 12-Hi HBM3 product with the same height as its 8-Hi HBM3 device, it used its Advanced Mass Reflow Molded Underfill (MR-MUF) encapsulation technology, which involves mass reflow (MR) chip attach to shrink the base layer and molded underfill (MUF) processes to reduce die-to-die spacing.

SK Hynix has delivered samples of its 24GB HBM3 product to several customers. The product is currently undergoing performance evaluation and is set to enter mass production in the second half of the year.

Anton Shilov
Freelance News Writer

Anton Shilov is a Freelance News Writer at Tom’s Hardware US. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • bit_user
    Okay, so it sounds like 12 is the upper-limit on HBM height. At least, for this generation of processors and memories.
  • sickbrains
    Wait, how large is a single module? Could you potentially replace 2GB GDDR6X dies with these HBM3 dies? That would really disrupt the market segmentation going on right now, with 24GB being reserved for halo products and more for workstations. Imagine a 6090 with 8 stacks of these HBM3 modules! Or more likely a 9900XTX, since AMD already used HBM2 in earlier GPUs.

    Or is my understanding flawed and these HBM3 modules have to actually be on the GPU Die? Which means only 1 maybe 2 modules per GPU Die?
  • bit_user
    sickbrains said:
    Or is my understanding flawed and these HBM3 modules have to actually be on the GPU Die? Which means only 1 maybe 2 modules per GPU Die?
    Yeah, they have to be in the same package as the GPU die. The article mentions the interface per stack is 1024 data bits, which it's only feasible to route & drive through an interposer. That compares with 32 bits per GDDR6 chip. However, it runs at a much lower frequency and you don't have as many stacks as you typically have GDDR6 chips. So, it's only like 3-5 times the bandwidth, rather than 32x.

    Cost-wise, HBM3 is currently much more expensive than the same capacity of GDDR6.
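
A quick sketch of the arithmetic behind the 3-5x estimate above; the per-pin rates (21 Gbps for GDDR6X, 6.4 Gbps for HBM3) and the 384-bit GDDR6X comparison card are assumptions for illustration:

```python
# Board-level bandwidth comparison sketch. Per-pin rates and the 384-bit
# GDDR6X configuration are assumptions for illustration.

def bandwidth_gbs(bus_bits, gbit_per_pin):
    """Aggregate bandwidth in GB/s for a given bus width and per-pin rate."""
    return bus_bits * gbit_per_pin / 8

gddr6x_card = bandwidth_gbs(384, 21)        # 12 x 32-bit GDDR6X chips
hbm3_low = bandwidth_gbs(4 * 1024, 6.4)     # 4 HBM3 stacks
hbm3_high = bandwidth_gbs(6 * 1024, 6.4)    # 6 HBM3 stacks

print(f"384-bit GDDR6X card: {gddr6x_card:.0f} GB/s")   # ~1008 GB/s
print(f"4-stack HBM3: {hbm3_low:.0f} GB/s")             # ~3277 GB/s
print(f"6-stack HBM3: {hbm3_high:.0f} GB/s")            # ~4915 GB/s
print(f"ratio: {hbm3_low / gddr6x_card:.1f}x to {hbm3_high / gddr6x_card:.1f}x")
```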
  • Lucky_SLS
    With how things are going, I am seeing that consumer versions get the GDDR6X treatment and professional GPUs just have the GDDR6X replaced with HBM3 for higher bandwidth and VRAM.

    I don't expect to see HBM in consumer space anytime in the near future...
  • bit_user
    Lucky_SLS said:
    With how things are going, I am seeing that consumer versions get the GDDR6X treatment and professional GPUs just have the GDDR6X replaced with HBM3 for higher bandwidth and VRAM.
    Nothing about it is a drop-in replacement, though. The memory controllers are very different, between the two. It's just one of many things that differentiate the AI/HPC processors from their rendering-oriented GPU cousins.
  • InvalidError
    Lucky_SLS said:
    I don't expect to see HBM in consumer space anytime in the near future...
    Necessity will probably bring HBM or something HBM-like with fewer or narrower channels to the GPU and CPU consumer space within the next five years. It'll be the only practical way to meet bandwidth requirements without stupidly high external memory bus power and related PCB costs.

    bit_user said:
    Nothing about it is a drop-in replacement, though. The memory controllers are very different, between the two. It's just one of many things that differentiate the AI/HPC processors from their rendering-oriented GPU cousins.
    The HBM interface is very similar to DDR5 apart from having separate access to RAS and CAS lines, an optional half-row activation feature if you want to split each sub-channel into two more semi-independent channels, only one DQS per 32bits and no bus termination at either end, which I'd say makes HBM simpler overall.

    The only genuinely problematic difference IMO is needing eight of those slightly modified DDR5 controllers per stack.
  • Lucky_SLS
    InvalidError said:
    Necessity will probably bring HBM or something HBM-like with fewer or narrower channels to the GPU and CPU consumer space within the next five years. It'll be the only practical way to meet bandwidth requirements without stupidly high external memory bus power and related PCB costs.


    The HBM interface is very similar to DDR5 apart from having separate access to RAS and CAS lines, an optional half-row activation feature if you want to split each sub-channel into two more semi-independent channels, only one DQS per 32bits and no bus termination at either end, which I'd say makes HBM simpler overall.

    The only genuinely problematic difference IMO is needing eight of those slightly modified DDR5 controllers per stack.


    I highly doubt that will be the case. If they do that, it would be for the halo models.

    Take a look at the 3070 and the 4070, for example: they reduced the memory bus from 256-bit to 192-bit. The throughput remained about the same, at 448 and 504 GB/s, by going from GDDR6 to GDDR6X.
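
A quick sketch of the numbers in the post above; the per-pin rates (14 Gbps GDDR6, 21 Gbps GDDR6X) are the commonly quoted ones for those cards:

```python
# Bus width times per-pin data rate: the narrower 4070 bus is offset by
# faster memory (14 Gbps GDDR6 on the 3070 vs 21 Gbps GDDR6X on the 4070).

cards = {
    "RTX 3070, 256-bit GDDR6":  (256, 14),
    "RTX 4070, 192-bit GDDR6X": (192, 21),
}

for name, (bus_bits, gbps_per_pin) in cards.items():
    print(f"{name}: {bus_bits * gbps_per_pin / 8:.0f} GB/s")
# RTX 3070, 256-bit GDDR6:  448 GB/s
# RTX 4070, 192-bit GDDR6X: 504 GB/s
```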
  • bit_user
    InvalidError said:
    The HBM interface is very similar to DDR5 apart from having separate access to RAS and CAS lines, an optional half-row activation feature if you want to split each sub-channel into two more semi-independent channels, only one DQS per 32bits and no bus termination at either end, which I'd say makes HBM simpler overall.
    At some level, DRAM is DRAM. I get it. But, HBM3 has 32-bit sub-channels, which means you have 32 of those per stack, rather than the 2 that you get per DDR5 DIMM. So, that's a pretty big deal, and not something you can just gloss over.

    Another key difference, it seems to me, is the drivers you'd need for communicating with DIMMs. As you point out, the power requirements are very different.

    Finally, since the interface of HBM3 runs at a much lower frequency, perhaps certain aspects of the memory controller could be simplified and made more efficient.
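
A sketch of the channel counts behind the last two posts, assuming the commonly cited widths (128-bit legacy HBM channels, 64-bit HBM3 channels split into 32-bit pseudo-channels; the DDR5 figure counts data bits only):

```python
# Channel-count arithmetic for a single 1024-bit HBM stack versus a DDR5
# DIMM. Channel widths are the commonly cited ones; the DDR5 sub-channel
# width here counts data bits only (ECC bits ignored).

STACK_BITS = 1024

print(STACK_BITS // 128)   # 8  -> legacy 128-bit HBM channels per stack
print(STACK_BITS // 64)    # 16 -> 64-bit HBM3 channels per stack
print(STACK_BITS // 32)    # 32 -> 32-bit HBM3 pseudo-channels per stack

DDR5_DIMM_DATA_BITS = 64
print(DDR5_DIMM_DATA_BITS // 32)   # 2 -> 32-bit sub-channels per DIMM
```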
  • bit_user
    Lucky_SLS said:
    Take a look at the 3070 and the 4070, for example: they reduced the memory bus from 256-bit to 192-bit. The throughput remained about the same, at 448 and 504 GB/s, by going from GDDR6 to GDDR6X.
    They also increased the amount of L2 cache by about 10x. We saw how much Infinity Cache helped RDNA2, so it's a similar idea.

    What's interesting is that I'm guessing you think the RTX 3070 needed that entire 448 GB/s of bandwidth. We don't know that, however. GPUs sometimes have more memory channels just as a way to reach a certain memory capacity. The memory chips they used for the RTX 4070 each have twice the capacity. If they'd kept the width at 256-bits, then it would've either stayed at 8 GB or gone all the way up to 16 GB. And 16 GB would've made it more expensive.
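
A sketch of the capacity arithmetic in the post above, assuming one 32-bit channel per GDDR6/6X chip and the 1 GB and 2 GB chip densities mentioned:

```python
# Capacity sketch: each GDDR6/6X chip has a 32-bit interface, so the bus
# width fixes the chip count, and chip density then fixes total capacity.

def config(bus_bits, gb_per_chip):
    chips = bus_bits // 32          # one 32-bit channel per chip
    return chips, chips * gb_per_chip

print(config(256, 1))   # (8, 8)   -> 256-bit with 1 GB chips: 8 GB
print(config(256, 2))   # (8, 16)  -> 256-bit with 2 GB chips: 16 GB
print(config(192, 2))   # (6, 12)  -> 192-bit with 2 GB chips: 12 GB
```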
  • InvalidError
    Lucky_SLS said:
    I highly doubt that will be the case. If they do that, it would be for the halo models.
    HBM is still fundamentally the same technology as any other DRAM. The main reason it is more expensive is relatively low volume production. All that would be necessary to bring the price down is for GPU and DRAM manufacturers to coordinate a hard switch.

    It isn't much different from how next-gen DDRx is prohibitively expensive for the first 2-3 years from initial launch, then prices start dropping as more of the mainstream commits to next-gen stuff until the new stuff becomes the new obvious leader on price-performance.

    bit_user said:
    At some level, DRAM is DRAM. I get it. But, HBM3 has 32-bit sub-channels, which means you have 32 of those per stack, rather than the 2 that you get per DDR5 DIMM. So, that's a pretty big deal, and not something you can just gloss over.
    Nothing forces you to deploy independent memory controllers all the way down to the finest sub-banking option. You can operate each chip in the stack as a single 128-bit-wide channel too, and you can use stacks with fewer than eight chips if you don't need the largest capacity configuration.

    The most logical thing to do in the future is to buy raw HBM DRAM stacks and supply your own base die to adapt the stack whichever way you need. Then you can mux a full stack through a single 128-bit interface if all you want is the 8-32GB single-stack capacity. Doubly so if your design already splits out its memory controllers into chiplets/tiles like AMD is doing with the RX 7800-7900: you get your custom base die for the cost of the TSVs needed to attach the raw HBM stack to extra silicon that is already designed in.

    With more advanced chiplet/tile based designs featuring active interposers, which will likely be commonplace three years from now, the HBM base die functions could also be baked directly into the active interposers themselves.