SK Hynix has announced that it has developed the industry's first 12-layer, 24GB HBM3 memory stacks, which offer both high density and an extreme bandwidth of 819.2 GB/s. The 12-Hi HBM3 products maintain the same height as the company's existing 8-Hi HBM3 products, making them easy to deploy.
SK Hynix's 24GB HBM3 known good stack die (KGSD) product stacks twelve 16Gb memory devices, connected using through-silicon vias (TSVs), on a base layer with a 1024-bit interface. The device features a 6400 MT/s data transfer rate, so the whole 24GB HBM3 module offers 819.2 GB/s of bandwidth.
Depending on the actual memory subsystem, such modules can enable 3.28 TB/s – 4.92 TB/s of bandwidth: 96GB of memory over a 4096-bit interface or 144GB of memory over a 6144-bit interface, respectively. To put the numbers into context, Nvidia's H100 NVL — the most advanced HBM3 implementation to date — features 96GB of memory with 3.9 TB/s of bandwidth for each of its two GH100 compute GPUs.
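The bandwidth figures above follow directly from the transfer rate and bus width. A minimal sanity-check sketch (the 6400 MT/s and 1024-bit-per-stack numbers come from the article; the arithmetic is ours):

```python
# Peak bandwidth = transfer rate (MT/s) * bus width (bits) / 8 bits per byte.
def hbm3_bandwidth_gbs(transfer_mts=6400, bus_width_bits=1024, stacks=1):
    """Peak bandwidth in GB/s for one or more HBM3 stacks."""
    return transfer_mts * 1e6 * bus_width_bits * stacks / 8 / 1e9

print(hbm3_bandwidth_gbs())          # one 24GB stack: 819.2 GB/s
print(hbm3_bandwidth_gbs(stacks=4))  # 4096-bit / 96GB: ~3276.8 GB/s
print(hbm3_bandwidth_gbs(stacks=6))  # 6144-bit / 144GB: ~4915.2 GB/s
```

Six stacks of 24GB each also account for the 144GB capacity figure.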
Stacking 12 layers of HBM DRAM on top of each other is challenging for several reasons. Firstly, it is hard to drill some 60,000 or more TSV holes through the package to connect all twelve layers. Secondly, 12-Hi HBM DRAM packages cannot be physically taller than 8-Hi HBM KGSDs (typically 700 – 800 microns; 720 microns in the case of Samsung), so it is extremely complicated (if possible at all) to install such HBM3 KGSDs next to a CPU or GPU that has a fixed height. To that end, DRAM makers like SK Hynix need to either reduce the thickness of each DRAM layer without sacrificing yields or performance (which brings a slew of challenges), or shrink both the gap between layers and the base layer.
SK Hynix says that to build its 12-Hi HBM3 product at the same height as its 8-Hi HBM3 device, it used its Advanced Mass Reflow Molded Underfill (MR-MUF) encapsulation technology, which combines mass reflow (MR) chip attach to shrink the base layer with molded underfill (MUF) processing to reduce die-to-die spacing.
SK Hynix has delivered samples of its 24GB HBM3 product to several eager customers. The product is currently undergoing performance evaluation and is set to enter mass production in the second half of the year.
Or is my understanding flawed, and these HBM3 modules actually have to be on the GPU die? Which would mean only 1, maybe 2, modules per GPU die?
Cost-wise, HBM3 is currently much more expensive than the same capacity of GDDR6.
I don't expect to see HBM in the consumer space anytime soon...
The HBM interface is very similar to DDR5, apart from having separate RAS and CAS lines, an optional half-row activation feature (if you want to split each sub-channel into two more semi-independent channels), only one DQS per 32 bits, and no bus termination at either end. I'd say that makes HBM simpler overall.
The only genuinely problematic difference IMO is needing eight of those slightly modified DDR5 controllers per stack.
I highly doubt that will be the case. If they do that, it would be for the halo models.
Take a look at the 3070 and the 4070, for example: they reduced the memory bus from 256-bit to 192-bit, yet the throughput remained about the same (448 vs 504 GB/s) by using GDDR6X instead of GDDR6.
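The comparison works out arithmetically. A quick sketch, assuming the commonly cited pin speeds of 14 Gbps for the 3070's GDDR6 and 21 Gbps for the 4070's GDDR6X:

```python
# Bandwidth = per-pin data rate (Gbps) * bus width (bits) / 8 bits per byte.
# The 14 and 21 Gbps pin speeds are assumptions, not from the thread.
def gddr_bandwidth_gbs(gbps_per_pin, bus_width_bits):
    return gbps_per_pin * bus_width_bits / 8

print(gddr_bandwidth_gbs(14, 256))  # RTX 3070, 256-bit GDDR6:  448.0 GB/s
print(gddr_bandwidth_gbs(21, 192))  # RTX 4070, 192-bit GDDR6X: 504.0 GB/s
```

The faster per-pin signaling of GDDR6X roughly cancels out the narrower bus.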
Another key difference, it seems to me, is the drivers you'd need for communicating with DIMMs. As you point out, the power requirements are very different.
Finally, since the interface of HBM3 runs at a much lower frequency, perhaps certain aspects of the memory controller could be simplified and made more efficient.
What's interesting is that you seem to assume the RTX 3070 needed that entire 448 GB/s of bandwidth. We don't know that, however. GPUs sometimes get more memory channels just as a way to reach a certain memory capacity. The memory chips used for the RTX 4070 each have twice the capacity, so if they'd kept the width at 256 bits, the card would've either stayed at 8 GB or gone all the way up to 16 GB, and 16 GB would've made it more expensive.
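The capacity argument can be made concrete. A sketch, assuming the usual 32-bit-wide GDDR6/6X chips and the per-chip densities implied above (1GB chips on the 3070, 2GB on the 4070):

```python
# Bus width fixes the chip count (each GDDR6/6X chip is 32 bits wide),
# and chip density then fixes total VRAM. Densities are illustrative.
def vram_gb(bus_width_bits, gb_per_chip, chip_width_bits=32):
    return (bus_width_bits // chip_width_bits) * gb_per_chip

print(vram_gb(256, 1))  # 3070-style: 8 chips x 1GB = 8 GB
print(vram_gb(192, 2))  # 4070-style: 6 chips x 2GB = 12 GB
print(vram_gb(256, 2))  # 256-bit with 2GB chips    = 16 GB
```

So 192-bit with denser chips lands at 12 GB, the middle option between 8 and 16 GB.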
It isn't much different from how next-gen DDRx is prohibitively expensive for the first 2-3 years from initial launch, then prices start dropping as more of the mainstream commits to next-gen stuff until the new stuff becomes the new obvious leader on price-performance.
Nothing forces you to deploy independent memory controllers all the way down to the finest sub-banking option. You can operate each chip in the stack as a single 128-bit-wide channel, and you can use stacks with fewer than eight chips if you don't need the largest capacity configuration.
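That counting can be sketched, assuming the one-128-bit-channel-per-chip model used in this thread (real HBM3 subdivides channels further, so this is a simplification):

```python
# Channel count and total bus width for one stack if each chip is driven
# as a single 128-bit channel. The per-chip width is the thread's model.
def stack_channels(chips=8, channel_width_bits=128):
    """Return (controller count, total bus width in bits) for one stack."""
    return chips, chips * channel_width_bits

print(stack_channels())         # full 8-high stack: 8 controllers, 1024 bits
print(stack_channels(chips=4))  # 4-high stack: 4 controllers, 512 bits
```

A shorter stack trades capacity and width for fewer controllers on the host side.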
The most logical thing to do in the future is to buy raw HBM DRAM stacks and supply your own base chips to adapt the stack however you need. Then you can mux a full stack through a single 128-bit interface if all you want is the 8-32GB single-stack capacity. Doubly so if your design already splits out its memory controllers into chiplets/tiles like AMD is doing with the RX 7800-7900: you get your custom base die for the cost of TSVs to attach the raw HBM stack to extra silicon that is already designed in.
With more advanced chiplet/tile based designs featuring active interposers, which will likely be commonplace three years from now, the HBM base die functions could also be baked directly into the active interposers themselves.