SMI CEO claims Nvidia wants SSDs with 100 million IOPS — up to 33X performance uplift could eliminate AI GPU bottlenecks


Now that the AI industry has exceptionally high-performance GPUs with high-bandwidth memory (HBM), one of the bottlenecks that AI training and inference systems face is storage performance. To that end, Nvidia is working with partners to build SSDs that can hit random read performance of 100 million input/output operations per second (IOPS) in small-block workloads, according to Silicon Motion (SMI) CEO Wallace C. Kuo, who spoke with Tom's Hardware in an exclusive interview.

"Right now, they are aiming for 100 million IOPS — which is huge," Kuo told Tom's Hardware.

Modern AI accelerators, such as Nvidia's B200, feature HBM3E memory bandwidth of around 8 TB/s, which far exceeds what modern storage subsystems can offer in both throughput and latency. Modern PCIe 5.0 x4 SSDs top out at around 14.5 GB/s and deliver 2–3 million IOPS for both 4K and 512B random reads. Although 4K blocks are better suited for bandwidth, AI models typically perform small, random fetches, which makes 512B blocks a better fit for their latency-sensitive patterns. However, increasing the number of I/O operations per second by 33 times is hard, given the limitations of both SSD controllers and NAND memory.
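As a rough back-of-envelope check, the gap can be sketched from the figures cited above (a minimal sketch; the per-drive numbers are the approximate figures in this article, not vendor specifications):

```python
# Back-of-envelope math using the approximate figures cited above (not vendor specs).
HBM_BW_TBPS = 8.0                # B200-class HBM3E bandwidth, ~8 TB/s
SSD_SEQ_GBPS = 14.5              # PCIe 5.0 x4 SSD sequential read ceiling, ~14.5 GB/s
SSD_RAND_IOPS = 3_000_000        # upper end of today's ~2-3M random-read IOPS
TARGET_IOPS = 100_000_000        # the reported 100M IOPS goal
BLOCK_BYTES = 512                # small-block reads favored by AI fetch patterns

uplift = TARGET_IOPS / SSD_RAND_IOPS
target_bw_gbps = TARGET_IOPS * BLOCK_BYTES / 1e9

print(f"Required uplift over a single drive: ~{uplift:.0f}x")                 # ~33x
print(f"100M IOPS at 512B works out to ~{target_bw_gbps:.1f} GB/s of reads")  # ~51 GB/s
print(f"That exceeds one Gen5 x4 link (~{SSD_SEQ_GBPS} GB/s) by ~{target_bw_gbps / SSD_SEQ_GBPS:.1f}x")
print(f"HBM still outpaces it by ~{HBM_BW_TBPS * 1000 / target_bw_gbps:.0f}x")
```

In other words, even at tiny 512B blocks, 100 million IOPS implies more random-read bandwidth than a single PCIe 5.0 x4 drive can move sequentially today, while HBM remains two orders of magnitude faster still.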

In fact, Kioxia is already working on an 'AI SSD' based on its XL-Flash memory designed to surpass 10 million IOPS in 512B random reads. The company currently plans to release this drive during the second half of next year, possibly to coincide with the rollout of Nvidia's Vera Rubin platform. To get to 100 million IOPS, one might use multiple such 'AI SSDs.'
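A hypothetical tally shows why aggregating drives is plausible on paper (the scaling-efficiency figures below are illustrative assumptions, not measurements):

```python
import math

# Hypothetical aggregation math: how many ~10M-IOPS "AI SSDs" a host would need
# to present 100M IOPS, at assumed (not measured) striping/software efficiencies.
PER_DRIVE_IOPS = 10_000_000
TARGET_IOPS = 100_000_000

for scaling_efficiency in (1.0, 0.8, 0.6):
    drives = math.ceil(TARGET_IOPS / (PER_DRIVE_IOPS * scaling_efficiency))
    print(f"{scaling_efficiency:.0%} scaling -> {drives} drives")  # 10, 13, 17
```

Perfect scaling would need roughly ten such drives per target; any realistic overhead in the I/O stack pushes the count, cost, and power higher, which is the crux of Kuo's argument for a media change rather than brute-force aggregation.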

However, the head of SMI believes that achieving 100 million IOPS on a single drive featuring conventional NAND with decent cost and power consumption will be extremely hard, so a new type of memory might be needed.

"I believe they are looking for a media change," said Kuo. "Optane was supposed to be the ideal solution, but it is gone now. Kioxia is trying to bring XL-NAND and improve its performance. SanDisk is trying to introduce High Bandwidth Flash (HBF), but honestly, I don't really believe in it. Right now, everyone is promoting their own technology, but the industry really needs something fundamentally new. Otherwise, it will be very hard to achieve 100 million IOPS and still be cost-effective."

Currently, many companies, including Micron and SanDisk, are developing new types of non-volatile memory. However, when these new types of memory will be commercially viable is something that even the head of Silicon Motion is not sure about.


Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • John Nemesh
    Well, I want an affordable gaming GPU that doesn't suck...I guess we all can't get what we want now, can we?
    Reply
  • Notton
    Nvidia should buy Optane from Intel then.
    Reply
  • hotaru251
    Notton said:
    Nvidia should buy Optane from Intel then.
    pretty sure they sold that off ages ago, didn't they??
    Reply
  • Notton
    hotaru251 said:
    pretty sure they sold that off ages ago, didn't they??
    No, Intel only sold off their NAND division to SK Hynix.
    I'm pretty sure Intel still owns the IP rights to Optane.
    Reply
  • usertests
    Notton said:
    Nvidia should buy Optane from Intel then.
    Was high IOPS one of the benefits of Optane? I thought it was mostly about latency, low queue depth performance, and cost-per-bit being less than DRAM.

    There have been companies searching/working on would-be NAND and DRAM replacements for decades. If the hundreds of billions flowing into AI gets one of those technologies past the vaporware stage, that could have immense benefits for everyone.

    We don't even need a universal memory necessarily. You could kick NAND to the curb if you could match/beat it at some combination of latency, performance and endurance (which suffer as you go to QLC and beyond), and density/cost. Cost can be higher but fall as production scales up.
    Reply
  • JRStern
    Notton said:
    Nvidia should buy Optane from Intel then.
    Probably still find it at a Cupertino Goodwill store for a dollar.
    Reply
  • JRStern
    Is this really so hard? I mean, to fake? Get thirty-three slower drives, and a boatload of DRAM for buffers, and a pool of processors, and a little hack code to assure transaction consistency, and there you are. Sounds like a Google interview question.
    Reply
  • JRStern
    usertests said:
    Was high IOPS one of the benefits of Optane?
    read yes, down to bit or byte level.
    write, not so much, slow and power hungry and overheated chip.

    No flash SSD is going to enjoy being written for hours at ludicrous speed, either, but that shouldn't be a problem I don't think, need major clarification of the requirements on that point.
    Reply