Kioxia works with Nvidia to prep XL-Flash SSD that's 3x faster than any SSD available — 10 million IOPS drive has peer-to-peer GPU connectivity for AI servers

Kioxia aims to change the storage paradigm with a proposed SSD designed to surpass 10 million input/output operations per second (IOPS) in small-block workloads, the company revealed at its Corporate Strategy Meeting earlier this week. That is roughly three times the peak random-read rate of many modern SSDs. The company also says it will collaborate with Nvidia on the design.

One of the performance bottlenecks in modern AI servers is data transfer between storage and GPUs: data is currently routed through the CPU, which significantly increases latency and extends access times.

To reach the performance target, Kioxia is designing a new controller specifically tuned to maximize IOPS (beyond 10 million IOPS at 512B) so that GPUs can access data fast enough to keep their cores fully utilized. The proposed Kioxia 'AI SSD' is set to use the company's single-level cell (SLC) XL-Flash memory, which boasts read latencies in the range of 3 to 5 microseconds, significantly lower than the 40 to 100 microseconds offered by SSDs based on conventional 3D NAND. Additionally, by storing one bit per cell, SLC offers faster access times and greater endurance, attributes that are crucial for demanding AI workloads.
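The link between read latency and the IOPS target can be made concrete with Little's law (mean requests in flight = throughput × latency). The sketch below is a back-of-the-envelope estimate, not anything Kioxia has published: the function name is hypothetical, and it assumes the quoted media read latencies stand in for end-to-end latency, ignoring controller, interface, and software overhead.

```python
def outstanding_ios(iops: int, latency_us: float) -> float:
    """Little's law: mean requests in flight = throughput x latency.

    iops        -- sustained operations per second
    latency_us  -- per-operation latency in microseconds (assumed end-to-end)
    """
    return iops * latency_us / 1_000_000


TARGET_IOPS = 10_000_000  # Kioxia's stated target

# Latency figures quoted in the article; treating them as end-to-end
# latency is an assumption made only for this illustration.
cases = [
    ("XL-Flash, 3 us", 3),
    ("XL-Flash, 5 us", 5),
    ("3D NAND, 40 us", 40),
    ("3D NAND, 100 us", 100),
]

for label, latency_us in cases:
    in_flight = outstanding_ios(TARGET_IOPS, latency_us)
    print(f"{label:>16}: {in_flight:.0f} reads in flight")
```

Under those assumptions, sustaining 10 million IOPS takes roughly 30 to 50 reads in flight at XL-Flash latencies, versus 400 to 1,000 at conventional 3D NAND latencies, which is one way to see why low-latency media makes the target easier to reach.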

Current high-end datacenter SSDs typically achieve 2 to 3 million IOPS for both 4K and 512-byte random read operations. From a bandwidth perspective, 4K blocks make far more sense than 512B blocks, since each operation moves eight times as much data. However, large language models (LLMs) and retrieval-augmented generation (RAG) systems typically perform small, random accesses to fetch embeddings, parameters, or knowledge base entries. In these scenarios, small block sizes such as 512B are more representative of actual application behavior than 4K or larger blocks. It therefore makes more sense to use 512B blocks to meet the latency needs of LLMs and RAG systems, and to use multiple drives for bandwidth. Using smaller blocks could also enable more efficient use of memory semantics for access.

Notably, Kioxia has not disclosed which host interface its 'AI SSD' will use, although it does not appear to require a PCIe 6.0 interface from a bandwidth perspective: 10 million 512-byte reads per second amounts to only about 5.1 GB/s.
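The bandwidth argument is simple arithmetic: throughput is IOPS multiplied by block size. A minimal sketch follows; the function names are hypothetical, and the PCIe 5.0 x4 comparison is an illustrative assumption (the article does not state a lane count or PCIe generation), using the raw line rate after 128b/130b encoding and ignoring protocol overhead.

```python
def bandwidth_gb_s(iops: int, block_bytes: int) -> float:
    """Data rate in GB/s (10^9 bytes per second) for a given IOPS and block size."""
    return iops * block_bytes / 1e9


def pcie5_x4_raw_gb_s() -> float:
    """Raw PCIe 5.0 x4 rate: 32 GT/s per lane, 4 lanes, 128b/130b encoding.

    The x4 link width is an assumption for illustration; packet and
    protocol overhead are ignored, so real throughput is lower.
    """
    transfers_per_s = 32e9
    lanes = 4
    encoding_efficiency = 128 / 130
    return transfers_per_s * lanes * encoding_efficiency / 8 / 1e9


workloads = [
    ("10M IOPS @ 512B (Kioxia target)", 10_000_000, 512),
    ("3M IOPS @ 4K (current high end)", 3_000_000, 4096),
    ("10M IOPS @ 4K (hypothetical)", 10_000_000, 4096),
]

link = pcie5_x4_raw_gb_s()
for label, iops, block in workloads:
    gb_s = bandwidth_gb_s(iops, block)
    verdict = "fits" if gb_s <= link else "exceeds"
    print(f"{label}: {gb_s:.2f} GB/s ({verdict} PCIe 5.0 x4 raw {link:.2f} GB/s)")
```

On these numbers, 10 million 512B reads per second come to 5.12 GB/s, less than the roughly 12.3 GB/s of 3 million 4K reads, so the IOPS target by itself does not call for a faster link.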

The 'AI SSD' from Kioxia will also be optimized for peer-to-peer communications between the GPU and SSD, bypassing the CPU for extra performance and lower latency. This is another reason why Kioxia (well, and Nvidia) plan to use 512B blocks: GPUs typically operate on cache lines of 32, 64, or 128 bytes internally, and their memory subsystems are optimized for burst access to many small, independent memory locations to keep all the stream processors busy. Reads of 512 bytes therefore align better with GPU designs.

Kioxia's 'AI SSD' is designed to support AI training setups where large language models (LLMs) require fast, repeated access to massive datasets. Also, Kioxia envisions it being deployed in AI inference applications, particularly in systems that employ retrieval-augmented generation techniques to enhance generative AI outputs with real-time data (i.e., for reasoning). Low-latency, high-bandwidth storage access is crucial for such machines to ensure both low response times and efficient GPU utilization.

The Kioxia 'AI SSD' is scheduled for release in the second half of 2026 and will be developed in collaboration with the world’s largest GPU manufacturer (read: Nvidia).

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • Pierce2623
    I’m assuming using SLC will be a big bottleneck on density unless this uses larger stacks than we’ve ever seen…
  • Li Ken-un
    To add to this, I’ve benchmarked the Intel Optane P5800X (3.2 TB) with FIO and it’s capable of hitting 5 million IOps with 512-byte random reads. When the NAND SSDs hit with 10 million IOps and “read latencies in the range of 3 to 5 microseconds,” that’ll finally lay any doubts to rest that Optane will have been ancient technology. Though that’ll be technology released in 2026 versus technology that was released in 2022, four years will have been a remarkably small gap given how far ahead Optane was in 2017 when it was first available.

    Pierce2623 said:
    I’m assuming using SLC will be a big bottleneck on density unless this uses larger stacks than we’ve ever seen…
    Given 128 TB-class QLC SSDs today, they could already do 32 TB SLC SSDs, but chose not to (largest capacity available currently being 3.2 TB). 32 TB would dwarf the amount of DRAM you could stuff into a server.

    It’s also an Iron Triangle problem here. Pick any two: speed, latency, or density. Even Optane never had more than one bit per cell.