DeepSeek brings disruption to AI-optimized parallel file systems, releases powerful new open-source Fire-Flyer File System

(Image credit: Google)

DeepSeek AI has made its Fire-Flyer Fire System (3FS) parallel file system fully open-source this week, as part of its Open Source Week event. The disruptive AI company from China brags that 3FS can hit 7.3 TB/s aggregate read throughput in its own server data clusters, where DeepSeek has been using 3FS to organize its servers since at least 2019.

3FS is a Linux-based parallel file system designed for use in AI-HPC operations, where many data storage servers are being constantly accessed by GPU nodes for training LLMs. 3FS is unique from other file systems thanks largely to its almost singular prioritization of random read speeds above all else, and almost completely ignoring read caching.

When training AI models, compute units need to access random training data constantly, and reading this data is a one-time-only process. Therefore, a read cache is nearly useless and is largely done away with by 3FS. In fact, using the read cache when training LLMs may be potentially harmful; as LLMs are basically just super-tuned inference machines, reading the same data in the same order repeatedly has the potential to link completely different data as a set to the language model.

The team responsible for operating one of DeepSeek's deep learning clusters, Fire-Flyer 2, published this paper last August outlining using 3FS in the custom-built system. In Fire-Flyer 2, DeepSeek utilized 180 storage nodes, each loaded with 16 16TB SSDs and two 200Gbps NUCs. These nodes served 10,000 PCIe Nvidia A100 GPUs, built out in much cheaper servers than Nvidia's proprietary DGX-A100 products.

TOPICS

Sunny Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Sunny has a handle on all the latest tech news.

20 Comments Comment from the forums

bit_user

I'd be pretty surprised if OpenAI, Google, and others hadn't also done a lot of optimization in this area. I wonder what they're using. I'm also not sure Ceph is the best point of comparison, but distributed filesystems are an area I really know very little about.

I am reminded of the substantial kernel optimizations Facebook/Meta contributed on storage I/O. At the time, I didn't connect this to AI, but perhaps that was among the driving factors:
https://www.phoronix.com/news/Linux-14M-IOPS-Per-Core (2022-03-23)
https://www.phoronix.com/news/Linux-Caching-Time-Block-IO (2024-01-06)
https://www.phoronix.com/news/BFQ-IO-Better-Scalability (2024-01-21)
https://www.phoronix.com/news/Linux-6.10-IO_uring (2024-05-12)
https://www.phoronix.com/news/Uncached-Buffered-IO-2024 (2024-11-13)
In particular, the RWF_UNCACHED optimization seems relevant, here. I would also point out that the above optimizations are orthogonal to what you do at the distributed filesystem layer and it's likely DeepSeek took advantage of at least some of these.

Kudos to DeepSeek for releasing and publicizing their solution. Let's hope we see more of this.
Reply
qxp

Indeed, there are other similar filesystems like BeeGFS

https://www.beegfs.io/c/
I think the remarks in the article that the filesystem will be popular with many users are a little overoptimistic. The cacheless nature only works when you have very fast network and you are still stuck with latency penalty compared to caching data in RAM.

The aggregate 7.7 TB/s speed is very nice for a cluster, but newer CPUs and existing GPUs can easily achieve TB/s memory speed access. Thus, for a cluster, some caching has to happen somewhere or the filesystem will be overwhelmed. In case of DeepSeek 3FS works well because each piece of data needs a lot of operations to process, but this is not so for every workload.

What would be interesting is to see the improvements in 3FS and BeeGFS incorporated into regular NFS, so you can have the best of both worlds - caching when you are reusing the data and fast random access.
Reply
bit_user

qxp said:
Indeed, there are other similar filesystems, like BeeGFS

https://www.beegfs.io/c/
Thanks!

qxp said:
I think the remarks in the article that the filesystem will be popular with many users are a little overoptimistic. The cache less nature only works when you have very fast network and you are still stuck with latency penalty compared to caching data in RAM.
Caching wastes memory, if you're doing reads with no reuse. Cacheless reads can potentially save overhead in the kernel, because it doesn't continually need to find entries in the block cache to evict, although I'm not sure how much overhead that actually causes.

qxp said:
The aggregate 7.7 TB/s speed is very nice for a cluster, but newer CPUs and existing GPUs can easily achieve TB/s memory speed access.
Well, GPUs are limited in persistent or network data access by PCIe speeds, which currently top out at 64 GB/s for a x16 connection. That said, Nvidia has been doing a lot with NVLink and Infiniband (as well as UltraEthernet?), so it's possible that's the avenue over which the data arrives.

qxp said:
What would be interesting is to see the improvements in 3FS and BeeGFS incorporated into regular NFS, so you can have the best of both worlds - caching when you are reusing the data and fast random access.
I think that's not in the cards. NFS is very much about centralized storage and point-to-point reads & writes. For its maintainers and key users, I'd expect simplicity and reliability are far more important. Scalable, parallel, network-based filesystems have been around for a long time, yet NFS hasn't really veered into such territory.
Reply
qxp

bit_user said:
Caching wastes memory, if you're doing reads with no reuse. Cacheless reads can potentially save overhead in the kernel, because it doesn't continually need to find entries in the block cache to evict, although I'm not sure how much overhead that actually causes.
I think it depends a lot on the algorithm. If you are limited by the overhead in the kernel, then it means you are limited by the bandwidth to the networked data, and not by the compute your CPUs and GPUs can provide.

I would personally consider such situation unsatisfactory and would try to change the algorithm so it becomes compute, or, at least, memory bandwidth limited.

The way DeepSeek and others are running they need to stream a large amount of data through compute and the filesystem optimizes this streaming part. Even then they have to cache some data - otherwise it would vanish before you can use it. BeeGFS does have a cache but it is limited and it does its own cache eviction thus avoiding that kernel overhead.

For general applications, however, the cacheless or small-cache nature is a problem, because those general applications usually have some measure of locality, hitting the same data repeatedly.

Obviously, a memory mapped database such MariaDB or parquet or RMVL would want to be cached or you hit a bottleneck.

But even for LLM applications you want caching - llama.cpp memory maps the weight tensors and have provisions to improve locality. This way if you have 600+ GB Deepseek on your SSD you can still run it with useable speed on less than half a terabyte of RAM. This would not work if the tensors were served using 3FS or BeeGFS.

bit_user said:
Well, GPUs are limited in persistent or network data access by PCIe speeds, which currently top out at 64 GB/s for a x16 connection. That said, Nvidia has been doing a lot with NVLink and Infiniband (as well as UltraEthernet?), so it's possible that's the avenue over which the data arrives.
I was thinking more in terms of systems with unified memory. Ryzen AI MAX+ is an example of such in consumer space, Xeon Max and newer Xeon 9xxx have also much larger memory bandwidth, same goes for newer Epycs. The fact that NVidia GPUs have to suck data through a straw is their big weakness, one they tried to fix through dedicated server architecture.

bit_user said:
I think that's not in the cards. NFS is very much about centralized storage and point-to-point reads & writes. For its maintainers and key users, I'd expect simplicity and reliability are far more important. Scalable, parallel, network-based filesystems have been around for a long time, yet NFS hasn't really veered into such territory.
You are right, it does depends on the preferences of the maintainers. So either they decide to add features to let NFS scale up similar to network filesystems, or some of those network filesystems will add features (like proper caching) to compete with NFS. Not sure what is easier.
Reply
klatte42

“In Fire-Flyer 2, DeepSeek utilized 180 storage nodes, each loaded with 16 16TB SSDs and two 200Gbps NUCs.”

Just to be sure, should “NUC” be “NIC”? If not, I’d love to understand the setup better.
Reply
JRStern

Caching allows readahead.
You ask for and receive the first 4k, but the transfer was 64k, and the track (remember tracks?) was 2mb, and the cylinder even moreso. Of course this matters far less with SSDs, I don't even know what the current structures look like, even the flash chips/modules/controllers may have the cache, etc.
Reply
bit_user

JRStern said:
Caching allows readahead.
You ask for and receive the first 4k, but the transfer was 64k,
An optimized app doesn't need the kernel to do readahead for it. You can request data in large enough chunks to be efficient and maintain a queue of multiple outstanding reads via io_uring.

JRStern said:
Of course this matters far less with SSDs, I don't even know what the current structures look like, even the flash chips/modules/controllers may have the cache, etc.
Modern SSDs that include DRAM seem to have about 1 GB per couple TB, I think. They can also buffer directly in host memory. For the most part, this memory seems to be used for optimizing writes and caching FTL (Flash Translation Layer) structures, but I can't say the drives aren't doing any read ahead. If we found a good benchmark comparing QD=1 sequential 4k vs. QD=1 random 4k reads, it'd be pretty clear if the drive were doing it (I'd expect so).

Even so, modern kernels still default to some modest amount of read-ahead (128 kB, IIRC), when doing normal I/O (i.e. not O_DIRECT).
Reply
jp7189

Pure random reads result in 100% cache misses which means read cache is pure overhead and waste.
Reply
qxp

jp7189 said:
Pure random reads result in 100% cache misses which means read cache is pure overhead and waste.
Actually no, you have to work to make that happen. For example, suppose your data is 10TB, while your memory is only 1TB. Then there is a 10% chance for a random read to hit data already in memory and you have a 10% speedup - nothing to sneer at. For many applications the chance you will want recently used data is higher and you get a higher speedup.

To make cache useless you need to engineer your code to always request data that is not in memory and in a random order so that readahead is useless too.
Reply
jp7189

qxp said:
Actually no, you have to work to make that happen. For example, suppose your data is 10TB, while your memory is only 1TB. Then there is a 10% chance for a random read to hit data already in memory and you have a 10% speedup - nothing to sneer at. For many applications the chance you will want recently used data is higher and you get a higher speedup.

To make cache useless you need to engineer your code to always request data that is not in memory and in a random order so that readahead is useless too.
You're thinking is both somehow too complex and simplistic all at the same time.

In this context and per the article, the application is engineered to be random. Therefore, darn close to 100% cache misses. It's no more complicated than that.

To your other point, you'd have to populate your theoretical 1TB of cache ahead of time, and that act would be insanely expensive (wasted hardware, energy, time) to have effectively (in your example) a 90% cache miss rate. Also, cache isn't infinitely faster than SSD, so 10% hit rate doesn't equal 10% performance improvement.

In the real world, cache gets populated by data that was previously pulled from SSD. In the training of LLMs, you pull data <generally> no more than once per epoch. By the time you get around to needing that data again, the cache would long since been evicted. So, in this respect, there is 0 benefit from cache. However, all that cache management isn't free. There are upfront system costs, energy, and wasted processing cycles. Therefore, the more cache, the more wasteful the system (in the context of the system this article is describing).
Reply

Show more comments