Tachyum on Tuesday said that it had submitted a bid to the Department of Energy to build a 20 exaflops supercomputer in 2025. The machine would be based on the company's next-generation Prodigy processors featuring a proprietary microarchitecture that can be used for different types of workloads.
The U.S. DoE wants a 20 exaflops supercomputer with a 20MW–60MW power consumption to be delivered by 2025. The system is set to be installed at Oak Ridge National Laboratory (ORNL) and will complement the lab's Frontier system that went online earlier this year.
Tachyum does not disclose which hardware it proposed to the DoE, saying only that it has its 128-core Prodigy processor today and a higher-performing Prodigy 2 processor on its roadmap. It is therefore safe to assume that by 2025 the company will have the latter on hand and could use it to address the upcoming system.
Tachyum's Prodigy is a universal homogeneous processor packing up to 128 proprietary 64-bit VLIW cores that feature two 1024-bit vector units per core and one 4096-bit matrix unit per core. Tachyum expected its flagship Prodigy T16128-AIX processor to offer up to 90 FP64 teraflops for HPC as well as up to 12 'AI petaflops' for AI inference and training (presumably when running INT8 or FP8 workloads). Prodigy consumes up to 950W and uses liquid cooling.
That was all before Tachyum sued Cadence, its intellectual property provider, for lower-than-expected performance of its Prodigy processor. We have no idea what the current performance expectations are for the chip.
In theory, Tachyum could power an exaflops system using over 11,000 of its Prodigy processors, though power consumption of such a machine would be gargantuan. Presumably, Prodigy 2 has a better chance to meet the needs of a next-generation exascale system than the original Prodigy.
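The figure checks out as a back-of-envelope calculation. Here is a sketch in Python using the claimed per-chip numbers above; note these are vendor peak figures, and interconnect, memory, and cooling overheads are ignored:

```python
# Rough scaling math using Tachyum's claimed figures
# (90 FP64 teraflops and 950 W per Prodigy T16128-AIX);
# these are vendor peak numbers, not measured results.
PEAK_TFLOPS = 90            # FP64 teraflops per processor
POWER_W = 950               # watts per processor

chips_per_exaflop = 1e6 / PEAK_TFLOPS          # 1 exaflop = 1e6 teraflops
power_mw = chips_per_exaflop * POWER_W / 1e6   # chip power alone, in MW

print(f"{chips_per_exaflop:,.0f} chips, {power_mw:.1f} MW per exaflop")
# → 11,111 chips, 10.6 MW per exaflop
```

Scaled to the DoE's 20 exaflops target, that works out to roughly 210 MW of chip power alone, far beyond the requested 20MW–60MW envelope, which is why a denser Prodigy 2 looks like the more plausible building block.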
There is currently one exaflops-class supercomputer in the U.S.: the 1.1 exaflops Frontier system at ORNL, based on AMD's 64-core EPYC CPUs and Instinct MI250X compute GPUs. Two more exascale systems are being built in the U.S.: the 2 exaflops Aurora machine powered by Intel's 4th Generation Xeon Scalable processors and Xe-HPC compute GPUs (aka Ponte Vecchio), and the ">2 exaflops" El Capitan supercomputer based on AMD's Zen 4 architecture EPYC CPUs and Instinct MI300 GPUs.
One of the interesting things about the DoE's supercomputing plans is that it now wants to upgrade its high-performance compute capabilities every 12–24 months rather than every 4–5 years. As a result, the DoE may become more open to adopting exotic architectures like Tachyum's Prodigy.
"We also wish to explore the development of an approach that moves away from monolithic acquisitions toward a model for enabling more rapid upgrade cycles of deployed systems, to enable faster innovation on hardware and software," a DoE document reads. "One possible strategy would include increased reuse of existing infrastructure so that the upgrades are modular. A goal would be to reimagine systems architecture and an efficient acquisition process that allows continuous injection of technological advances to a facility (e.g., every 12–24 months rather than every 4–5 years). Understanding the tradeoffs of these approaches is one goal of this RFI, and we invite responses to include perceived benefits and/or disadvantages of this modular upgrade approach."
One advantage Tachyum's Prodigy has over traditional CPUs and GPUs is that it is tailored for both AI and HPC, so it can run AI workloads when its HPC capabilities are idle and vice versa. The DoE may or may not adopt Tachyum for any of its upcoming supercomputers, but the company hopes to be awarded an appropriate contract.
Anton Shilov is a Freelance News Writer at Tom's Hardware US. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.
What about the software support? Surely the DoE does not want to rebuild its existing software stack to fully utilize the maximum capability of new hardware ... :unsure:
samopa said:
What about the software support? Surely the DoE does not want to rebuild its existing software stack to fully utilize the maximum capability of new hardware ... :unsure:
If their software stack is not linux and standard HPC libraries they're next level stupid. So it should just be a recompile and tuning (some of which will already have been done by the hardware vendor) to move to a new architecture.
I was very surprised to see the word VLIW in there.
That was all before Tachyum sued Cadence, its intellectual property provider, for lower-than-expected performance of its Prodigy processor.

Uh, that's not what the prior article said. I went back to check, and the only complaints it mentioned were basically lack of functional and timely delivery of the promised IP. And Tachyum simply said this forced them to source the IP from other suppliers, causing them schedule delays.
If that episode had any impact on performance, it sounds like it was just by delaying their product launch so it had to face newer products from its competitors.
One of the interesting things about the DoE's supercomputing plans is that from now on it wants to upgrade its high-performance compute capabilities every 12–24 months, not every 4–5 years.

I definitely like the idea of having an upgrade path, and maybe you don't build out the entire machine at once, but you either add nodes or replace older nodes on a periodic basis. If that's what they mean, then cool. Otherwise, an upgrade cycle of 12–24 months sounds incredibly wasteful.
DougMcC said:
If their software stack is not linux and standard HPC libraries they're next level stupid.

You might think so, but I know AMD put a tremendous amount of effort into their HiP stack for porting CUDA applications to run on ROCm, and that seemed driven by certain HPC contracts they had.
DougMcC said:
So it should just be a recompile and tuning (some of which will already have been done by the hardware vendor) to move to a new architecture.

When you're talking about such large machines, I think porting is a slightly more involved endeavor. Yeah, you can just use something like OpenMP and get a quick, easy speedup. However, if you really want your application to get a good speedup, you typically have to invest a lot more time & effort.
SunMaster said:
I was very surprised to see the word VLIW in there.

Oh, yeah ...no. VLIW is quite dominant in DSPs and therefore a lot of deep learning ASICs.
To get good scaling, you need to keep the cores small and efficient. That eliminates out-of-order execution from consideration. So, that limits us to in-order cores. There are two basic options for squeezing more performance out of in-order cores: VLIW and SIMD.
GPUs combine SIMD with SMT, in order to hide memory latency. You can certainly combine VLIW and SIMD. You can even combine VLIW with SMT, though it won't be as efficient as SMT is in other contexts.
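To make the SIMD half of that trade-off concrete, here is a minimal analogy in Python with NumPy (not Prodigy code; NumPy's elementwise kernels happen to be compiled with SIMD instructions, so a vectorized expression is a rough stand-in for what a wide vector unit does):

```python
import numpy as np

# SIMD exploits data parallelism: one instruction operates on many
# elements at once. A 1024-bit FP64 vector unit handles 16 doubles
# per instruction; the arrays below model one such 16-lane register.
a = np.arange(16, dtype=np.float64)   # lanes 0..15
b = np.ones(16, dtype=np.float64)
c = a * b + 2.0    # fused multiply-add pattern applied across all lanes

print(c[:4])       # → [2. 3. 4. 5.]
```

The point is that one "instruction" (the vectorized expression) does the work of a 16-iteration scalar loop, which is exactly the throughput trick that makes wide in-order cores viable.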
What I really wonder is whether they use a conventional cache hierarchy, or if their SRAM is directly addressable and software-managed. GPUs mostly started with the latter, but have gradually been embracing a more traditional cache hierarchy over time.
bit_user said:
Oh, yeah ...no. VLIW is quite dominant in DSPs and therefore a lot of deep learning ASICs.
I didn't know that, but it makes sense when the chip isn't running multiple applications or an OS.