Tachyum Prodigy Chip Now Has 192 Universal Cores
More cores, larger die size, but still not taped out.
This week, Tachyum said that by using the latest electronic design automation (EDA) tools it has managed to squeeze 50% more cores into its Prodigy processor while increasing die size by only 20%. The 192-core chip does not seem to exist in silicon yet, and the company did not say when it plans to start sampling or shipping the processors to interested parties.
Last year Tachyum sued Cadence for providing IP that did not meet its expectations and had to switch to IP from another provider or providers. Because of this, it also had to change its RTL simulation and layout tools. The company did not disclose which EDA tools it uses for Prodigy development, but it claims that the new set of programs enabled it to tweak various parameters, resulting in a 50% increase in core count (from 128 to 192), an increase in L2/L3 cache from 128MB to 192MB, and a jump in SERDES from 64 to 96 per chip. Die size of the processor increased from 500 mm2 to 600 mm2, or by 20%.
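The percentages quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the before and after figures reported in this article; the variable and function names are illustrative, and the cores-per-area figure is a derived number rather than anything Tachyum has published.

```python
# Before/after figures as reported in the article: (old, new).
specs = {
    "cores": (128, 192),
    "l2_l3_cache_mb": (128, 192),
    "serdes": (64, 96),
    "die_area_mm2": (500, 600),
}


def pct_increase(old, new):
    """Return the relative increase from old to new, in percent."""
    return (new - old) * 100 / old


increases = {name: pct_increase(old, new) for name, (old, new) in specs.items()}

# Derived figure (not from Tachyum): cores per square millimeter of die.
old_cores, new_cores = specs["cores"]
old_area, new_area = specs["die_area_mm2"]
density_old = old_cores / old_area
density_new = new_cores / new_area
density_gain = pct_increase(density_old, density_new)

for name, value in increases.items():
    print(f"{name}: +{value:.0f}%")
print(f"core density: {density_old:.3f} -> {density_new:.3f} cores/mm2 "
      f"(+{density_gain:.0f}%)")
```

Under these figures, core count, cache, and SERDES each grow by 50% while die area grows by 20%, which works out to roughly 25% more cores per unit of die area.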
Tachyum asserts that although it could squeeze more of its universal cores within the 858 mm2 reticle limit, the performance of all cores would then be constrained by memory bandwidth, even when paired with 16 DDR5 channels operating at a 7200MT/s data transfer rate.
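To put the memory bandwidth constraint in rough numbers, here is a back-of-the-envelope sketch of theoretical peak bandwidth. It assumes each of the 16 channels is 64 bits (8 bytes) wide, which is the conventional width of a DDR5 DIMM channel; Tachyum has not stated how it counts channels, so treat that width, and every name in the code, as an assumption. Sustained bandwidth in practice would be lower than this peak.

```python
CHANNELS = 16              # from the article
TRANSFER_RATE_MT_S = 7200  # from the article (DDR5-7200)
BYTES_PER_TRANSFER = 8     # ASSUMPTION: 64-bit data width per channel


def peak_bandwidth_gb_s(channels, mt_s, bytes_per_transfer):
    """Theoretical peak bandwidth in GB/s (decimal, 1 GB = 1000 MB)."""
    # MT/s times bytes per transfer gives MB/s per channel.
    return channels * mt_s * bytes_per_transfer / 1000


peak = peak_bandwidth_gb_s(CHANNELS, TRANSFER_RATE_MT_S, BYTES_PER_TRANSFER)

# Share of peak bandwidth per core for the old and new core counts.
per_core_128 = peak / 128
per_core_192 = peak / 192

print(f"peak: {peak:.1f} GB/s")
print(f"per core at 128 cores: {per_core_128:.1f} GB/s")
print(f"per core at 192 cores: {per_core_192:.1f} GB/s")
```

Under that assumption the peak is about 921.6 GB/s, so moving from 128 to 192 cores drops the per-core share from about 7.2 GB/s to about 4.8 GB/s, which illustrates why adding still more cores would yield diminishing returns.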
"We have achieved better results and timing with our new EDA physical design tools," said Dr. Radoslav Danilak, founder and CEO of Tachyum. "[…] while we did not have any choice but to change EDA tools, our physical design (PD) team worked hard to redo physical design and optimizations with the new set of PD tools, as we approach volume-level production."
Tachyum's Prodigy is a versatile processor with up to 192 unique 64-bit VLIW cores that boast two 1024-bit vector units, a 4096-bit matrix unit, a 64KB instruction cache, a 64KB data cache, and a 1MB L2 cache. Interestingly, unused L2 caches from other cores can be repurposed as a supplemental L3 cache.
When Prodigy runs native code, proper compiler optimizations can enable 4-way out-of-order processing (despite the fact that VLIW is meant to be in-order). Furthermore, Prodigy's instruction set architecture allows for enhanced parallelism through specialized 'poison bits.'
Perhaps the most interesting peculiarity of the Prodigy processor is that it can emulate x86, Arm, CUDA and RISC-V binaries without compromising performance, according to Tachyum. Despite past challenges faced by VLIW processors emulating x86 code, Tachyum is optimistic about its performance, even if certain translations might cause a 30-40% drop.
Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
gg83 As more compute-specific cores are being added to SoCs, I find it hard to believe this approach of universal cores will work. I hope they do work though.
bit_user I've already learned to ignore any news of Tachyum, until the thing actually reaches the hands of end customers. It's pretty amazing the company is still in business, after so many years of shipping nothing but lofty promises.
Worst of all, the most innovative and daring aspects of their ISA seem to have disappeared in the last iteration of news about them. I don't know how much appetite there is for another "me too" ISA, these days. RISC-V seems to be consuming all the oxygen for "the next big thing", while LoongArch and ELBRUS are probably both just floating along on Chinese government funding.
More troubling is that I don't see any news of them contributing patches towards Linux support. They seem to be focusing on FreeBSD, which is going to seriously limit their market.
bit_user Looking at their website, their last Blog entry is dated June 2022.
The specs claim their CPUs provide "Out-of-Order, 4 instructions per clock", which is old hat. They also claim x86, ARM, and RISC-V "support", which probably means JIT translation, like Apple's Rosetta 2. So, we're basically looking at a core with the same dispatch rate per clock as Sandy Bridge (from 12 years ago) and the added tax of emulation/translation. That's going to get trashed on general-purpose workloads by AMD's Bergamo and Intel's Sierra Forest.
When it comes to AI/Deep learning, they claim 2x 1024-bit vector units per core. Golden Cove (server) and Zen 4 are both at about 1536 bits of total vector execution width, per core. So, not a huge advantage, and it still remains to be seen what their issue rate and latency are, and how rich those instructions are.
Next, it seems to have a rather paltry 1 MiB of L2 + L3 cache per core. Compare that to 5 MiB per core in regular Genoa and about 2 MB per core in Sapphire Rapids. AMD's 3D V-Cache has shown us how sensitive some workloads are to cache.
Finally, they tout a 4096-bit matrix processor per core, which I think is approximately half of Intel's AMX, though I wouldn't be surprised if it supported a wider variety of operations. For deep learning, Sapphire Rapids gains a lot from its optional HBM. Prodigy's 16-channel DDR5-7200 might be nearly comparable in bandwidth, but that also means surpassing Intel's performance is probably unlikely.
The 4096-bit matrix processor is mentioned on the spec sheet of the older, 48-core model:
https://www.tachyum.com/datasheets/Prodigy%20PB%20848%20v1.0_230815.pdf
Since Intel added AMX, Sapphire Rapids is every bit as "universal" as theirs.
gg83 said: As more compute specific cores are being added to SoCs, I find it hard to believe this approach of universal cores will work. I hope they do work though.
I think your skepticism is well-founded. Increasingly, people are going to be using special-purpose accelerators for AI workloads, due to not only the performance but also the efficiency benefits.
In spite of all my nay-saying, I suppose it would probably be quite an accomplishment for a Slovak company to build an entirely new CPU that's even in the same ballpark as AMD and Intel's latest and greatest. It should compare favorably against the latest LoongArch CPUs, as well. I guess we should also mention some of the RISC-V efforts in progress, such as SiPearl.
However, let's just see if they can actually get anything to market. We've seen this story play out so very many times. A CPU startup makes lofty claims, but underestimates the time and complexity involved in actually getting a working CPU to market. By the time they do, the mainstream players have pretty much already passed them by. The only remotely recent examples I can think of that beat the trend were Japanese (Fujitsu A64FX, PEZY Computing, and Preferred Networks).
Findecanor I've read on another site that Tachyum has gone away from VLIW to a more traditional out-of-order architecture. Each instruction is four or eight bytes long.
Emulation of other architectures is supposed to use QEMU. Like on Apple M1, a core can be switched from WMO to TSO mode for running translated x86 code without memory-barrier instructions everywhere.
Personally, as a geek of low-level system things, I am not wowed by promises of performance numbers. I'm more interested in whether its ISA has any benefits (or quirks) for compilers and operating systems compared to other ISAs, and whether there are any features that would make it easier to make programs secure.
IMHO, anything like that could give it another reason for existence than just being fast.
But there is so very little information available about it, so we can't tell.