'GPUs still rule,' asserts graphics guru Raja Koduri in response to a custom AI silicon advocate
But he is also mulling 'a new architecture' that will evolve from lessons learned with GPUs.
Raja Koduri, a GPU veteran who has designed graphics processors for AMD, Apple, ATI, S3 Graphics, and Intel, believes that GPUs will not be replaced by custom-built silicon for artificial intelligence (AI) and high-performance computing (HPC) any time soon. However, new architectures can still be designed on the heels of GPUs to better address these workloads, he believes. In fact, the custom silicon endeavors of AWS, Graphcore, Google, Microsoft, and Tesla probably prove his point about architectures.
"We have heard this statement since 2016, but GPUs still rule," Raja Koduri in response to an X post about the future of AI compute. "Why? I am still learning, but my observations so far: the 'purpose' of purpose-built silicon is not stable. AI is not as static as some people imagined and trivialize [like] 'it is just a bunch of matrix multiplies'."
Originally designed to process highly parallel graphics workloads (and power the best graphics cards), graphics processing units (GPUs) have evolved to accelerate AI and HPC workloads. As a result, AI and HPC software is optimized for GPUs to a large degree, not least because of Nvidia's CUDA dominance. Nvidia's GPUs have become so versatile in adding support for new data formats that it has become inherently harder - even for custom silicon - to compete against them. Meanwhile, AMD's and Intel's AI and HPC GPUs have yet to take off.
Many purpose-built silicon solutions lack important architectural support, which shifts the burden to software developers. That shift is problematic, Koduri opined, because little new system software talent is entering the field and everyone ends up competing for the same small, aging pool of existing experts.
"The system architecture — things like page tables, memory management, interrupt handling, debugging — of GPUs evolved over two decades and is a necessary evil to support production software stacks," Koduri wrote. "Many of the purpose-built silicon is deficient here and throw the burden onto 'software' people. There is not much new young system software talent coming into the workforce these days, so everyone competes for the same small pool of aging talent."
While Koduri's comments make a lot of sense, he sits on the board of directors of Tenstorrent, a developer of custom-built AI accelerators and HPC CPUs based on the RISC-V instruction set architecture, which makes his statements somewhat controversial. Still, he expressed optimism about new architectures that will address AI and HPC workloads. These future architectures would ideally emerge from the lessons learned so far, offering a new purpose and overcoming the current limitations of both GPUs and purpose-built silicon.
"But I am still optimistic that a new architecture with new purpose will evolve from the lessons learned so far," the renowned GPU specialist teased.
Anton Shilov is a contributing writer at Tom's Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.
bit_user: The big deficiency GPUs have on AI workloads is data locality. Data movement is energy-intensive. If you look at the dataflow architectures that have come onto the AI scene, they all have a lot more local SRAM distributed among the processing elements in order to address this. Nvidia has massively boosted the amount of SRAM it incorporates into its GPUs over the past couple of generations, but it's still as little as 1/10th of what some purpose-built AI chips have.
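To make the data-movement point concrete, here's a rough back-of-the-envelope sketch in plain Python (illustrative numbers only, not modeled on any specific GPU or accelerator) of how much off-chip DRAM traffic a matrix multiply generates with no on-chip reuse versus with tiles held in local SRAM:

```python
# Rough model of off-chip traffic for C = A @ B in FP16, comparing no on-chip
# reuse against square tiles held in local SRAM. Purely illustrative numbers;
# not calibrated to any particular chip.

def dram_traffic_bytes(M, N, K, tile=None, dtype_bytes=2):
    """Approximate bytes moved to/from DRAM for an MxK @ KxN matmul."""
    if tile is None:
        # No reuse: every multiply-accumulate re-reads its A and B operands.
        return (2 * M * N * K + M * N) * dtype_bytes
    T = tile
    # With TxT tiles in SRAM, each A tile is re-read once per column block of C
    # and each B tile once per row block of C.
    a_reads = M * K * (N // T)
    b_reads = K * N * (M // T)
    c_writes = M * N
    return (a_reads + b_reads + c_writes) * dtype_bytes

M = N = K = 4096
for tile in (None, 64, 256):
    label = "no reuse" if tile is None else f"{tile}x{tile} tiles"
    print(f"{label:>13}: ~{dram_traffic_bytes(M, N, K, tile) / 1e9:.2f} GB of DRAM traffic")
```

Quadrupling the tile edge (which takes roughly 16x the SRAM) cuts DRAM traffic by about 4x, which is exactly the lever the dataflow designs with large distributed SRAM are pulling.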
This gets at the heart of GPU vs. AI accelerator architecture, because realtime rendering famously has very little data locality and depends heavily on random-access performance. That's precisely what made GPUs such bandwidth monsters that crypto coins like Ethereum were able to exploit. It's also what's stymied multi-GPU rendering and what's kept multi-die (i.e. multi-compute die) GPUs from breaking into the mainstream.
The main reasons I think Nvidia's GPUs haven't already been displaced by purpose-built AI accelerators are:
1. Nvidia has such momentum, scale, and market dominance that its integration and software support are second to none. People trying to innovate on AI techniques & applications don't want to waste a bunch of time fussing with broken or incomplete software stacks or integration, which has been the bane of AMD's efforts, and I'm sure most custom AI hardware is in even worse shape.
2. Nvidia has so many resources that it can afford to optimize even a sub-optimal architecture to the point where it can compete with anything out there.
3. Right now, most of the big AI users care much more about innovation and are willing to live with high hardware prices and operating costs (mostly due to poor efficiency). If the AI race ever settles down, we could see cost and efficiency bubble up as higher priorities. With large-scale deployment of LLMs, efficiency already seems to be getting a lot of mindshare, though LLMs push the technology so far that I'm not sure anyone (other than possibly Cerebras) has a much more efficient alternative.
"the 'purpose' of purpose-built silicon is not stable. AI is not as static as some people imagined and trivialize 'it is just a bunch of matrix multiplies'."
I've been saying this for ages. People (usually hardware designers) are quick to trivialize the requirements of AI hardware. In actual fact, you need quite a bit of programmability and not just a bunch of SRAM and fast matrix multiplies.
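To put a face on that, here's a toy single-head attention block plus layer norm in plain NumPy (shapes and details are made up for illustration; this isn't any real model's code), with comments marking which steps are matmuls and which are the masking, softmax, and normalization work a pure matmul engine doesn't cover:

```python
import numpy as np

def attention_block(x, Wq, Wk, Wv, Wo):
    q, k, v = x @ Wq, x @ Wk, x @ Wv              # matmuls: easy for a matmul engine
    scores = q @ k.T / np.sqrt(q.shape[-1])       # matmul plus elementwise scaling
    mask = np.triu(np.ones_like(scores), k=1)     # causal mask
    scores = np.where(mask == 1, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax:
    probs = np.exp(scores)                        # exponentials, reductions, divides --
    probs /= probs.sum(axis=-1, keepdims=True)    # none of this is a matmul
    out = probs @ v @ Wo                          # back to matmuls
    mu = out.mean(axis=-1, keepdims=True)         # layer norm: more reductions and
    var = out.var(axis=-1, keepdims=True)         # elementwise math, again not a matmul
    return (out - mu) / np.sqrt(var + 1e-5) + x   # residual add

rng = np.random.default_rng(0)
T, D = 8, 16                                      # tiny toy sizes
x = rng.standard_normal((T, D))
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) for _ in range(4))
print(attention_block(x, Wq, Wk, Wv, Wo).shape)   # (8, 16)
```

And that's before KV caching, quantization, sparsity, or whatever next year's architecture needs, which is where programmability earns its keep.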
Nvidia's GPUs have become so versatile in adding support for new data formats that it has become inherently harder - even for custom silicon - to compete against them.
I think the custom silicon has actually been leading the charge on new data formats. For instance, Google Brain pioneered the bfloat16 (BF16) format back when GPUs still only supported IEEE-754 FP16. There are lots of other examples of data-format innovation from companies like Nervana, Tenstorrent, and some others I'm forgetting.
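For anyone curious what that looks like in practice: BF16 is essentially the top 16 bits of an IEEE-754 FP32 value, trading FP16's extra mantissa bits for FP32's full exponent range. Here's a minimal NumPy sketch (round-to-nearest-even truncation is one common choice; real hardware can differ in rounding and NaN handling):

```python
import numpy as np

def fp32_to_bf16_bits(x):
    """Truncate an FP32 value to BF16 (returned as uint16), rounding to nearest even."""
    bits = int(np.float32(x).view(np.uint32))
    bits += 0x7FFF + ((bits >> 16) & 1)   # round before dropping the low 16 bits
    return np.uint16(bits >> 16)

def bf16_bits_to_fp32(b):
    """Re-expand BF16 bits to FP32 by padding the low 16 bits with zeros."""
    return np.uint32(int(b) << 16).view(np.float32)

for x in (3.14159265, 65504.0, 1e20):
    bf = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
    print(f"{x:>12g} -> {float(bf):g}")
# 3.14159 comes back as 3.140625 (precision lost), while 1e20 survives in range --
# the same value overflows to infinity in FP16, whose maximum finite value is 65504.
```

BF16 keeps only roughly two to three significant decimal digits, but for training neural networks the extra range turned out to matter more than the extra precision.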