Elon Musk plans to scale the xAI supercomputer to a million GPUs — currently at over 100,000 H100 GPUs and counting

Charles Liang of Supermicro and Elon Musk in gigafactory
(Image credit: Charles Liang)

Elon Musk's AI company, xAI, is set to expand its Colossus supercomputer to over one million GPUs, reports the Financial Times. The expanded Colossus machine would rank among the most powerful supercomputers in the world, but getting there will require significant investment, a steady supply of GPUs, and the power and cooling infrastructure to match.

Colossus, which is used to train the large language model behind Grok, already operates over 100,000 Nvidia H100 processors and is set to double that number shortly, becoming the largest supercomputer housed in a single building. The push toward one million GPUs is already underway, though it will take a sizeable amount of time and effort. To accomplish the mission, xAI is working with Nvidia, Dell, and Supermicro. Furthermore, Memphis, Tennessee, where Colossus is located, has reportedly established a dedicated xAI operations team to aid the endeavor.

It is unclear whether xAI plans to use current-generation Hopper or next-generation Blackwell GPUs for the expansion. The Blackwell platform is expected to scale better than Hopper, so it makes more sense to adopt the upcoming technology rather than the current one. In any case, sourcing the additional 800,000 – 900,000 AI GPUs will be hard, as demand for Nvidia's products is overwhelming. Another challenge will be making 1,000,000 GPUs work in concert with maximum efficiency, and here again Blackwell is the better fit.

The financial requirements of this expansion are colossal, of course. Acquiring GPUs that cost tens of thousands of dollars each, alongside the infrastructure for power and cooling, could push investment into the tens of billions of dollars. xAI has raised $11 billion this year and recently secured another $5 billion; the company is currently valued at $45 billion.
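For a sense of scale, here is a minimal back-of-envelope sketch of the GPU hardware cost alone. The per-GPU price and the installed count are illustrative assumptions (the article says only "tens of thousands of dollars each" and "over 100,000" GPUs), not reported figures.

```python
# Rough back-of-envelope estimate of the GPU hardware cost for the expansion.
# All inputs are illustrative assumptions, not figures reported by xAI or Nvidia.

TARGET_GPUS = 1_000_000      # stated goal for Colossus
CURRENT_GPUS = 100_000       # "over 100,000" H100s already installed
PRICE_PER_GPU = 30_000       # assumed ~$30K per accelerator ("tens of thousands of dollars")

additional_gpus = TARGET_GPUS - CURRENT_GPUS
gpu_cost = additional_gpus * PRICE_PER_GPU

print(f"Additional GPUs needed: {additional_gpus:,}")
print(f"GPU hardware alone: ~${gpu_cost / 1e9:.0f} billion")
# Roughly $27 billion for accelerators only, before networking, power, and cooling;
# consistent with the 'tens of billions' figure cited above.
```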

Unlike rivals such as OpenAI, which partners with Microsoft for computing power, and Anthropic, which is backed by Amazon, xAI is building its supercomputing capacity independently. This strategy places the company in a high-stakes race to secure advanced AI hardware, but given the scale of xAI's investments, it actually puts Musk's company ahead of its rivals.

Despite its rapid progress, xAI has faced criticism for allegedly bypassing planning permissions and for the project's strain on the regional power grid. To address these concerns, the company has emphasized grid stability measures, including deploying Tesla's Megapack battery technology to manage power demands.
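To illustrate why the grid is a concern, here is a minimal sketch of the cluster's potential power draw at the one-million-GPU scale. The per-GPU figure and overhead factor are assumptions for illustration (the H100 SXM is commonly rated around 700 W), not numbers from xAI.

```python
# Rough estimate of facility power draw at the one-million-GPU scale.
# The per-GPU power and datacenter overhead factor are illustrative assumptions.

GPU_COUNT = 1_000_000
WATTS_PER_GPU = 700        # approx. TDP of an H100 SXM accelerator (assumed)
OVERHEAD_FACTOR = 1.4      # assumed extra power for CPUs, networking, and cooling

gpu_load_mw = GPU_COUNT * WATTS_PER_GPU / 1e6
facility_mw = gpu_load_mw * OVERHEAD_FACTOR

print(f"GPU load alone: ~{gpu_load_mw:.0f} MW")
print(f"Estimated facility load: ~{facility_mw:.0f} MW")
# Roughly 700 MW of GPUs and close to a gigawatt overall under these assumptions,
# which is why on-site buffering such as Tesla Megapacks matters for grid stability.
```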

While xAI's focus on hardware has earned acclaim, its commercial offerings remain limited. Grok reportedly lags behind leading models like ChatGPT and Google's Gemini in both sophistication and user base. However, investors view Colossus as a foundational achievement that demonstrates xAI's ability to rapidly deploy cutting-edge technology.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • Mindstab Thrull
    Just for lulz...
    Imagine what would happen if he went with Intel for GPUs for this project.
    The Foundry would be very busy for quite a while!
  • Co BIY
    Intel could handle the scale if their hardware was ready.

    I'm pretty sure they have made their pitch already.
  • bit_user
    Mindstab Thrull said:
    Just for lulz...
    Imagine what would happen if he went with Intel for GPUs for this project.
    The Foundry would be very busy for quite a while!
    The Intel product he'd almost certainly use would be Gaudi 3, not their GPUs. However, Gaudi 3 isn't even competitive with H100 and at least some of the tiles are still made by TSMC. Not only that, but it also relies on HBM, which is in critically short supply.

    So, using Gaudi doesn't avoid the TSMC or HBM production bottlenecks and might not even save that much money at these volumes, in spite of what Intel has claimed.