Elon Musk plans to scale the xAI supercomputer to a million GPUs — currently at over 100,000 H100 GPUs and counting

Charles Liang of Supermicro and Elon Musk in a gigafactory
(Image credit: Charles Liang)

Elon Musk's AI company, xAI, plans to expand its Colossus supercomputer to more than one million GPUs, reports the Financial Times. At that scale, Colossus would rank among the most powerful supercomputers in the world, but getting there will require enormous investment, GPU supply, and power and datacenter infrastructure.

Colossus, which trains the large language model behind Grok, already runs more than 100,000 Nvidia H100 GPUs, and a doubling of that count is imminent, which would make it the largest AI supercomputer housed in a single building. Scaling to a million GPUs, however, will take considerably more time and effort. To get there, xAI is working with Nvidia, Dell, and Supermicro, and the city of Memphis, Tennessee, where Colossus is located, has reportedly established a dedicated xAI operations team to support the project.

It is unclear whether xAI plans to use current-generation Hopper or next-generation Blackwell GPUs for the expansion. The Blackwell platform is expected to scale better than Hopper, so adopting the newer technology would make more sense. Either way, sourcing an additional 800,000 to 900,000 AI GPUs will be difficult, as demand for Nvidia's products is overwhelming. Another challenge is making 1,000,000 GPUs work in concert at high efficiency, and here, again, Blackwell has the advantage.
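To put the scale in perspective, here is a rough back-of-envelope sketch of aggregate cluster throughput. The per-GPU figures are Nvidia's published dense BF16 peak numbers; the scaling-efficiency values are purely illustrative assumptions, since real efficiency depends on interconnect, parallelism strategy, and workload.

```python
# Back-of-envelope: aggregate BF16 throughput of a million-GPU cluster.
# Peak per-GPU figures are Nvidia's published dense BF16 numbers;
# the scaling efficiencies are illustrative assumptions, not measurements.

H100_BF16_TFLOPS = 989     # H100 SXM, dense BF16 peak
B200_BF16_TFLOPS = 2250    # B200, dense BF16 peak

def cluster_exaflops(gpus: int, per_gpu_tflops: float, efficiency: float) -> float:
    """Aggregate throughput in exaFLOPS at a given scaling efficiency."""
    return gpus * per_gpu_tflops * efficiency / 1e6  # 1e6 TFLOPS = 1 EFLOPS

for eff in (0.9, 0.5, 0.3):
    print(f"1M H100s @ {eff:.0%} efficiency: "
          f"{cluster_exaflops(1_000_000, H100_BF16_TFLOPS, eff):,.0f} EFLOPS")
print(f"1M B200s @ 50% efficiency: "
      f"{cluster_exaflops(1_000_000, B200_BF16_TFLOPS, 0.5):,.0f} EFLOPS")
# A drop from 50% to 30% efficiency costs ~200 EFLOPS at this scale,
# which is why interconnect and software scaling dominate the design.
```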

The financial requirements of this expansion are colossal, of course. Acquiring the GPUs, which cost tens of thousands of dollars each, along with the power and cooling infrastructure to support them, could push the investment into the tens of billions of dollars. xAI has raised $11 billion this year and recently secured another $5 billion, and the company is currently valued at $45 billion.
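The "tens of billions" figure is easy to sanity-check. The sketch below uses assumed per-GPU prices and an assumed overhead multiplier for power, cooling, and networking; actual pricing is negotiated and not public.

```python
# Rough cost range implied by "tens of thousands of dollars each".
# Unit prices and the overhead multiplier are assumptions for illustration.

gpus_to_add = (800_000, 900_000)   # range cited in the article
unit_price = (25_000, 40_000)      # assumed per-GPU price in USD
overhead = 1.5                     # assumed multiplier for power, cooling, networking

low = gpus_to_add[0] * unit_price[0] * overhead
high = gpus_to_add[1] * unit_price[1] * overhead
print(f"Implied buildout: ${low/1e9:.0f}B - ${high/1e9:.0f}B")
# -> roughly $30B - $54B, consistent with "tens of billions"
```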

Unlike rivals such as OpenAI, which partners with Microsoft for computing power, and Anthropic, which is backed by Amazon, xAI is building its supercomputing capacity independently. That strategy puts the company in a high-stakes race to secure advanced AI hardware, but given the scale of xAI's investments, it could also give Musk's company an edge over its rivals in directly owned compute.

Despite its rapid progress, xAI has faced criticism for allegedly bypassing planning permissions and for the project's strain on the regional power grid. To address these concerns, the company has emphasized grid-stability measures, including deploying Tesla Megapack batteries to buffer power demand.
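A rough sense of the power problem: the sketch below assumes a typical published per-GPU power draw, an assumed facility overhead (PUE), and Tesla's published Megapack discharge rating; all three are approximations, not measured figures for Colossus.

```python
# Sketch of why battery buffering matters at this scale.
# Per-GPU power is the H100 SXM TDP; PUE is an assumed facility overhead;
# the Megapack rating is Tesla's published 2-hour discharge figure.

GPU_POWER_KW = 0.7   # H100 SXM TDP, ~700 W
PUE = 1.3            # assumed overhead for cooling, networking, losses
MEGAPACK_MW = 1.9    # Tesla Megapack discharge rating (2-hour configuration)

def facility_mw(gpus: int) -> float:
    """Total facility draw in MW, including assumed overhead."""
    return gpus * GPU_POWER_KW * PUE / 1000  # kW -> MW

for n in (100_000, 1_000_000):
    mw = facility_mw(n)
    print(f"{n:>9,} GPUs: ~{mw:,.0f} MW; "
          f"~{mw / MEGAPACK_MW:,.0f} Megapacks to buffer full load")
```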

While xAI's focus on hardware has earned acclaim, its commercial offerings remain limited. Grok reportedly lags behind leading offerings like OpenAI's ChatGPT and Google's Gemini in both sophistication and user base. However, investors view Colossus as a foundational achievement that demonstrates xAI's ability to rapidly deploy cutting-edge technology.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • Mindstab Thrull
    Just for lulz...
    Imagine what would happen if he went with Intel for GPUs for this project.
    The Foundry would be very busy for quite a while!
  • Co BIY
    Intel could handle the scale if their hardware was ready.

    I'm pretty sure they have made their pitch already.
  • bit_user
    Mindstab Thrull said:
    Just for lulz...
    Imagine what would happen if he went with Intel for GPUs for this project.
    The Foundry would be very busy for quite a while!
    The Intel product he'd almost certainly use would be Gaudi 3, not their GPUs. However, Gaudi 3 isn't even competitive with H100 and at least some of the tiles are still made by TSMC. Not only that, but it also relies on HBM, which is in critically short supply.

    So, using Gaudi doesn't avoid the TSMC or HBM production bottlenecks and might not even save that much money at these volumes, in spite of what Intel has claimed.