Microsoft Azure CTO claims distribution of AI training is needed as AI datacenters approach power grid limits

(Image credit: Microsoft)

The rapid expansion of generative AI models requires ever more powerful hardware. With the rise of datacenters housing hundreds of thousands of AI GPUs, these models are quickly pushing the limits of current datacenter infrastructure, and they could soon hit the limits of the power grid. While AWS, Microsoft, and Oracle plan to use nuclear power plants to power their datacenters, Microsoft Azure's chief technology officer, Mark Russinovich, suggests that connecting multiple datacenters may soon be necessary to train advanced AI models, reports Semafor.

Modern AI datacenters, such as those built by Elon Musk's companies Tesla and xAI, can house 100,000 Nvidia H100 or H200 GPUs, and as American giants compete to train the industry's best AI models, they are going to need even more AI processors working in concert as a unified system. As a consequence, datacenters are becoming ever more power hungry due to the growing number of processors, the higher power consumption of those processors, and the power required to cool them. As a result, datacenters consuming multiple gigawatts of power could soon become a reality. But the U.S. energy grid is already under strain, especially during periods of high demand such as hot summer days, and there are concerns that it may not be able to keep up.

To address these challenges, Microsoft is making significant investments in energy infrastructure. Recently, the company signed a deal to reopen the Three Mile Island nuclear power plant to secure a more stable energy supply, and before that it had invested tens of billions of dollars in the development of AI infrastructure. But that may not be enough, and at some point huge companies will have to connect multiple datacenters to train their most sophisticated models, says Microsoft Azure's CTO.

"I think it is inevitable, especially when you get to the kind of scale that these things are getting to," Russinovich told Semafor. "In some cases, that might be the only feasible way to train them is to go across datacenters, or even across regions. […] I do not think we are too far away."  

On paper, this approach would address the growing strain on power grids and sidestep the physical limits of centralized AI training. However, it comes with major technical challenges of its own, particularly ensuring that the datacenters stay synchronized and maintain the high communication speeds required for effective AI training.

The communication between thousands of AI processors within a single datacenter is already a challenge, and spreading this process across multiple sites only adds complexity. Advances in fiber optic technology have made long-distance data transmission faster, but managing it across multiple locations remains a significant hurdle. To mitigate these issues, Russinovich suggests that the datacenters in a distributed system would need to be relatively close to one another. Implementing this multi-datacenter approach would also require collaboration across multiple teams within Microsoft and its partner OpenAI, meaning decentralized AI training methods would have to be developed in-house.
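Russinovich's point about communication speed can be made concrete with a back-of-envelope calculation. The sketch below is a minimal illustration, not anything from the article: the model size, gradient precision, and link bandwidth are all assumed figures, and it models only a naive full gradient exchange, ignoring compression, overlap with compute, and topology-aware collectives that real systems would use.

```python
# Back-of-envelope estimate of cross-datacenter gradient synchronization time.
# All numbers below are illustrative assumptions, not figures from the article.

def sync_time_seconds(num_params: float, bytes_per_param: int, link_gbps: float) -> float:
    """Time to ship one full set of gradients over an inter-datacenter link.

    num_params      -- model parameter count (assumed)
    bytes_per_param -- gradient precision, e.g. 2 bytes for FP16 (assumed)
    link_gbps       -- usable inter-datacenter bandwidth in gigabits per second (assumed)
    """
    payload_bits = num_params * bytes_per_param * 8  # total gradient size in bits
    return payload_bits / (link_gbps * 1e9)          # transfer time at line rate

# Hypothetical 1-trillion-parameter model, FP16 gradients, a 400 Gb/s link:
t = sync_time_seconds(1e12, 2, 400)
print(f"{t:.1f} s per naive full gradient exchange")  # prints "40.0 s per naive full gradient exchange"
```

Even under these generous assumptions, a naive exchange costs tens of seconds per training step, which is why distributed training research focuses on reducing how much and how often datacenters must communicate.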

There is a catch to decentralized AI training methods. Once developed, they offer a potential way to reduce reliance on the most advanced GPUs and large-scale datacenters, which could lower the barrier to entry for smaller companies and individuals looking to train AI models without massive computational resources. Interestingly, Chinese researchers have reportedly already used decentralized methods to train AI models across multiple datacenters, though details are scarce.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • hotaru251
    or wait to progress ai until we have a court case over if training data on others use is legal or not.

    be a shame if they spent all this on ai then courts rule in favor of the people and the training they did has to be scrapped.

    edit: or ya know...start requiring the major users of the power to be required to fund opening nuclear plants to take burden off the grid and benefit everyone long term
  • rluker5
    So where is the pot of gold at the end of this rainbow?
    Maybe they will make AI fast enough to crack encryption and loot? There would be money in that. More monitoring and better targeted ads? I don't think there is enough money left on the table there to pay for all of this. Is it just riding a hype wave to boost equity values? That could be worth it in the short term but is really just a waste of resources.

    Maybe somebody else has something better.
  • Pierce2623
    So companies are still throwing unlimited money at this with no indications of where the monetization is going to come from?