Microsoft Azure CTO says distributed AI training will be necessary as AI datacenters approach power grid limits

(Image credit: Microsoft)

The rapid expansion of generative AI models requires increasingly powerful hardware, and datacenters housing hundreds of thousands of AI GPUs are quickly pushing the limits of current datacenter infrastructure; soon they could hit the limits of the power grid itself. While AWS, Microsoft, and Oracle plan to use nuclear power plants to power their datacenters, Microsoft Azure's chief technology officer, Mark Russinovich, suggests that connecting multiple datacenters may soon be necessary to train advanced AI models, Semafor reports.

Modern AI datacenters, such as those built by Elon Musk's companies Tesla and xAI, can house 100,000 Nvidia H100 or H200 GPUs, and as American tech giants compete to train the industry's best AI models, they will need even more AI processors working in concert as a unified system. As a consequence, datacenters are becoming ever more power hungry due to the growing number of processors, the higher power consumption of those processors, and the power required to cool them. Datacenters consuming multiple gigawatts of power could soon become a reality. But the U.S. energy grid is already under strain, especially during periods of high demand such as hot summer days, and there are concerns that it may not be able to keep up.
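For a rough sense of scale, here is a back-of-the-envelope sketch; the per-GPU wattage and PUE figures below are assumptions for illustration, not numbers from the article or from Microsoft.

```python
# Back-of-the-envelope power estimate for a 100,000-GPU AI datacenter.
# Assumptions: ~700 W TDP per H100 SXM GPU, PUE (power usage
# effectiveness, i.e. cooling/overhead multiplier) of ~1.3.
gpus = 100_000
gpu_watts = 700      # assumed per-GPU draw (H100 SXM TDP)
pue = 1.3            # assumed facility overhead multiplier

it_power_mw = gpus * gpu_watts / 1e6     # GPUs alone: ~70 MW
facility_mw = it_power_mw * pue          # with overhead: ~91 MW

print(f"GPUs alone: {it_power_mw:.0f} MW, whole facility: {facility_mw:.0f} MW")
# Scaling toward millions of accelerators, or to next-generation GPUs
# drawing 1 kW or more each, is what pushes totals into gigawatt territory.
```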

To address these challenges, Microsoft is making significant investments in energy infrastructure. The company recently signed a deal to restart the Three Mile Island nuclear power plant to secure a more stable energy supply, and before that it had invested tens of billions of dollars in AI infrastructure. But that may not be enough, and at some point large companies will have to connect multiple datacenters to train their most sophisticated models, says Microsoft Azure's CTO.

"I think it is inevitable, especially when you get to the kind of scale that these things are getting to," Russinovich told Semafor. "In some cases, that might be the only feasible way to train them is to go across datacenters, or even across regions. […] I do not think we are too far away."  

On paper, this approach would ease the growing strain on power grids and sidestep the limits of concentrating AI training in a single facility. However, the strategy comes with major technical hurdles of its own, particularly keeping datacenters synchronized and sustaining the high communication speeds required for effective AI training.
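To illustrate why synchronization is the sticking point, here is a minimal sketch of synchronous data-parallel training with PyTorch's torch.distributed, assuming (hypothetically) that each rank runs in a different datacenter; the gradient all-reduce in every step is the collective that would have to cross inter-datacenter links, so its cost is gated by WAN bandwidth and latency. This is an illustrative assumption, not Microsoft's actual training setup.

```python
# Minimal sketch: synchronous data-parallel training in which each rank
# could, hypothetically, live in a different datacenter. The all_reduce
# below is the step that must traverse inter-datacenter links every
# iteration, which is where bandwidth and latency become the bottleneck.
import torch
import torch.distributed as dist

def train_step(model, batch, targets, loss_fn, optimizer):
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()

    # Average gradients across all ranks before the optimizer step.
    # Across regions, this collective is bounded by WAN round-trip
    # latency and bandwidth rather than local interconnect speed.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    # Assumes the usual torchrun environment variables (RANK, WORLD_SIZE,
    # MASTER_ADDR, MASTER_PORT); the "gloo" backend runs on CPU for demo.
    dist.init_process_group(backend="gloo")
    model = torch.nn.Linear(1024, 1024)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x, y = torch.randn(32, 1024), torch.randn(32, 1024)
    print(train_step(model, x, y, torch.nn.functional.mse_loss, opt))
    dist.destroy_process_group()
```

Launched with, for example, torchrun --nproc_per_node=2 on one machine, the same code demonstrates the pattern locally; spreading ranks across regions changes nothing in the code, only the cost of the collective.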

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.