Date: 2023-10-25
Demand for AI is immense these days. French firm Schneider Electric estimates that AI workloads will draw around 4.3 GW of power in 2023, slightly less than the nation of Cyprus consumed in 2021 (4.7 GW). The company anticipates that the power consumption of AI workloads will grow at a compound annual growth rate (CAGR) of 26% to 36%, which suggests that by 2028 AI workloads will consume between 13.5 GW and 20 GW, more than Iceland consumed in 2021.
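The 2028 range quoted above can be roughly reproduced by compounding the 2023 figure at the two growth rates. The sketch below is only an illustration of that arithmetic, assuming five full years of growth (2023 to 2028); the function and variable names are made up for this example and are not taken from Schneider Electric's report.

```python
def project(start_gw, cagr, years):
    """Compound a starting power figure at a fixed annual growth rate."""
    return start_gw * (1.0 + cagr) ** years


START_GW = 4.3  # 2023 AI workload estimate quoted in the article
YEARS = 5       # assumption: 2023 -> 2028 counted as five growth periods

low = project(START_GW, 0.26, YEARS)   # lower CAGR bound
high = project(START_GW, 0.36, YEARS)  # upper CAGR bound

print(f"2028 projection: {low:.1f} GW to {high:.1f} GW")
```

Under these assumptions the result lands close to the 13.5 GW to 20 GW range quoted in the article; the small gap at the low end presumably comes from rounding in the published figures.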
Massive Power Requirements
In 2023, the total power consumption of all datacenters is estimated at 54 GW, with AI workloads accounting for 4.3 GW of this demand, according to Schneider Electric. This means that AI workloads will be responsible for approximately 8% of total datacenter power consumption this year. Within these AI workloads, training consumes 20% of the power and inference the remaining 80%.
Looking ahead to 2028, Schneider projects that the total power consumption of datacenters will rise to 90 GW, with AI workloads consuming between 13.5 GW and 20 GW of this total. This indicates that by 2028 AI could be responsible for around 15% to 20% of total datacenter power usage, a significant increase in AI's share over the five-year period. The distribution between training and inference is expected to shift slightly, with training consuming 15% of the power and inference accounting for 85%, according to estimates by Schneider Electric.
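The shares and the training/inference split above are simple ratios, and they can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in the article; the helper names are hypothetical.

```python
def share_pct(part_gw, total_gw):
    """Return part as a percentage of total."""
    return 100.0 * part_gw / total_gw


def split_gw(ai_gw, training_fraction):
    """Split AI power into (training, inference) given the training fraction."""
    training = ai_gw * training_fraction
    return training, ai_gw - training


# 2023: 4.3 GW of AI out of 54 GW total, 20% training
share_2023 = share_pct(4.3, 54.0)
train_2023, infer_2023 = split_gw(4.3, 0.20)

# 2028: 13.5 GW to 20 GW of AI out of 90 GW total, 15% training
share_2028_low = share_pct(13.5, 90.0)
share_2028_high = share_pct(20.0, 90.0)
train_2028_low, infer_2028_low = split_gw(13.5, 0.15)
train_2028_high, infer_2028_high = split_gw(20.0, 0.15)

print(f"2023 AI share: {share_2023:.1f}%")
print(f"2028 AI share: {share_2028_low:.1f}% to {share_2028_high:.1f}%")
print(f"2023 training/inference: {train_2023:.2f} / {infer_2023:.2f} GW")
print(f"2028 training: {train_2028_low:.2f} to {train_2028_high:.2f} GW")
```

With these inputs the 2023 share comes out at about 8% and the low end for 2028 at 15%, matching the text. The high end works out to roughly 22% of 90 GW, a little above the quoted 20%; the article does not explain the difference.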
AI GPUs Get Hungrier
The escalating power consumption in AI datacenters is primarily attributed to the intensification of AI workloads, advancements in AI GPUs and AI processors, and the increasing requirements of other datacenter hardware. For example, whereas Nvidia's A100 from 2020 consumed up to 400W, the H100 from 2022 consumes up to 700W. In addition to GPUs, AI servers also run power-hungry CPUs and network cards.
AI workloads, especially those associated with training, require substantial computational resources, including specialized servers equipped with AI GPUs, specialized ASICs, or CPUs. The size of an AI cluster, which depends on the complexity and size of the AI models, is a major determinant of power consumption. Larger AI models need more GPUs, which increases overall energy requirements. For instance, a cluster with 22,000 H100 GPUs uses about 700 racks. An H100-based rack populated with eight HPE Cray XD670 GPU-accelerated servers has a total rack density of 80 kW. The whole cluster demands approximately 31 MW of power, excluding the energy required for additional infrastructure such as cooling, Schneider Electric notes.
These clusters and GPUs often run at nearly full capacity throughout training, so average energy usage is close to peak power consumption. Schneider Electric's document specifies that rack densities in large AI clusters vary between 30 kW and 100 kW, depending on the number and model of GPUs.
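To put the cluster figures above in perspective, the sketch below derives a few per-rack and per-GPU averages from them. It uses only numbers quoted in the article (22,000 GPUs, about 700 racks, about 31 MW, up to 700W per H100); the variable names are illustrative, and the averages are back-of-the-envelope values rather than figures from Schneider Electric.

```python
GPUS = 22_000       # H100 GPUs in the example cluster
RACKS = 700         # approximate rack count quoted in the article
CLUSTER_MW = 31.0   # approximate cluster power, excluding cooling
GPU_PEAK_W = 700    # H100 peak power quoted in the article

gpus_per_rack = GPUS / RACKS                 # average GPUs per rack
avg_rack_kw = CLUSTER_MW * 1000 / RACKS      # average power per rack
per_gpu_kw = CLUSTER_MW * 1000 / GPUS        # cluster power per GPU, all-in
gpu_only_mw = GPUS * GPU_PEAK_W / 1_000_000  # GPUs alone at peak power
gpu_fraction = gpu_only_mw / CLUSTER_MW      # share of cluster power in GPUs

print(f"GPUs per rack (avg):  {gpus_per_rack:.1f}")
print(f"Rack density (avg):   {avg_rack_kw:.1f} kW")
print(f"Power per GPU all-in: {per_gpu_kw:.2f} kW")
print(f"GPUs alone at peak:   {gpu_only_mw:.1f} MW ({gpu_fraction:.0%})")
```

On these numbers the average rack sits at roughly 44 kW, well below the 80 kW quoted for a rack fully populated with eight XD670 servers, so the 80 kW figure is best read as a dense configuration rather than the cluster average. The GPUs alone account for about half of the 31 MW, with the rest going to CPUs, networking, and other server hardware.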
Network latency also plays a crucial role in the power consumption of AI datacenters. A sophisticated network infrastructure is essential to support the high-speed data communication that powerful GPUs require during distributed training. The need for high-speed network cables and infrastructure, such as links supporting speeds of up to 800 Gb/s, further increases overall energy consumption.
Because AI workloads rely on power-hungry ASICs, GPUs, CPUs, network cards, and SSDs, cooling poses a major challenge. Given the high rack densities and the immense heat generated during computation, effective cooling is imperative to maintain performance and prevent hardware malfunctions or failures. Air and liquid cooling are themselves 'expensive' in terms of power, so they also contribute heavily to the power consumption of datacenters used for AI workloads.
Some Recommendations
Schneider Electric does not expect the power consumption of AI hardware to fall anytime soon, and the company fully expects the power consumption of an AI rack to reach 100 kW or higher. As such, Schneider Electric has some recommendations for datacenters specializing in AI workloads.
In particular, Schneider Electric recommends transitioning from the conventional 120/208V distribution to 240/415V to better accommodate the high power densities of AI workloads. For cooling, a shift from air cooling to liquid cooling is advised to enhance processor reliability and energy efficiency, though immersion cooling might produce even better results. Racks should be more capacious, with specifications such as being at least 750 mm wide and having a static weight capacity greater than 1,800 kg.
Even if it's not replacing humans, the economic benefits of AI imply that it's delivering greater efficiency or else people & businesses wouldn't be willing to pay for it. That typically means less of some sort of resource gets used in the process. So, unless we're talking about AI being used by fossil fuel companies to locate and extract more coal, oil, or gas, there should be some efficiency upside to using AI.
Then, the real question becomes how to make sure that upside is greater than the negative impact it has. And that brings us back to carbon pricing. Yes, I know it'll probably only ever happen long after renewable energy becomes dominant (i.e. due to the influence of fossil fuel lobbyists), but carbon pricing is ultimately the way to help ensure everything in the economy that still uses carbon is delivering more good than bad.
One thing that's kind of sad is that AI chips are being run far beyond their window of good efficiency. I know it's not exactly analogous, but this article showed you could get about 77% of the performance from an RTX 4090 at 50% of the power:
It's just a tool. Whether it's good or bad depends on how we use it.
The cooling and energy requirements in the actual building housing the racks are an interesting challenge that I'm sure Schneider Electric can solve for a tidy but not unreasonable sum.
It's the high price of the processors that results in them being run beyond their electrical efficiency sweet spots. Perhaps when the producers are not as capacity constrained they can market a bigger chip but run it slower and cooler for the same performance.
When the processor/accelerator costs $40,000 then the electrical expense is negligible and will be disregarded.
And ... This is how it should be - We have many more resources to produce electricity than we have to produce semiconductors.
And ... they are much more widely (and equitably) distributed. If we required increased electrical efficiency out of the processors rather than allow users to run them at the max it would increase the power of the Fab giants at the expense of many others.
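For a rough sense of the claim above that electricity is a minor cost next to a $40,000 accelerator, here is a quick back-of-the-envelope sketch. The electricity price of $0.10 per kWh and the round-the-clock 700W draw are assumptions made purely for illustration; neither comes from the article or the comments, and the estimate ignores cooling and the rest of the server.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_energy_cost(power_w, usd_per_kwh):
    """Yearly electricity cost of a device drawing power_w continuously."""
    kwh_per_year = power_w / 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


ACCELERATOR_USD = 40_000  # price mentioned in the comment above
GPU_POWER_W = 700         # H100 peak power quoted in the article
USD_PER_KWH = 0.10        # ASSUMPTION: illustrative electricity price

cost = annual_energy_cost(GPU_POWER_W, USD_PER_KWH)
ratio = cost / ACCELERATOR_USD

print(f"Electricity per year: ${cost:,.0f}")
print(f"Share of accelerator price per year: {ratio:.1%}")
```

Under these assumptions a year of continuous operation costs about $613, or roughly 1.5% of the accelerator's price. A higher electricity price or cooling overhead would raise that figure, but it would need to rise many times over before it rivalled the hardware cost.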