An EPYC Miss? Microsoft Azure Instances Pair AMD's MI300X With Intel's Sapphire Rapids
Genoa sidelined by Sapphire Rapids.
Microsoft's new AI-focused Azure servers are powered by AMD's MI300X datacenter GPUs, but those GPUs are paired with Intel's Sapphire Rapids Xeon CPUs rather than AMD's own chips. AMD's flagship fourth-generation EPYC Genoa CPUs are powerful, but Sapphire Rapids appears to have a couple of key advantages when it comes to keeping AI compute GPUs fed. It's not just Microsoft choosing Sapphire Rapids either: Nvidia also seems to prefer it over AMD's current-generation EPYC chips.
There are likely several factors that convinced Microsoft to go with Intel's Sapphire Rapids instead of AMD's Genoa, but Intel's Advanced Matrix Extensions (AMX) could be among the most important. According to Intel, these instructions accelerate AI and machine-learning tasks by up to seven times.
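As a rough illustration of the kind of work AMX targets (a minimal sketch of our own, not Microsoft's or Intel's code), PyTorch builds backed by Intel's oneDNN library can route bfloat16 matrix math to AMX tile instructions on Sapphire Rapids, while the same script falls back to ordinary vector units on other CPUs:

```python
import torch

# A host-side inference step; the model and sizes are arbitrary examples.
model = torch.nn.Linear(4096, 4096)
x = torch.randn(64, 4096)

# CPU autocast runs eligible ops in bfloat16; on Sapphire Rapids,
# oneDNN can dispatch these matmuls to AMX BF16 tiles.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```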
While Sapphire Rapids isn't particularly efficient and has worse multi-threaded performance than Genoa, its single-threaded performance is strong. That advantage isn't specific to AI workloads; it's a general benefit in any compute that's bound by a few fast threads.
It's worth noting that servers using Nvidia's datacenter-class GPUs also tend to go with Sapphire Rapids, including Nvidia's own DGX H100 systems. Nvidia CEO Jensen Huang said the "excellent single-threaded performance" of Sapphire Rapids was a specific reason he wanted Intel's CPUs for the DGX H100 rather than AMD's.
The new Azure instances also feature Nvidia's Quantum-2 CX7 InfiniBand networking, bringing together hardware from all three tech giants. It goes to show that in the cutting-edge world of AI, companies want the best overall hardware for the job and aren't particularly picky about who makes it, rivalries notwithstanding.
With eight MI300X GPUs containing 192GB of HBM3 memory each, these AI-oriented Azure instances offer a combined 1,536GB of VRAM, which is crucial for training AI. All this VRAM was likely a big reason why Microsoft selected the MI300X instead of Nvidia's Hopper GPUs: even the latest and greatest H200 carries only 141GB of HBM3e per GPU, significantly less than the MI300X.
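A hedged back-of-envelope sketch of why per-GPU capacity matters (the 70-billion-parameter model here is our illustrative assumption, not a workload Microsoft has named): weights alone at 16-bit precision already approach the H200's per-GPU limit while fitting comfortably on one MI300X.

```python
# Illustrative capacity math for a hypothetical 70B-parameter model.
params = 70e9
bytes_per_param = 2  # FP16/BF16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~140 GB: fits in one 192GB MI300X
```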
Microsoft also praised AMD's open-source ROCm software. AMD has been hard at work bringing ROCm to parity with Nvidia's CUDA stack, which largely dominates professional and datacenter GPU compute. That Microsoft is putting its faith in ROCm is perhaps a sign that AMD's hardware-software ecosystem is improving rapidly.
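One concrete reason that parity matters (a minimal sketch assuming a ROCm build of PyTorch; the sizes are arbitrary): AMD's PyTorch builds expose GPUs through the same torch.cuda namespace that CUDA-targeted code uses, via the HIP runtime, so much existing code runs unmodified on an MI300X.

```python
import torch

# On ROCm builds of PyTorch, "cuda" devices are AMD GPUs driven through HIP.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
y = x @ x  # dispatched to rocBLAS on AMD hardware, cuBLAS on Nvidia

print("HIP runtime:", getattr(torch.version, "hip", None))  # version string on ROCm builds
```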
Matthew Connatser is a freelance writer for Tom's Hardware US. He writes articles about CPUs, GPUs, SSDs, and computers in general.
-
bit_user Not sure AMX is a big driver, since any processing where it would provide a big advantage would be better done on the GPU.
Single-threaded performance is a plausible explanation, but I also wonder if it has anything to do with Intel's Xeon Max models (which include up to 64 GB of HBM). Or maybe it's just that too many customers still have Intel-centric middleware / management infrastructure.
It'd sure be interesting to know, since Genoa-X would seem a natural fit, with all its cores, L3 cache, memory bandwidth, and PCIe lanes.
-
2Be_or_Not2Be I'd be far more willing to believe that Nvidia made the business decision to use Intel CPUs instead of AMD EPYC CPUs with their DGX systems just so they don't benefit their competitor. EPYC mostly beats Sapphire Rapids on a perf/watt basis, so I don't think that's a detractor. Intel's instruction set doesn't seem like the big reason either, if the AI loads are running primarily on the GPUs.
-
thestryker There are a lot of specific reasons to use one over the other, but I wonder how many of these systems were already planned before the late vulnerability that forced Intel to delay the SPR release and re-ramp high volume.
bit_user said: "I also wonder if it has anything to do with Intel's Xeon Max models (which include up to 64 GB of HBM)"
It's not this for Microsoft: they're using Xeon 8480C CPUs, which is likely a semi-custom design for them.
-
phitinh81 The main reason is that MSFT got a great deal on these Xeons, the same as Nvidia did with its DGX systems. People run these VMs for GPU-intensive tasks like training AI models, so a top-tier CPU is a waste. This is basic knowledge. Twisting it in Intel's favor is silly.
-
cyrusfox
phitinh81 said: "The main reason is that MSFT got a great deal on these Xeons... Twisting it in Intel's favor is silly."
How is it twisting? For Nvidia it was a clear decision; for Microsoft, which is already using AMD GPUs, it's a bit perplexing to pin the choice on Intel's merits. This is a rare win for Intel on the server/datacenter side, so an article like this is warranted.
The reason they both chose Intel, I imagine, is a combination of price (economics) and platform support/stability rather than features or performance, for what are essentially AI-heavy machines. AMD doesn't care to lower margins to compete, and it likely doesn't need to; it's probably supply-constrained on more lucrative datacenter contracts.
-
phitinh81
cyrusfox said: "How is it twisting? ... This is a rare win for Intel on the server/datacenter side, so an article like this is warranted."
Intel is selling Xeons at or below cost; its Data Center financial reports are hard proof. This article questions the choice MSFT & Nvidia made without pointing out the obvious: the price, and the function of the CPU in these machines. If that is not twisting, should I say misleading? The article is of course warranted as the morale boost Intel's server business badly needs. As always :)
-
TerryLaze
cyrusfox said: "This is a rare win for Intel on the server/datacenter side, so an article like this is warranted."
AMD has barely a 25% market share, on revenue, compared to Intel. Rare is... not that.
https://www.tomshardware.com/pc-components/cpus/amd-comes-roaring-back-gains-market-share-in-laptops-pcs-and-server-cpus
cyrusfox said: "AMD doesn't care to lower margins to compete, and it likely doesn't need to; it's probably supply-constrained on more lucrative datacenter contracts."
And yet they reduced their data center margins by a huge amount; operating income dropped to less than half of what it was a year earlier.
This is the nine-months-ended comparison: almost the same revenue, but far less actual money made from it. That is called lower margin.
https://ir.amd.com/news-events/press-releases/detail/1163/amd-reports-third-quarter-2023-financial-results
Segment and Category Information (Data Center, nine months ended, in $ millions):
                   9M 2023   9M 2022
Net revenue         $4,214    $4,388
Operating income      $601    $1,404
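A quick sanity check of that margin claim, using the figures above (my arithmetic, not AMD's reported percentages):

```python
# Operating margin from the nine-month Data Center figures (in $ millions).
rev_2023, rev_2022 = 4214, 4388
op_2023, op_2022 = 601, 1404
print(f"9M 2023 operating margin: {op_2023 / rev_2023:.1%}")  # ~14.3%
print(f"9M 2022 operating margin: {op_2022 / rev_2022:.1%}")  # ~32.0%
```
-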
bit_user
TerryLaze said: "And yet they reduced their data center margins by a huge amount; operating income dropped to less than half of what it was a year earlier."
It's not clear how much of that was due to price reductions vs. cost increases. The list prices of Genoa models extend higher than Milan's, so even flat revenue suggests some loss of volume.
-
NinoPino
phitinh81 said: "Intel is selling Xeons at or below cost."
No company sells actual products below cost; it would be suicidal. Maybe low margins.
phitinh81 said: "Its Data Center financial reports are hard proof. This article questions the choice MSFT & Nvidia made without pointing out the obvious: the price, and the function of the CPU in these machines. If that is not twisting, should I say misleading? The article is of course warranted as the morale boost Intel's server business badly needs. As always :)"
For the rest, I agree with your considerations.