New Intel documents shared by Komachi_Ensaka provide more details on Intel's upcoming "Ponte Vecchio" Xe-HPC graphics card in its Open Accelerator Module (OAM) form factor. Unlike the PCIe form factor, the OAM is geared toward environments where scalability is a top priority.
Ponte Vecchio isn't a small GPU by any means, even though it can fit in the palm of your hand. With over 100 billion transistors, Ponte Vecchio comprises up to 47 tiles (or chiplets, whatever you want to call them): 16 Xe-HPC compute tiles, eight Rambo cache tiles, two Xe base tiles, 11 EMIB links, two Xe Link I/O tiles and eight HBM stacks.
While Intel has teased that Ponte Vecchio delivers up to 1 PetaFLOP of pure performance, the chipmaker has kept the finer details under wraps. The leaked documents add one more piece to the puzzle: Ponte Vecchio's thermal design power (TDP).
According to the documents shared via Komachi_Ensaka, a respected hardware leaker, Intel will offer Ponte Vecchio as a single OAM that's rated for 600W. It's a pretty substantial TDP, which would explain the need for liquid cooling. At this time, it's uncertain if Intel will offer Ponte Vecchio with lower thermal requirements, allowing for air cooling.
Unlike the best graphics cards for gaming, the Xe-HPC is all about high-performance computing and will debut in Argonne National Laboratory's upcoming exascale Aurora supercomputer. The supercomputer, which is valued at up to $500 million, features over 9,000 nodes, with each node pairing two core-heavy Intel Xeon Scalable "Sapphire Rapids" processors with six Ponte Vecchio graphics cards. Intel never specified whether Aurora will employ the 600W Ponte Vecchio OAM, but given the magnitude of the infrastructure, it probably will.
Zhiye Liu is a news editor and memory reviewer at Tom’s Hardware. Although he loves everything that’s hardware, he has a soft spot for CPUs, GPUs, and RAM.
thGe17
And ...?
The current leading Top500 system uses 29.9 MW and delivers not even half the performance of Aurora.
Currently a lot of details are missing, but if you compare 1 PFlops FP16 (most likely) at 600 W against Nvidia's top-selling A100 card with 0.31 PFlops FP16 at 400 W, the Intel design is much more efficient.
Intel Xe-HPC, 600 W (presumably): 1.66 TFlops/Watt
Nvidia A100, 400 W: 0.78 TFlops/Watt
AMD Instinct MI100, 300 W: 0.62 TFlops/Watt
RTX 3090, 350 W: 0.81 TFlops/Watt
RX 6900 XT, 300 W: 0.15 TFlops/Watt (no MMA as far as I know)
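For reference, a minimal Python sketch of the arithmetic behind those ratios; the throughput numbers are the peak FP16 figures assumed in the post above, not confirmed specifications.

# Efficiency comparison from assumed peak FP16 throughput (TFLOPS) and board power (W).
cards = {
    "Intel Xe-HPC (presumed)": (1000, 600),   # 1 PFlops FP16 claim, 600 W OAM
    "Nvidia A100":             (312, 400),
    "AMD Instinct MI100":      (185, 300),
    "Nvidia RTX 3090":         (285, 350),
    "AMD RX 6900 XT":          (46, 300),
}
for name, (tflops, watts) in cards.items():
    print(f"{name}: {tflops / watts:.2f} TFlops/Watt")
# Prints roughly 1.67, 0.78, 0.62, 0.81 and 0.15 TFlops/Watt respectively.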
ptmmac
I estimated $150,000 a day in electrical usage, which means more like $250,000 a day to operate including staff and building. If I am in the ballpark, this will cost about the same to operate as it does to purchase if it is used for six years. It will probably be upgraded a good bit during its lifetime, so maybe $1.5 billion to build, update and operate it for that long.
It is hard to conceive of something better to be spending tax dollars on than this kind of basic research infrastructure. Even with Moore's Law slowing appreciably for CPUs, the GPU compute modules are taking up the slack and pushing cutting-edge research forward at almost the same pace.
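As a rough cross-check of the electricity figure, a hedged back-of-envelope sketch; the facility power (60 MW) and electricity rate ($0.10/kWh) are illustrative assumptions, not numbers from the post or from Argonne.

# Hypothetical daily electricity cost for an exascale-class facility.
power_mw = 60          # assumed facility draw, not a disclosed figure
rate_per_kwh = 0.10    # assumed industrial electricity rate in $/kWh
daily_cost = power_mw * 1000 * 24 * rate_per_kwh
print(f"${daily_cost:,.0f} per day")   # ≈ $144,000 per day, in the ballpark of the estimate above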
everettfsargent
"One petaflops is equal to 1,000 teraflops, or 1,000,000,000,000,000 FLOPS."thGe17 said:And ...?
...
"FLOPS can be recorded in different measures of precision, for example, the TOP500 supercomputer list ranks computers by 64 bit (double-precision floating-point format) operations per second, abbreviated to FP64. Similar measures are available for 32-bit (FP32) and 16-bit (FP16) operations. "
There are a bunch of CPUs that use a whole lot of watts that you have sort of left out of your calculations.
We are about to enter exaFLOPS (10^18) territory using FP64 as the metric, as has been the convention for decades now. The A100 does 9.7 teraFLOPS of FP64 compute, which converts to 0.0097 petaFLOPS of FP64 compute. If you want to believe Intel's marketing of a so-called one petaFLOP (~100X faster than the A100), then I have a bridge to sell you located on some prime Louisiana beachfront property.
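The ~100X figure follows directly from the quoted numbers, assuming (as this post does) that Intel's one-petaFLOP claim is also an FP64 figure:

# Ratio of a claimed 1 PFLOPS to the A100's peak FP64 throughput.
a100_fp64_pflops = 9.7 / 1000    # 9.7 TFLOPS = 0.0097 PFLOPS
claimed_pflops = 1.0
print(claimed_pflops / a100_fp64_pflops)   # ≈ 103x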
https://en.wikipedia.org/wiki/TOP500
https://en.wikipedia.org/wiki/FLOPS
Bad rumors start from bad rumor sites/blogs/tweets like this one.
"On 18 March 2019, the United States Department of Energy and Intel announced the first exaFLOPS supercomputer would be operational at Argonne National Laboratory by the end of 2021. The computer, named Aurora is to be delivered to Argonne by Intel and Cray (now Hewlett Packard Enterprise), and is expected to use Intel Xe GPGPUs alongside a future Xeon Scalable CPU, and cost US$600 Million.
On 7 May 2019, the U.S. Department of Energy announced a contract with Cray (now Hewlett Packard Enterprise) to build the Frontier supercomputer at Oak Ridge National Laboratory. Frontier is anticipated to be operational in 2021 and, with a performance of greater than 1.5 exaFLOPS, should then be the world's most powerful computer.
On 4 March 2020, the U.S. Department of Energy announced a contract with Hewlett Packard Enterprise and AMD to build the El Capitan supercomputer at a cost of US$600 million, to be installed at the Lawrence Livermore National Laboratory (LLNL). It is expected to be used primarily (but not exclusively) for nuclear weapons modeling. El Capitan was first announced in August 2019, when the DOE and LLNL revealed the purchase of a Shasta supercomputer from Cray. El Capitan will be operational in early 2023 and have a performance of 2 exaFLOPS. It will use AMD CPUs and GPUs, with 4 Radeon Instinct GPUs per EPYC Zen 4 CPU, to speed up artificial intelligence tasks. El Capitan should consume around 40 MW of electric power."
https://en.wikipedia.org/wiki/Exascale_computing
Do you still think that Intel has a so-called one petaFLOP GPU or some such (e.g., compute unit or CU)? How does AMD/Nvidia still manage to win contracts for even bigger and faster supercomputers in the face of Intel's so-called one petaFLOP CU?
I am on a roll ...
https://en.wikipedia.org/wiki/Aurora_(supercomputer)
So more than 9,000 CPU nodes, "with each node being composed of two Intel Xeon Sapphire Rapids processors, six Xe GPUs," works out to at least 6 × 9,000 = 54,000 Xe's ...
Raja Koduri Teases "Petaflops in Your Palm" Intel Xe-HPC Ponte Vecchio GPU ...
https://www.techpowerup.com/280106/raja-koduri-teases-petaflops-in-your-palm-intel-xe-hpc-ponte-vecchio-gpu
So, "Petaflops in Your Palm" would translate into at least two petaFLOPS of FP64 compute, but since Intel marketing always lies, we will go with FP16, so each Xe has at least one half petaFLOPS of FP64 compute (e. g. at least two petaFLOPS of FP16 compute) ... so 0.5 *54000 = 27 exaFLOPS of FP64 compute for just the Xe's! Someone is lying to you and it is not the DOE or the DOD.
Straight from the proverbial horse's mouth (2021-03-25) ...
https://www.alcf.anl.gov/news/preparing-exascale-aurora-supercomputer-help-scientists-visualize-spread-cancer
"The U.S. Department of Energy’s (DOE) Argonne National Laboratory will be home to one of the nation’s first exascale supercomputers when Aurora arrives in 2022. ... Randles is one of a select few researchers chosen to take part in the ALCF’s Aurora Early Science Program (ESP). Her project will be among the first to run on Aurora, which will be delivered to Argonne in 2022. "
https://www.anl.gov/article/preparing-for-exascale-aurora-supercomputer-to-help-scientists-visualize-the-spread-of-cancer
So the actual date is sometime in 2022.
barryv88
When AMD uses chiplets, Intel mocks them by calling their method "glue". When Intel does the same, all is fine. Double standards much? Or just a bunch of hypocrites who figured out that the enemy's methods just work far better? I wonder. I wonder...
thGe17
everettfsargent said:
"One petaflops is equal to 1,000 teraflops, or 1,000,000,000,000,000 FLOPS."
"FLOPS can be recorded in different measures ..."
Is there a point in your "story" or is this simply complete nonsense?
I explicitly wrote about FP16 performance (half precision) because of "a massive processor featuring 47 components, over 100 billion transistors, and offering PetaFLOPS-class AI performance".
Whether this claim is correct or not is a completely different story, but it's not unlikely, because Nvidia's A100 already offers 0.31 PFlops (~1/3 PFlops) from a monolithic 826 mm² chip via its Tensor Cores v3 and MMA operations.
And in my initial post I compared all the example designs consistently, according to their best FP16 performance. So, what didn't you understand?
"Do you still think that Intel has a so-called one petaFLOP GPU ..."
Of course I do, because you simply haven't understood what I was writing about. There's nothing wrong with the numbers, and it is also very likely that such a big design even exceeds 1.0 PFlops of FP16 performance with MMA operations. (Btw, it's funny how much you emphasize the "s" in "Petaflops in Your Palm"; that claim would already be fulfilled at just 1.1 PFlops. ;))
And no, nobody is lying to me (or anybody else). The problem here is that you mix up different workloads, types of calculations and units.
And maybe you missed this fact during your misguided quotation orgy: the Top500 system Summit uses Volta cards (GV100), which only provide Tensor Cores v1 but still deliver 0.13 PFlops of FP16 performance for AI workloads. The system has over 27,600 GPUs in total and therefore a combined/theoretical (GPU-only) performance of ~3,450 PFlops, or ~3.5 ExaFlops, already!
With its current Rmax value of ~149 PFlops, the system is still No. 2 on the Top500 list, but again, that is FP64 performance, not the FP16 performance that I was referring to, and evidently so was R. Koduri.
Similarly, Nvidia's own DGX supercomputer Selene (currently No. 5 in the Top500) uses the latest A100 cards. It has an Rmax of ~63 PFlops, but: "By that metric, using the A100's 3rd generation tensor core, Selene delivers over 2,795 petaflops, or nearly 2.8 exaflops, of peak AI performance." **)
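For what it's worth, a quick sketch of where those aggregate figures come from; the GPU counts and per-card FP16 values are the ones used in this thread, not audited specifications.

# GPU-only aggregate FP16/AI throughput, in PFlops.
summit = 27_600 * 0.125   # V100 (Tensor Cores v1): ~0.13 PFlops FP16 each
selene = 4_480 * 0.31     # A100 (Tensor Cores v3): ~0.31 PFlops FP16 each, dense
print(summit)   # ≈ 3,450 PFlops ≈ 3.5 ExaFlops
print(selene)   # ≈ 1,389 PFlops ≈ 1.4 ExaFlops (≈ 2.8 ExaFlops with Nvidia's 2x sparsity factor)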
To do the math, with a lot of uncertainty because many assumptions are involved:
- approx. 9,000+ nodes for Aurora
- 1 node with 6x Xe-HPC (and 2x Xeon)
- assuming only 1 PFlops FP16 per Ponte Vecchio
- also assuming this big 600 W version is used ***)
- therefore about 4+ kW per node
Resulting in:
- 36+ MW
- 54,000 PFlops, or about 54 ExaFlops, of FP16/AI performance via GPGPUs
- about 1.5 ExaFlops/MW (Selene achieves about 1.08 ExaFlops/MW; the much older Summit achieves only about 0.34 ExaFlops/MW)
Estimates for Rmax and the Top500 ranking are more problematic, because Intel hasn't disclosed any FP64 performance numbers for HPC and the size of a single HPC compute tile is unknown, so it doesn't make sense to extrapolate from Xe-HP's roughly 10 TFlops of FP32 per compute tile. (Additionally, the composition of function units in HPC and HP will most likely differ considerably.)
But we can reverse the process and assume that the system will reach exactly 1 ExaFlops of FP64 performance, and no more.
If we ignore the Xeons, we have about 54,000 Xe-HPC packages/sockets in use, and to reach this goal a single "chip" only has to achieve about 18.5 TFlops FP64, which is already in the range of today's hardware *) and therefore nothing special. In fact, it is more likely that such a massive "chip" will achieve even more, and that the whole system will exceed 1.0 ExaFlops of FP64 (Rmax) performance.
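A minimal sketch of both directions of that estimate; every input is an assumption made in this post (node count, GPUs per node, per-GPU FP16 throughput, node power), not a disclosed specification.

# Forward estimate: aggregate FP16 throughput and power from per-node assumptions.
nodes, gpus_per_node = 9_000, 6
gpu_fp16_pflops, node_kw = 1.0, 4.0        # ~1 PFlops FP16 per GPU; ~4 kW per node (6x 600 W GPUs plus 2 Xeons)
gpus = nodes * gpus_per_node               # 54,000 Xe-HPC packages
fp16_ef = gpus * gpu_fp16_pflops / 1000    # 54 ExaFlops FP16/AI from the GPUs alone
power_mw = nodes * node_kw / 1000          # 36 MW
print(gpus, fp16_ef, power_mw, fp16_ef / power_mw)   # 54000, 54.0, 36.0, 1.5 ExaFlops/MW

# Reverse estimate: FP64 per GPU needed for the GPUs alone to deliver 1 ExaFlops Rmax.
print(1e18 / gpus / 1e12)                  # ≈ 18.5 TFLOPS FP64 per package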
*) Note: The Instinct MI100 has 11.5 TFlops of FP64 peak performance. An A100 has 9.7 TFlops, but for the Ampere design that is only half the truth, because the chip has additional FP64 functionality inside the Tensor Cores v3. (With FP64 MMA operations, the A100 can theoretically reach up to 19.5 TFlops.)
**) I assume Nvidia calculates the performance in this quote with the sparsity feature in mind. Without it (i.e., basic FP16/bfloat16 performance via Tensor Cores v3), the system should achieve up to 1.4 ExaFlops. Still quite impressive for this relatively small system. (The system only uses about 4,480 A100s.)
***) Intel has already stressed that it is quite flexible with Ponte Vecchio-like designs because of Foveros/EMIB, so it can custom-tailor different designs for different customers and use cases. For example, it may be possible to also provide an AI- or FP64-focused design with much more performance instead of a general-purpose/all-in-one design.
everettfsargent "54,000 PFlops or about 54 ExaFlops AI performance via GPGPUs"Reply
Aurora,. in total, will barely manage one exaFLOP in FP64 . You need to stop lying. -
thGe17
Wow, you've managed to demonstrate your dyslexia twice now ... or do you have another problem? You should learn to read properly, because you are still comparing apples and oranges.
And frankly, it is quite impertinent to accuse me of lying when you obviously do not understand simple facts, even though they have been described extensively.
Aurora will be an ExaFlops system, and it will achieve this performance not only theoretically but also as Rmax in the Top500 (simply because the contract demands at least a 1 ExaFlops FP64/HPC system).
And as stated before, currently available and much older systems already easily exceed 1 ExaFlops of FP16/AI performance, so it is quite obvious that a new system with a completely new architecture and modern lithography will surpass those values by far. The "54 ExaFlops FP16/AI" figure is therefore very likely, and maybe even too conservative, because the A100 already achieves 0.31 PFlops with a single, less complex, monolithic chip.
everettfsargent
I am using the only reported metric that really counts, and that is FP64. Unless you can change that reported metric, you can just go away.