Nvidia pulled back part of the curtain to reveal its long-anticipated Volta GPU architecture, showing the GV100 GPU and the first derivative product, the Tesla V100, here at GTC in San Jose. Nvidia first dropped the Volta name at GTC in 2013, and it's taken the company four years to get to the juicy details. If you're a gamer, don't get too excited yet, though; Nvidia is still pitching Pascal-derived products (only a year old, or less) for you. If you work in the AI and high-performance computing (HPC) markets, however, this first phase of Volta is coming your way.
The Volta GV100 GPU Architecture
The Volta GV100 GPU uses the 12nm TSMC FFN process, has over 21 billion transistors, and is designed for deep learning applications. We're talking about an 815mm² die here, which pushes the limits of TSMC's current capabilities. Nvidia said it's not possible to build a larger GPU on the current process technology. Before the GV100, the GP100 was the largest GPU Nvidia had ever produced, with a 610mm² die that housed 15.3 billion transistors. The GV100 is roughly a third larger.
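To see how the two dies compare, here is a minimal Python sketch that works only from the figures quoted in this article. It takes "over 21 billion" to mean the 21.1 billion listed in the spec table, and the dictionary and function names are ours, purely for illustration.

```python
# Die figures as quoted in the article. "Over 21 billion" is taken to be
# the 21.1 billion transistors listed in the spec table.
gv100 = {"area_mm2": 815, "transistors": 21.1e9}
gp100 = {"area_mm2": 610, "transistors": 15.3e9}


def density_millions_per_mm2(chip):
    """Transistor density in millions of transistors per square millimeter."""
    return chip["transistors"] / chip["area_mm2"] / 1e6


# Relative growth from GP100 to GV100, as a fraction (0.25 means 25%).
area_growth = gv100["area_mm2"] / gp100["area_mm2"] - 1
transistor_growth = gv100["transistors"] / gp100["transistors"] - 1

print(f"Die area growth:   {area_growth:.1%}")
print(f"Transistor growth: {transistor_growth:.1%}")
print(f"GV100 density: {density_millions_per_mm2(gv100):.1f} M/mm^2")
print(f"GP100 density: {density_millions_per_mm2(gp100):.1f} M/mm^2")
```

On these numbers the die grows by about a third while density barely moves (roughly 26 versus 25 million transistors per square millimeter), which suggests most of the extra transistor budget comes from sheer area rather than from the process shrink.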
Volta’s full GV100 GPU sports 84 SMs (each SM features four texture units, 64 FP32 cores, 64 INT32 cores, and 32 FP64 cores) fed by 128KB of shared L1 cache per SM that can be configured to varying texture cache and shared memory ratios. The GP100 featured 60 SMs and a total of 3,840 CUDA cores. The Volta SMs also feature a new type of core, the Tensor core, which specializes in the 4x4 matrix operations used in deep learning. The GV100 contains eight Tensor cores per SM, which together deliver up to 120 TFLOPS for training and inference operations. To save you some math, this brings the full GV100 GPU to an impressive 5,376 FP32 cores, 5,376 INT32 cores, 2,688 FP64 cores, and 336 texture units.
Like the GP100, we get two SMs per TPC, for 42 TPCs overall in the GV100, and those roll up into six GPCs.
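If you'd rather check that math than take it on faith, the roll-up from per-SM resources to whole-chip totals is easy to sketch. The snippet below uses only the per-SM counts, SM count, and TPC/GPC figures quoted above; the `SMConfig` class and `gpu_totals` helper are hypothetical names of our own, not anything from Nvidia's tooling.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SMConfig:
    """Per-SM resources as quoted in the article for Volta."""
    fp32_cores: int = 64
    int32_cores: int = 64
    fp64_cores: int = 32
    tensor_cores: int = 8
    texture_units: int = 4


def gpu_totals(sm_count, sm=SMConfig()):
    """Multiply every per-SM resource by the number of SMs on the chip."""
    return {name: sm_count * count for name, count in asdict(sm).items()}


FULL_GV100_SMS = 84
SMS_PER_TPC = 2
GPC_COUNT = 6

totals = gpu_totals(FULL_GV100_SMS)
tpc_count = FULL_GV100_SMS // SMS_PER_TPC   # SMs pair up into TPCs
tpcs_per_gpc = tpc_count // GPC_COUNT       # TPCs are spread across the GPCs

for name, value in totals.items():
    print(f"{name}: {value:,}")
print(f"TPCs: {tpc_count}, TPCs per GPC: {tpcs_per_gpc}")
```

The same helper works for any cut-down part: pass a smaller SM count and the totals scale accordingly.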
GV100 also features four HBM2 memory stacks, like the GP100, with each stack controlled by a pair of memory controllers. That makes eight 512-bit memory controllers, giving this GPU a total memory bus width of 4,096 bits. Each memory controller is attached to 768KB of L2 cache, for a total of 6MB of L2 cache (vs 4MB for Pascal).
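The memory subsystem figures follow the same multiply-it-out pattern. Here is a short sketch using only the numbers in the paragraph above; the constant names are ours.

```python
# Memory subsystem figures as quoted in the article.
HBM2_STACKS = 4
CONTROLLERS_PER_STACK = 2
CONTROLLER_WIDTH_BITS = 512
L2_PER_CONTROLLER_KB = 768

controllers = HBM2_STACKS * CONTROLLERS_PER_STACK
bus_width_bits = controllers * CONTROLLER_WIDTH_BITS
l2_total_mb = controllers * L2_PER_CONTROLLER_KB / 1024  # 1MB = 1024KB

print(f"Memory controllers: {controllers}")
print(f"Total bus width:    {bus_width_bits} bits")
print(f"Total L2 cache:     {l2_total_mb:g} MB")
```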
The new Nvidia Tesla V100 enables 80 of the GV100's 84 SMs, for a total of 5,120 CUDA cores. Even so, it has the potential to reach 7.5, 15, and 120 TFLOPs in FP64, FP32, and Tensor computations, respectively.
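Nvidia's peak figures can be sanity-checked against each other by asking what clock speed each one implies. The sketch below rests on two assumptions that are not stated in this article: that every FP32 or FP64 core retires one fused multiply-add (counted as two FLOPs) per clock, and that each Tensor core completes one 4x4x4 matrix multiply-accumulate (64 fused multiply-adds) per clock. The result is a back-of-the-envelope value derived from rounded TFLOPs, not a published clock speed.

```python
# Tesla V100 configuration as quoted in the article.
SM_COUNT = 80
FP32_CORES_PER_SM = 64
FP64_CORES_PER_SM = 32
TENSOR_CORES_PER_SM = 8

# Assumption: one fused multiply-add (FMA) counts as two FLOPs.
FLOPS_PER_FMA = 2
# Assumption: a Tensor core finishes one 4x4x4 matrix multiply-accumulate
# per clock, which is 4 * 4 * 4 = 64 FMAs.
FMAS_PER_TENSOR_CORE_CLOCK = 4 * 4 * 4


def implied_clock_ghz(peak_tflops, units, fmas_per_unit_per_clock=1):
    """Clock (GHz) at which `units` execution units reach `peak_tflops`."""
    flops_per_clock = units * fmas_per_unit_per_clock * FLOPS_PER_FMA
    return peak_tflops * 1e12 / flops_per_clock / 1e9


fp32_clock = implied_clock_ghz(15, SM_COUNT * FP32_CORES_PER_SM)
fp64_clock = implied_clock_ghz(7.5, SM_COUNT * FP64_CORES_PER_SM)
tensor_clock = implied_clock_ghz(
    120, SM_COUNT * TENSOR_CORES_PER_SM, FMAS_PER_TENSOR_CORE_CLOCK
)

print(f"Implied clock from FP32 peak:   {fp32_clock:.3f} GHz")
print(f"Implied clock from FP64 peak:   {fp64_clock:.3f} GHz")
print(f"Implied clock from Tensor peak: {tensor_clock:.3f} GHz")
```

Under those assumptions, all three peak figures point to the same clock of a little under 1.5 GHz, so the quoted numbers are at least internally consistent.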
The Tesla V100 sports 16GB of HBM2 memory, which is capable of reaching up to 900 GB/s. The Samsung memory that Nvidia installed on the Tesla V100 is also 180 GB/s faster than the memory found on the Tesla P100 cards. Nvidia said it used the fastest memory available on the market.
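For a sense of what "fastest memory available" means per pin, the quoted bandwidth can be divided back across the bus. This sketch assumes both cards use the 4,096-bit interface listed in the spec table and that the quoted GB/s figures are peak theoretical bandwidth; the function name is ours.

```python
BUS_WIDTH_BITS = 4096  # listed for both cards in the spec table


def per_pin_gbps(bandwidth_gb_s, bus_width_bits=BUS_WIDTH_BITS):
    """Per-pin data rate in Gbit/s implied by a total bandwidth in GB/s."""
    return bandwidth_gb_s * 8 / bus_width_bits  # 8 bits per byte


v100_bandwidth = 900                    # GB/s, as quoted
p100_bandwidth = v100_bandwidth - 180   # the P100 is quoted as 180 GB/s slower

v100_pin_rate = per_pin_gbps(v100_bandwidth)
p100_pin_rate = per_pin_gbps(p100_bandwidth)

print(f"Tesla V100: {v100_bandwidth} GB/s -> {v100_pin_rate:.2f} Gbit/s per pin")
print(f"Tesla P100: {p100_bandwidth} GB/s -> {p100_pin_rate:.2f} Gbit/s per pin")
```

Because the bus width is unchanged, the entire bandwidth gain comes from a faster per-pin data rate.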
The Tesla V100 also introduces the second generation of NVLink, which allows for up to 300 GB/s of aggregate bidirectional bandwidth over six NVLinks per GPU, each running at 25 GB/s in each direction.
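Six links at 25 GB/s only add up to 300 GB/s if the per-link figure is counted once per direction. A minimal sketch of that arithmetic, assuming the 25 GB/s is a per-direction rate and the 300 GB/s headline counts both directions (the function name is ours):

```python
def nvlink_aggregate_gb_s(links, per_direction_gb_s):
    """Total NVLink bandwidth in GB/s, counting both directions of every link."""
    directions = 2
    return links * per_direction_gb_s * directions


v100_nvlink = nvlink_aggregate_gb_s(links=6, per_direction_gb_s=25)
print(f"Tesla V100 aggregate NVLink bandwidth: {v100_nvlink} GB/s")
```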
To put those numbers into perspective, Nvidia's Pascal-derived Tesla P100 sports 56 SMs and 3,584 CUDA cores, which produce up to 5.3 TFLOPs in FP64 computations and 10.6 TFLOPs in FP32 computations. Going by those figures, the V100 offers roughly 40% more computational capability than the P100 in both FP32 and FP64. Nvidia also raised the Tesla V100's NVLink bandwidth by adding two NVLinks per GPU compared to the Tesla P100 (a 50% increase in link count) and by increasing the bandwidth of each NVLink by 5 GB/s.
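The generational uplifts can be worked out directly from the quoted figures. In the sketch below, the P100's NVLink configuration (four links at 20 GB/s each) is inferred from the statement that the V100 adds two links and 5 GB/s per link to reach six links at 25 GB/s; the dictionary keys and the `uplift` helper are ours.

```python
# Peak figures as quoted in the article.
v100 = {"fp64_tflops": 7.5, "fp32_tflops": 15.0,
        "nvlinks": 6, "gb_s_per_link": 25}
# P100 link count and per-link rate are inferred: two fewer links and
# 5 GB/s less per link than the V100.
p100 = {"fp64_tflops": 5.3, "fp32_tflops": 10.6,
        "nvlinks": 6 - 2, "gb_s_per_link": 25 - 5}


def uplift(new, old):
    """Relative increase from old to new, as a fraction (0.25 means 25%)."""
    return new / old - 1


fp32_uplift = uplift(v100["fp32_tflops"], p100["fp32_tflops"])
fp64_uplift = uplift(v100["fp64_tflops"], p100["fp64_tflops"])
link_count_uplift = uplift(v100["nvlinks"], p100["nvlinks"])
nvlink_uplift = uplift(
    v100["nvlinks"] * v100["gb_s_per_link"],
    p100["nvlinks"] * p100["gb_s_per_link"],
)

print(f"FP32 uplift:             {fp32_uplift:.1%}")
print(f"FP64 uplift:             {fp64_uplift:.1%}")
print(f"NVLink link-count gain:  {link_count_uplift:.1%}")
print(f"NVLink total bandwidth:  {nvlink_uplift:.1%}")
```

With the figures quoted above, both compute uplifts come out at about 41%. The extra links alone account for a 50% gain, and combined with the faster per-link rate, total NVLink bandwidth rises by close to 90%.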
Nvidia said the Tesla V100 carries a TDP of 300W, which is the same power requirement as the Tesla P100.
| Specification | Tesla V100 | Tesla P100 |
|---|---|---|
| Cores | 5,120 (FP32), 2,560 (FP64) | 3,584 (FP32), 1,792 (FP64) |
| TFLOPs | 7.5 (FP64), 15 (FP32), 120 (Tensor) | 5.3 (FP64), 10.6 (FP32) |
| Memory | 16GB 4096-bit HBM2 | 16GB 4096-bit HBM2 |
| Memory Bandwidth | 900 GB/s | 720 GB/s |
| Transistors | 21.1 billion | 15.3 billion |
| Manufacturing Process | 12nm FFN | 16nm FinFET+ |