Nvidia Details Volta GV100 GPU, Tesla V100 Accelerator

Nvidia pulled back part of the curtain to reveal its long-anticipated Volta GPU architecture, showing the GV100 GPU and the first derivative product, the Tesla V100, here at GTC in San Jose. Nvidia first dropped the Volta name at GTC in 2013, and it's taken the company four years to get to the juicy details. If you're a gamer, don't get too excited yet, though; Nvidia is still pitching Pascal-derived products (only a year old, or less) for you. If you work in the AI and high-performance computing (HPC) markets, however, this first phase of Volta is coming your way.

The Volta GV100 GPU Architecture

The Volta GV100 GPU uses the 12nm TSMC FFN process, has over 21 billion transistors, and is designed for deep learning applications. We're talking about an 815mm² die here, which pushes the limits of TSMC's current capabilities. Nvidia said it's not possible to build a larger GPU on the current process technology. The GP100 was the largest GPU that Nvidia ever produced before the GV100. It took up a 610mm² surface area and housed 15.3 billion transistors. The GV100 is more than 30% larger.

Volta’s full GV100 GPU sports 84 SMs (each SM features four texture units, 64 FP32 cores, 64 INT32 cores, and 32 FP64 cores) fed by 128KB of shared L1 cache per SM that can be configured to varying texture cache and shared memory ratios. The GP100 featured 60 SMs and a total of 3,840 CUDA cores. The Volta SMs also feature a new type of core, the Tensor core, which specializes in the 4x4 matrix operations used in deep learning. The GV100 contains eight Tensor cores per SM, which together deliver a total of up to 120 TFLOPS for training and inference operations. To save you some math, this brings the full GV100 GPU to an impressive 5,376 FP32 cores, 5,376 INT32 cores, 2,688 FP64 cores, and 336 texture units.
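The totals above are simply the per-SM resources multiplied by the SM count. Here is a minimal Python sketch of that arithmetic, using only figures quoted in this article; the names `PER_SM` and `gpu_totals` are illustrative and not part of any Nvidia tool.

```python
# Per-SM resources for Volta GV100, as quoted in the article.
PER_SM = {
    "fp32_cores": 64,
    "int32_cores": 64,
    "fp64_cores": 32,
    "tensor_cores": 8,
    "texture_units": 4,
}


def gpu_totals(sm_count: int) -> dict:
    """Scale the per-SM resource counts up to a whole GPU."""
    return {name: count * sm_count for name, count in PER_SM.items()}


full_gv100 = gpu_totals(84)  # full GV100 die
tesla_v100 = gpu_totals(80)  # Tesla V100 ships with 80 SMs enabled

print("Full GV100:", full_gv100)
print("Tesla V100:", tesla_v100)
```

Running the same function with 80 SMs gives the Tesla V100 configuration, which is why that card lists 5,120 FP32 cores and 320 texture units rather than the full die's totals.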

Like the GP100, we get two SMs per TPC, for 42 TPCs overall in the GV100. Those roll up into six GPCs.

GV100 also features four HBM2 memory stacks, like the GP100, with each stack controlled by a pair of memory controllers. Speaking of which, there are eight 512-bit memory controllers (giving this GPU a total memory bus width of 4,096 bits). Each memory controller is attached to 768KB of L2 cache, for a total of 6MB of L2 cache (vs 4MB for Pascal).
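The layout figures in this section can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only; the constant names are ours, and the values are the ones quoted in the article.

```python
# GPU hierarchy of the full GV100, as quoted in the article.
SMS = 84
SMS_PER_TPC = 2
GPCS = 6

tpcs = SMS // SMS_PER_TPC      # 42 TPCs
tpcs_per_gpc = tpcs // GPCS    # 7 TPCs in each GPC

# Memory subsystem: two controllers per HBM2 stack, four stacks.
MEMORY_CONTROLLERS = 8
CONTROLLER_WIDTH_BITS = 512
L2_PER_CONTROLLER_KB = 768

bus_width_bits = MEMORY_CONTROLLERS * CONTROLLER_WIDTH_BITS
l2_total_kb = MEMORY_CONTROLLERS * L2_PER_CONTROLLER_KB
l2_total_mb = l2_total_kb / 1024

print(f"TPCs: {tpcs} ({tpcs_per_gpc} per GPC)")
print(f"Memory bus width: {bus_width_bits} bits")
print(f"L2 cache: {l2_total_kb} KB ({l2_total_mb:g} MB)")
```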


Tesla V100

The new Nvidia Tesla V100 features 80 of the GV100's 84 SMs, for a total of 5,120 CUDA cores. It has the potential to reach 7.5, 15, and 120 TFLOPS in FP64, FP32, and Tensor computations, respectively.
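Those peak numbers can be roughly reproduced from the core counts and the boost clock. The sketch below rests on three assumptions that are not stated in this paragraph: the 1,455MHz boost clock from the spec comparison in this article, one fused multiply-add (two floating-point operations) per core per clock, and 64 fused multiply-adds per Tensor core per clock (the figure cited in the reader comments, consistent with one 4x4x4 matrix multiply-accumulate). The function name `peak_tflops` is ours.

```python
BOOST_CLOCK_HZ = 1.455e9     # assumption: 1,455MHz boost clock
FLOPS_PER_FMA = 2            # one multiply plus one add
TENSOR_FMAS_PER_CLOCK = 64   # assumption: one 4x4x4 matrix FMA per clock


def peak_tflops(units: int, fmas_per_unit_per_clock: int = 1) -> float:
    """Peak TFLOPS for `units` execution units running at the boost clock."""
    flops = units * fmas_per_unit_per_clock * FLOPS_PER_FMA * BOOST_CLOCK_HZ
    return flops / 1e12


fp32 = peak_tflops(5_120)                            # FP32 CUDA cores
fp64 = peak_tflops(2_560)                            # FP64 cores
tensor = peak_tflops(80 * 8, TENSOR_FMAS_PER_CLOCK)  # 8 Tensor cores per SM

print(f"FP32:   {fp32:.2f} TFLOPS")
print(f"FP64:   {fp64:.2f} TFLOPS")
print(f"Tensor: {tensor:.2f} TFLOPS")
```

Under these assumptions each result lands within about 1% of Nvidia's rounded figures, which suggests the quoted peaks are boost-clock numbers rather than sustained rates.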

The Tesla V100 sports 16GB of HBM2 memory, which is capable of reaching up to 900 GB/s. The Samsung memory that Nvidia installed on the Tesla V100 is also 180 GB/s faster than the memory found on the Tesla P100 cards. Nvidia said it used the fastest memory available on the market.

The Tesla V100 also introduces the second generation of NVLink, which allows for up to 300 GB/s of aggregate bandwidth over six NVLinks per GPU, each carrying 25 GB/s in each direction.

To put those numbers into perspective, Nvidia's Pascal-derived Tesla P100 sports 56 SMs and 3,584 CUDA cores, which produce up to 5.3 TFLOPS in FP64 computations and 10.6 TFLOPS in FP32 computations. Going by those peak figures, the V100 offers roughly 40% more FP32 and FP64 computational capability than the P100. Nvidia also added two NVLinks per GPU compared to the Tesla P100 (a 50% increase in link count) and raised the bandwidth of each NVLink by 5 GB/s, which takes aggregate NVLink bandwidth from 160 GB/s to 300 GB/s.
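The generational gains can be recomputed directly from the figures quoted above. This is a minimal sketch: the helper `percent_increase` is ours, and the P100 NVLink configuration is derived from the text (two fewer links, each 5 GB/s slower) rather than stated outright. By this arithmetic, FP32 and FP64 peak throughput both rise by roughly 41%, NVLink bandwidth by 87.5%, and memory bandwidth by 25%.

```python
def percent_increase(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Peak throughput in TFLOPS, as quoted in the article.
fp32_gain = percent_increase(15.0, 10.6)
fp64_gain = percent_increase(7.5, 5.3)

# NVLink: links per GPU times per-link, per-direction bandwidth in GB/s.
# Doubling both sides for bidirectional traffic leaves the ratio unchanged.
v100_nvlink = 6 * 25
p100_nvlink = (6 - 2) * (25 - 5)  # derived: two fewer links, 5 GB/s less each
nvlink_gain = percent_increase(v100_nvlink, p100_nvlink)

# HBM2 memory bandwidth in GB/s.
memory_gain = percent_increase(900, 720)

print(f"FP32:   +{fp32_gain:.1f}%")
print(f"FP64:   +{fp64_gain:.1f}%")
print(f"NVLink: +{nvlink_gain:.1f}%")
print(f"Memory: +{memory_gain:.1f}%")
```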

Nvidia said the Tesla V100 carries a TDP of 300W, which is the same power requirement as the Tesla P100.


| Specification | Tesla V100 | Tesla P100 |
|---|---|---|
| SMs | 80 | 56 |
| Cores | 5,120 (FP32), 2,560 (FP64) | 3,584 (FP32), 1,792 (FP64) |
| Boost Clock | 1,455MHz | 1,480MHz |
| Peak TFLOPS | 7.5 (FP64), 15 (FP32), 120 (Tensor) | 5.3 (FP64), 10.6 (FP32) |
| Texture Units | 320 | 224 |
| Memory | 16GB 4096-bit HBM2 | 16GB 4096-bit HBM2 |
| Memory Bandwidth | 900 GB/s | 720 GB/s |
| Transistors | 21.1 Billion | 15.3 Billion |
| Manufacturing Process | 12nm FFN | 16nm FinFET+ |

  • bit_user
    Quote:
    The GV100 contains eight Tensor cores per SM, and each core delivers up to 120 TFLOPS
    That can't be right.

    Each tensor core generates 64 FMAs per clock, translating to 128 FLOPs. At 8 tensor cores per SM, you get 1024 FLOPs per SM, which works out to 1 TFLOPS @ 1 GHz. Now, if we assume it was 120 TFLOPS for the entire GPU, then that would yield a very reasonable clock speed of 1.43 GHz, assuming all 84 SMs were enabled, or 1.5 GHz assuming 80 SMs.

    That's awfully impressive. I'd bet we're not talking about IEEE 754-compliant floating point, here. They must've cut some things besides denormals to get that much speed up over their normal SIMD units.

    Quote:
    second generation of NVLink ... allows for up to 300 GB/s over six 25GB/s NVLinks per GPU.
    Because it's bidir, and they're counting each direction.
  • PaulAlcorn
    Anonymous said:
    Quote:
    each core delivers up to 120 TFLOPS
    That can't be right.

    Each tensor core generates 64 FMAs per clock, translating to 128 FLOPs. At 8 tensor cores per SM, you get 1024 FLOPs per SM, which works out to 1 TFLOPS @ 1 GHz. Now, if we assume it was 120 TFLOPS for the entire GPU, then that would yield a very reasonable clock speed of 1.43 GHz, assuming all 84 SMs were enabled.


    You're right, it's 120 peak Tensor core TFLOPs for the total card. With the caveat that Peak TFLOP/s rates are based on GPU Boost clock. Fixed :)