
On Tensors, Tensorflow, And Nvidia's Latest 'Tensor Cores'

Nvidia announced a brand new accelerator based on the company’s latest Volta GPU architecture, called the Tesla V100. The chip’s newest breakout feature is what Nvidia calls a “Tensor Core.” According to Nvidia, Tensor Cores can make the Tesla V100 up to 12x faster for deep learning applications compared to the company’s previous Tesla P100 accelerator. (See our coverage of the GV100 and Tesla V100 here.)

Tensors And Tensorflow

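A tensor is a mathematical object represented by an array of components that are functions of the coordinates of a space. In machine learning, the term is usually used more loosely for a multidimensional array of numbers, such as a vector, a matrix, or a higher-dimensional block of values. Google built its own machine learning framework around tensors because expressing neural networks as tensor operations allows them to scale to very large sizes.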
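To make the "array of components" idea concrete, here is a minimal sketch in plain Python that treats nested lists as tensors and reports their shape and rank. The helper name `tensor_shape` is hypothetical and exists only for illustration; it assumes rectangular nesting and is not how Tensorflow represents tensors internally.

```python
def tensor_shape(t):
    """Return the shape of a rectangular nested list as a tuple.

    Illustrative helper only: it inspects the first element at each
    nesting level and assumes every sibling has the same length.
    """
    shape = []
    while isinstance(t, list):
        shape.append(len(t))
        t = t[0] if t else None  # descend one nesting level
    return tuple(shape)


scalar = 3.0                                   # rank 0
vector = [1.0, 2.0, 3.0]                       # rank 1
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # rank 2
# Rank 3: for example, a batch of two 3x4 matrices.
batch = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("batch", batch)]:
    shape = tensor_shape(value)
    print(name, "shape:", shape, "rank:", len(shape))
```

The rank is simply the number of dimensions, so a matrix-matrix multiplication is an operation on rank-2 tensors.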

Google surprised industry analysts when it open sourced its Tensorflow machine learning software library, but this may have been a stroke of genius because Tensorflow quickly became one of the most popular machine learning frameworks used by developers. Google was also using Tensorflow internally, and it benefits Google if more developers know how to use Tensorflow because it increases the potential talent pool for the company to recruit from. Meanwhile, chip companies seem to be optimizing their products, either for Tensorflow directly, or for tensor calculations (as Nvidia is doing with the V100). In other words, chip companies are battling each other to improve Google’s open sourced machine learning framework - a situation that can only benefit Google.

Finally, Google also built its own specialized Tensor Processing Unit, and if the company decides to offer cloud services powered by the TPU, there will be a wide market of developers that could stand to benefit from it (and purchase access to it).

Nvidia Tensor Cores

The Tensor Cores in the Volta-based Tesla V100 are essentially mixed-precision FP16/FP32 cores, which Nvidia has optimized for deep learning applications.

The new mixed-precision cores can deliver up to 120 Tensor TFLOPS for both training and inference applications. According to Nvidia, V100’s Tensor Cores can provide 12x the performance of FP32 operations on the previous P100 accelerator, as well as 6x the performance of P100’s FP16 operations. The Tesla V100 comes with 640 Tensor Cores (eight for each SM).
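The ratios quoted above can be turned into rough absolute numbers with simple arithmetic. The sketch below uses only figures stated in the article (120 Tensor TFLOPS, 12x, 6x, 640 cores, eight per SM); the resulting P100 values are figures implied by those ratios, not official P100 specifications.

```python
# Figures taken from the article.
V100_TENSOR_TFLOPS = 120.0
SPEEDUP_VS_P100_FP32 = 12
SPEEDUP_VS_P100_FP16 = 6
TOTAL_TENSOR_CORES = 640
TENSOR_CORES_PER_SM = 8

# Baselines implied by the quoted ratios (not official P100 specs).
implied_p100_fp32_tflops = V100_TENSOR_TFLOPS / SPEEDUP_VS_P100_FP32
implied_p100_fp16_tflops = V100_TENSOR_TFLOPS / SPEEDUP_VS_P100_FP16

# Number of SMs implied by the Tensor Core counts.
implied_sm_count = TOTAL_TENSOR_CORES // TENSOR_CORES_PER_SM

print("Implied P100 FP32 TFLOPS:", implied_p100_fp32_tflops)
print("Implied P100 FP16 TFLOPS:", implied_p100_fp16_tflops)
print("Implied V100 SM count:", implied_sm_count)
```

In other words, the 12x and 6x claims are consistent with a P100 baseline of roughly 10 TFLOPS in FP32 and 20 TFLOPS in FP16, and the core counts imply 80 SMs on the V100.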

In the chart captioned below, Nvidia shows that for matrix-matrix multiplication (GEMM), an operation commonly used in the training of neural networks, the V100 can be up to 9x faster than the Pascal-based P100 GPU.

Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations

The company said that this result is due to the custom crafting of the Tensor Cores and their data paths to maximize their floating point performance with a minimal increase in power consumption.

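Each Tensor Core performs 64 mixed-precision floating point FMA (fused multiply-add) operations per clock, multiplying in FP16 and accumulating in FP32. Because each FMA counts as two floating point operations (one multiply and one add), the group of eight Tensor Cores in an SM performs a total of 64 x 2 x 8 = 1024 floating point operations per clock.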
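The following is a rough software emulation of one such mixed-precision step, written in plain Python. It assumes the 64 FMAs correspond to a 4x4 matrix multiply-accumulate, D = A x B + C (4 x 4 x 4 = 64), which is how Nvidia describes the operation in its Volta material but is not stated in this article. The function names are hypothetical, and rounding after every addition is a simplification, so treat this as a sketch of the idea rather than a model of the hardware.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tensor_core_fma(a, b, c, acc_round=to_fp32):
    """Emulate D = A*B + C on 4x4 matrices (hypothetical helper).

    Inputs A and B are rounded to FP16 before multiplying; the running
    sum is rounded with acc_round (FP32 by default) after each add.
    Returns the result matrix and the number of FMAs performed.
    """
    n = 4
    fma_count = 0
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = acc_round(c[i][j])
            for k in range(n):
                # The product of two FP16 values is exact in a Python float.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = acc_round(acc + prod)
                fma_count += 1
            d[i][j] = acc
    return d, fma_count


ones = [[1.0] * 4 for _ in range(4)]
c = [[2048.0] * 4 for _ in range(4)]

d_fp32, fma_count = tensor_core_fma(ones, ones, c, acc_round=to_fp32)
d_fp16, _ = tensor_core_fma(ones, ones, c, acc_round=to_fp16)

print("FMAs per 4x4 multiply-accumulate:", fma_count)
print("FP32 accumulate:", d_fp32[0][0])
print("FP16 accumulate:", d_fp16[0][0])
```

The example inputs are chosen to show why the accumulator precision matters: near 2048, adjacent FP16 values are 2 apart, so adding 1 repeatedly in FP16 leaves the sum stuck at 2048, while the FP32 accumulator reaches 2052.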

According to the GPU maker, this is an 8x increase in throughput per SM in Volta, compared to the Pascal architecture. In total, with Volta’s other performance improvements, the V100 GPU can be up to 12x faster for deep learning compared to the P100 GPU.
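The per-clock figures can be checked with a few lines of arithmetic. The sketch below uses the article's numbers plus one outside assumption, labeled in the code: that a Pascal GP100 SM has 64 FP32 cores, each doing one FMA per clock. The clock speed it prints is only the value implied by the 120 Tensor TFLOPS claim, not a quoted specification.

```python
# Figures taken from the article.
FMA_PER_TENSOR_CORE_PER_CLOCK = 64
TENSOR_CORES_PER_SM = 8
TOTAL_TENSOR_CORES = 640
PEAK_TENSOR_TFLOPS = 120.0

# One FMA is counted as two floating point operations (multiply + add).
FLOPS_PER_FMA = 2

# ASSUMPTION (not from the article): a Pascal GP100 SM has 64 FP32
# cores, each performing one FMA per clock.
PASCAL_FP32_CORES_PER_SM = 64

volta_flops_per_sm_per_clock = (
    FMA_PER_TENSOR_CORE_PER_CLOCK * FLOPS_PER_FMA * TENSOR_CORES_PER_SM
)
pascal_flops_per_sm_per_clock = PASCAL_FP32_CORES_PER_SM * FLOPS_PER_FMA
per_sm_speedup = volta_flops_per_sm_per_clock / pascal_flops_per_sm_per_clock

volta_flops_per_gpu_per_clock = (
    FMA_PER_TENSOR_CORE_PER_CLOCK * FLOPS_PER_FMA * TOTAL_TENSOR_CORES
)
# Clock speed needed to reach the quoted peak, in GHz.
implied_clock_ghz = PEAK_TENSOR_TFLOPS * 1e12 / volta_flops_per_gpu_per_clock / 1e9

print("Volta Tensor Core FLOPs per SM per clock:", volta_flops_per_sm_per_clock)
print("Pascal FP32 FLOPs per SM per clock:", pascal_flops_per_sm_per_clock)
print("Per-SM speedup:", per_sm_speedup)
print("Implied clock (GHz):", round(implied_clock_ghz, 3))
```

Under that assumption, the 1024 operations per clock per SM work out to exactly the 8x per-SM gain Nvidia cites, and the 120 Tensor TFLOPS peak implies a clock of a little under 1.5 GHz.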

Developers will be able to program the Tensor Cores directly or make use of V100’s support for popular machine learning frameworks such as Tensorflow, Caffe2, MXNet, and others.

Rise Of The Specialized Machine Learning Chip

In a recent paper, Google revealed that its TPU can be up to 30x faster than a GPU for inference (the TPU can’t train neural networks). As the main provider of chips for machine learning applications, Nvidia took issue with that claim, arguing that some of its existing inference chips were already highly competitive with the TPU.

Nvidia’s case wasn’t that strong, though, considering that it was comparing its chips only for sub-10ms latency (a less important metric for Google in the data center). Nvidia also ignored the power consumption and cost metrics when it made the comparisons.

Nevertheless, Google’s TPU was probably not the biggest threat to Nvidia’s machine learning business. For now, at least, Google doesn’t intend to sell the chip, although the company may use TPUs more and more internally.

A bigger threat to Nvidia may be other companies that develop and sell specialized machine learning chips, offering customers better performance per watt and lower cost than a typical GPU can. Even more worrisome for Nvidia could be the fact that Intel has been snapping up the most interesting of these companies.

Until now, it looked as if Nvidia’s focus on GPUs could cause the company to fall behind in this race. However, with the announcement of the “Tensor Cores,” Nvidia may have made it more difficult for others to beat it in this market (at least for now).