Nvidia CEO Jensen Huang took the stage at GTC Japan to announce the company's latest advancements in AI, including the new Tesla T4 GPU. This new GPU, which Nvidia designed for inference workloads in hyperscale data centers, leverages the same Turing microarchitecture as Nvidia's forthcoming GeForce RTX 20-series gaming graphics cards.
Unlike those gaming cards, the Tesla T4 is designed specifically for AI inference workloads, such as neural networks for video, speech, search, and image processing. Nvidia's previous-gen Tesla P4 filled this role, but Nvidia claims the new model offers up to 12 times the performance within the same power envelope, possibly setting a new bar for power efficiency in inference workloads.
GPU | FP16 (TFLOPS) | INT8 (TOPS) | INT4 (TOPS) |
Nvidia Tesla T4 | 65 | 130 | 260 |
Nvidia Tesla P4 | 5.5 | 22 | - |
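To put the "up to 12 times" claim in context, here is a minimal sketch that works out the generational ratios directly from the table's peak figures. The 75W board power for both cards is taken from the article; the helper names `speedup` and `per_watt` are purely illustrative, and peak numbers say nothing about performance in a real deployment.

```python
# Peak throughput copied from the table above (FP16 in TFLOPS, INT8 in TOPS).
PEAK = {
    "Tesla T4": {"FP16": 65.0, "INT8": 130.0},
    "Tesla P4": {"FP16": 5.5, "INT8": 22.0},
}
BOARD_POWER_WATTS = 75.0  # assumption: the 75 W figure the article cites for both cards


def speedup(precision):
    """Ratio of T4 peak throughput to P4 peak throughput at one precision."""
    return PEAK["Tesla T4"][precision] / PEAK["Tesla P4"][precision]


def per_watt(card, precision):
    """Peak throughput divided by board power (a rough efficiency figure)."""
    return PEAK[card][precision] / BOARD_POWER_WATTS


for precision in ("FP16", "INT8"):
    print(f"{precision}: T4 is {speedup(precision):.1f}x the P4 at peak, "
          f"{per_watt('Tesla T4', precision):.2f} per watt")
```

By these numbers the FP16 column accounts for the roughly 12x headline figure, while the INT8 gain is closer to 6x.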
The Tesla T4 GPU comes equipped with 16GB of GDDR6 that provides up to 320GB/s of bandwidth, 320 Turing Tensor cores, and 2,560 CUDA cores. The T4 features 40 SMs enabled on the TU104 die to optimize for the 75W power profile.
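The core counts line up with the 40 enabled SMs, as this quick check suggests. The per-SM figures (64 CUDA cores and 8 Tensor cores) are an assumption drawn from Nvidia's published Turing SM design rather than from this article.

```python
# Assumption (not from the article): each Turing SM has 64 CUDA cores and 8 Tensor cores.
CUDA_CORES_PER_SM = 64
TENSOR_CORES_PER_SM = 8
ENABLED_SMS = 40  # from the article

cuda_cores = ENABLED_SMS * CUDA_CORES_PER_SM
tensor_cores = ENABLED_SMS * TENSOR_CORES_PER_SM
print(cuda_cores, tensor_cores)
```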
The GPU supports mixed-precision computing across FP32, FP16, and INT8 (performance above). The Tesla T4 also features INT4 and (experimental) INT1 precision modes, which is a notable advancement over its predecessor.
Like its predecessor, the low-profile Tesla T4 consumes just 75 watts and slots into a standard PCIe slot in servers, so it doesn't require an external power source (like a 6-pin connector). The card's low-power design doesn't require active cooling (like a fan); the high linear airflow inside a typical server will suffice. Nvidia tells us that the die does come equipped with RT Cores, just like the desktop models, but that they will be useful for ray tracing or VDI (Virtual Desktop Infrastructure), implying they won't be used for most inference workloads.
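For readers unfamiliar with why lower-precision modes matter, the sketch below shows the basic idea behind INT8 inference: floating-point values are mapped onto a small integer range and scaled back afterward, trading a little accuracy for cheaper arithmetic. This is a generic symmetric quantization scheme with made-up weights, offered as an illustration only; it is not a description of how TensorRT or the T4 hardware calibrates or quantizes models.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.40]  # hypothetical example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale factor, which is why narrower formats such as INT4 demand more care to preserve model accuracy.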
The Tesla T4 also features optimizations for AI video applications. These are powered by hardware transcoding engines that provide twice the performance of the Tesla P4. Nvidia says the cards can decode up to 38 full-HD video streams simultaneously.
Nvidia's TensorRT Hyperscale platform is a collection of technologies wrapped around the T4. As expected, the card supports all the major deep learning frameworks, such as PyTorch, TensorFlow, MXNet, and Caffe2. Nvidia also offers TensorRT 5, a new version of its deep learning inference optimizer and runtime engine that supports Turing Tensor Cores and multi-precision workloads. Nvidia also announced the Turing-optimized CUDA 10, which includes optimized libraries, programming models, and graphics API interoperability.
Nvidia also announced the AGX lineup, a new name for its line of Xavier-based products designed for autonomous machine systems that range from robots to self-driving cars. The lineup includes Drive Xavier and the newly finalized Drive Pegasus, which originally featured two Xavier processors and two Tesla V100 GPUs. Nvidia has now updated the GPUs to Turing models. Nvidia is also offering a similar design, called the Clara Platform, for medical applications. The Clara Platform features a single Xavier processor and Turing GPU.
Thoughts
Nvidia's focus on boosting performance in inference workloads is a strategic move: the company projects the inference market will grow to a $20 billion TAM over the next five years. Meanwhile, Intel claims that most of the world's inference workloads run on Xeon processors, which is likely true given Intel's presence in ~96% of the world's servers. Intel announced during its recent Data-Centric Innovation Summit that the company sold $1 billion in processors for AI workloads in 2017 and expects that number to grow quickly over the coming years.
Inference workloads will be a hotly contested battleground between Nvidia, Intel, and AMD in the future, with Intel having the initial advantage due to its server attach rate. However, low-cost and low-power inference accelerators, such as Nvidia's new Tesla T4, pose a tremendous threat due to their performance-per-watt advantages, and AMD has its 7nm Radeon Instinct GPUs for deep learning coming soon. Several companies, such as Google with its TPUs, are developing their own custom silicon for inferencing workloads. That means it will likely be several years before the clear winners become apparent.