Google Now Offering Pods With 1,000 Cloud TPUs to the Public

(Image credit: Google)

At Google I/O 2019 this week, Google announced that its Cloud TPU (tensor processing unit) pods, featuring both the 2nd and 3rd generations of its TPU chips, are now available in public beta on its Google Cloud Platform. 

Cloud TPU Pods With 1,000 TPUs

TPUs are ASICs that Google developed for machine learning workloads. When Google first announced its TPU chip in 2016, it was a revelation in terms of inference performance: Google claimed up to 30 times higher performance than an Nvidia Kepler GPU (which lacked any inference-specific optimization at the time) and up to 80 times the performance per watt of an Intel Haswell CPU. In 2017, Google announced the second-generation TPU, called the “Cloud TPU.” The new chip could be used not just for inference (running trained neural network models) but also for training.

Now, developers can access either a full 1,000-TPU pod or “slices” of it. Previously, Cloud TPU pods topped out at 256 Cloud TPUs, but it seems Google has now extended its toroidal mesh network across multiple racks, so that a single TPU pod can contain more than 1,000 TPUs. Developers on a budget can rent slices as small as 16 cores (two TPUs).
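
For developers who want to try a slice, below is a minimal sketch of the standard TensorFlow entry point for TPUs. The TPU name “my-tpu,” the zone, and the slice size are illustrative assumptions, and the exact strategy class name has moved between TensorFlow releases.

    # Minimal sketch of attaching a TensorFlow job to a Cloud TPU slice.
    # "my-tpu" is a placeholder for a TPU slice you have already provisioned, e.g.:
    #   gcloud compute tpus create my-tpu --zone=us-central1-a \
    #       --accelerator-type=v3-32 --version=1.14
    import tensorflow as tf

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)  # experimental.TPUStrategy on older TF

    with strategy.scope():
        # Any Keras model built in this scope is replicated across the slice.
        model = tf.keras.applications.ResNet50(weights=None)
        model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")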

Google showed at its I/O event that a 256-TPU Cloud TPU v2 slice can train a standard ResNet-50 image classification model on the ImageNet dataset in 11.3 minutes, while a 256-TPU Cloud TPU v3 slice can do it in 7.1 minutes. According to these numbers, the Cloud TPU v2 is about 60% slower than the Cloud TPU v3 (11.3 ÷ 7.1 ≈ 1.59). On a single TPU, the same model takes 302 minutes to train.
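
As a quick sanity check, here is the arithmetic behind those comparisons, using only the numbers quoted above:

    # Sanity-checking the quoted ResNet-50 training times (minutes).
    v2_slice, v3_slice, single_tpu = 11.3, 7.1, 302.0
    print(f"v2 slice vs. v3 slice: {v2_slice / v3_slice - 1:.0%} longer")  # ~59%
    print(f"256-TPU v2 slice vs. one TPU: {single_tpu / v2_slice:.1f}x")   # ~26.7x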

Information on accessing the public beta is available in Google's blog post.

Google Cloud TPU's Evolution

When it debuted, Google said its 2nd-generation TPU could achieve 180 teraflops (TFLOPS) of floating-point performance per four-chip board (45 TFLOPS per chip), or six times more than Nvidia’s then-new Tesla V100 accelerator for general FP16 half-precision computation, and about 50% more than the V100’s 120-TFLOPS Tensor Core peak. Google designed its Cloud TPU pods with 64 of these TPUs each, for a total peak performance of 11.5 petaFLOPS.

A year later, in 2018, the company announced version 3 of its TPU, rated at 420 TFLOPS per board. The company also announced a new liquid-cooled pod configuration with eight times the performance of the previous one, featuring 256 TPUs and more than 100 petaFLOPS of peak performance.
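
Those pod-level figures follow directly from the per-board ratings; a quick check:

    # Pod peak throughput implied by the per-board ratings quoted above.
    v2_board_tflops, v3_board_tflops = 180, 420
    print(v2_board_tflops * 64 / 1000)    # 11.52 PFLOPS -> the 11.5-PFLOPS v2 pod
    print(v3_board_tflops * 256 / 1000)   # 107.52 PFLOPS -> the "100+ PFLOPS" v3 pod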

Even though Google doesn’t sell Cloud TPUs directly (it only sells an inference-optimized version called the Edge TPU), by giving developers access to them in the cloud, the company is still competing with the likes of Nvidia and Intel, which would rather developers buy their machine learning hardware. TPUs tend to offer better performance per dollar than the alternatives, which should put pressure on machine learning chipmakers to offer higher value.

Google Cloud TPU Use Cases

Google has clarified that not all types of machine learning applications are suited for the Cloud TPU. According to Google, the ones that make the most sense include:

  • Models dominated by matrix computations
  • Models with no custom TensorFlow operations inside the main training loop
  • Models that train for weeks or months
  • Larger and very large models with very large effective batch sizes

Additionally, Google has recommended against using TPUs for workloads such as linear algebra programs that require frequent branching, workloads that access memory in a sparse manner, and workloads that require high-precision arithmetic.
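
To make the distinction concrete, here is a toy contrast (illustrative code, not taken from Google's documentation): the first function is the dense matrix math TPUs are built to accelerate, while the second embeds a custom Python op of the kind Google warns against.

    import tensorflow as tf

    @tf.function
    def tpu_friendly_step(x, w):
        # Dense matrix math: exactly the pattern TPUs accelerate well.
        return tf.nn.relu(tf.matmul(x, w))

    def tpu_unfriendly_step(x):
        # tf.py_function runs arbitrary Python on the host CPU, so a custom
        # op like this inside the main training loop keeps work off the TPU.
        return tf.py_function(lambda t: t.numpy() ** 2, [x], tf.float32)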

Lucian Armasu
Lucian Armasu is a Contributing Writer for Tom's Hardware US. He covers software news and the issues surrounding privacy and security.
  • jakjawagon
    the Cloud TPU v2 is 60% slower than the Cloud TPU v2
    (y)
  • bit_user
    Lucian Armasu said:
    the Cloud TPU v2 is 60% slower than the Cloud TPU v2
    should be "v2 is 60% slower than ... v3".

    Lucian Armasu said:
    Google said then its 2nd-generation TPU could achieve 180 teraflops (TFLOPS) of floating-point performance, or six times more than Nvidia’s latest Tesla V100 accelerator for FP16 half-precision computation. The Cloud TPU also had a 50% advantage over Nvidia’s Tensor Core performance.
    Google was talking about a board with four v2 TPUs on it, which each crank out 45 TFLOPS. So, the entire board was 50% faster than a single (120 TFLOPS) V100.

    Lucian Armasu said:
    A year later, in 2018, the company announced version 3 of its TPU with a performance rated at 420 TFLOPS.
    That also appears to feature boards with four TPUs on them. So, presumably, the single TPU perf jumped to 105 TFLOPS - almost equal with a V100.

    Lucian Armasu said:
    Models with no custom TensorFlowo perating inside the main training loop
    Models that rain for weeks or months
    I think it should be "... no custom TensorFlow operating inside ..."
    and "Models that train for weeks or months".

    Lucian Armasu said:
    Google has recommended against using TPUs for applications such as Linear algebra programs that require frequent branching and workloads that access memory in a sparse manner or require high-precision arithmetic.
    Huh. They even had to say that? You'd think anyone clever enough to port code to a TPU would be clueful enough to foresee such limitations. I guess I can see them getting lots of questions about general-purpose applicability, with such lofty performance numbers.
  • bit_user
    Also, why is this thread in the Storage forum?