Google Now Offering Pods With 1,000 Cloud TPUs to the Public
At Google I/O 2019 this week, Google announced that its Cloud TPU (tensor processing unit) pods, featuring both the 2nd and 3rd generations of its TPU chips, are now available in public beta on its Google Cloud Platform.
Cloud TPU Pods With 1,000 TPUs
TPUs are ASICs Google developed for machine learning workloads. When Google first announced its TPU chip in 2016, the chip was a revelation in terms of inference performance, showing up to 30 times higher performance than an Nvidia Kepler GPU (which lacked any optimization for inference at the time) and 80 times the performance of an Intel Haswell CPU. In 2017, Google announced the second-generation TPU, called the “Cloud TPU.” The new chip could be used not just for inference (running trained machine learning neural network models) but also for training.
Now, developers can access either a full pod of more than 1,000 TPUs or “slices” of it. Previously, Cloud TPU pods topped out at 256 Cloud TPUs, but Google appears to have built a toroidal mesh network spanning multiple racks, so that a single TPU pod can now contain more than 1,000 TPUs. Developers on a budget can access slices as small as 16 cores (two TPUs).
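For a sense of what renting a slice looks like in code, here is a minimal sketch of attaching TensorFlow to one. It assumes a recent TensorFlow 2 release and a hypothetical slice named my-tpu-slice that has already been provisioned on Google Cloud; the slice name is illustrative, not from Google's announcement. (At launch the beta was exposed through the TPUEstimator API in TensorFlow 1.x; the distribution-strategy style shown here is the later TensorFlow 2 equivalent.)

```python
import tensorflow as tf

# "my-tpu-slice" is a placeholder for the name given to the slice when it
# was provisioned on Google Cloud Platform.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu-slice")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# One replica runs per TPU core in the slice; model code written against
# this strategy is replicated across all of them.
strategy = tf.distribute.TPUStrategy(resolver)
print("TPU cores available:", strategy.num_replicas_in_sync)
```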
Google showed at its I/O event that a 256-TPU Cloud TPU v2 slice can train a standard ResNet-50 image classification model on the ImageNet dataset in 11.3 minutes, while a 256-TPU Cloud TPU v3 slice can train it in 7.1 minutes. By these numbers, the Cloud TPU v2 is roughly 60% slower than the Cloud TPU v3 (11.3 ÷ 7.1 ≈ 1.6, i.e., about 60% more time). Using a single TPU, the same model would take 302 minutes to train.
Information for accessing the public beta is available in Google's blog post.
Google Cloud TPU's Evolution
When the Cloud TPU debuted, Google said its second-generation TPU could achieve 180 teraflops (TFLOPS) of floating-point performance, six times more than Nvidia’s then-latest Tesla V100 accelerator managed for FP16 half-precision computation. The Cloud TPU also had a 50% advantage over Nvidia’s Tensor Core performance. Google designed its Cloud TPU pods with 64 TPUs each, for a total peak performance of 11.5 petaFLOPS.
A year later, in 2018, the company announced version 3 of its TPU, rated at 420 TFLOPS. The company also announced a new liquid-cooled pod configuration with eight times the performance of the previous one, featuring 256 TPUs and 100 petaFLOPS of performance.
Even though Google doesn’t sell Cloud TPUs directly (it sells only an inference-optimized version called the Edge TPU), by giving developers access to them in the cloud, the company still competes with companies such as Nvidia and Intel, which would rather developers buy their machine learning hardware instead. The TPUs tend to offer better performance per dollar than the alternatives, which should pressure machine learning chipmakers to deliver higher value.
Google Cloud TPU Use Cases
Google has clarified that not all types of machine learning applications are suited for the Cloud TPU. According to Google, the ones that make the most sense include:
- Models dominated by matrix computations
- Models with no custom TensorFlow operations inside the main training loop
- Models that train for weeks or months
- Larger and very large models with very large effective batch sizes
Additionally, Google has recommended against using TPUs for workloads such as linear algebra programs that require frequent branching, workloads that access memory in a sparse manner, and workloads that require high-precision arithmetic.
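To make Google's checklist concrete, here is a hedged sketch of what a TPU-friendly training setup might look like: built-in Keras layers (all dominated by matrix multiplies), no custom TensorFlow operations inside the training loop, and a large, fixed global batch. The slice name and the synthetic dataset are placeholders for illustration, not part of Google's guidance.

```python
import tensorflow as tf

# Strategy attached to a pod slice as in the earlier sketch
# ("my-tpu-slice" is again a placeholder name).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu-slice")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Built-in Keras layers only: the model is dominated by matrix
    # computations and adds no custom ops to the main training loop.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation="relu", input_shape=(64, 64, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1000),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Synthetic stand-in for a real input pipeline (e.g., ImageNet via tf.data).
# A large global batch with drop_remainder=True keeps shapes static, which
# the TPU compiler needs, and keeps every core busy.
images = tf.random.uniform((1024, 64, 64, 3))
labels = tf.random.uniform((1024,), maxval=1000, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(512, drop_remainder=True)
model.fit(dataset, epochs=1)
```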
Comments

bit_user

Lucian Armasu said: "the Cloud TPU v2 is 60% slower than the Cloud TPU v2"

Should be "v2 is 60% slower than ... v3".

Lucian Armasu said: "Google said then its 2nd-generation TPU could achieve 180 teraflops (TFLOPS) of floating-point performance, or six times more than Nvidia’s latest Tesla V100 accelerator for FP16 half-precision computation. The Cloud TPU also had a 50% advantage over Nvidia’s Tensor Core performance."

Google was talking about a board with four v2 TPUs on it, which each crank out 45 TFLOPS. So, the entire board was 50% faster than a single (120 TFLOPS) V100.

Lucian Armasu said: "A year later, in 2018, the company announced version 3 of its TPU with a performance rated at 420 TFLOPS."

That also appears to feature boards with four TPUs on them. So, presumably, the single TPU perf jumped to 105 TFLOPS - almost equal with a V100.

Lucian Armasu said: "Models with no custom TensorFlowo perating inside the main training loop" and "Models that rain for weeks or months"

I think it should be "... no custom TensorFlow operating inside ..." and "Models that train for weeks or months".

Lucian Armasu said: "Google has recommended against using TPUs for applications such as Linear algebra programs that require frequent branching and workloads that access memory in a sparse manner or require high-precision arithmetic."

Huh. They even had to say that? You'd think anyone clever enough to port code to a TPU would be clueful enough to foresee such limitations. I guess I can see them getting lots of questions about general-purpose applicability, with such lofty performance numbers.