Google has designed its own new processors, the Argos video (trans)coding units (VCU), that have one solitary purpose: processing video. According to a recent report, the highly efficient new chips have allowed the technology giant to replace up to tens of millions of Intel CPUs with its own silicon.
For many years Intel's video decoding/encoding engines that come built into its CPUs have dominated the market both because they offered leading-edge performance and capabilities and because they were easy to use. But custom-built application-specific integrated circuits (ASICs) tend to outperform general-purpose hardware because they are designed for one workload only. As such, Google turned to developing its own specialized hardware for video processing tasks for YouTube, and to great effect.
However, Intel may have a trick up its sleeve with its latest tech that could win back Google's specialized video processing business.
Loads of Videos Require New Hardware
Users upload more than 500 hours of video content in various formats every minute to YouTube. Google needs to quickly transcode that content to multiple resolutions (including 144p, 240p, 360p, 480p, 720p, 1080p, 1440p, 2160p, and 4320p) and data-efficient formats (e.g., H.264, VP9 or AV1), which requires formidable encoding horsepower.
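To put the upload figure in perspective, the short sketch below converts "500 hours per minute" into a multiple of real time and estimates the pixel rate of a single output stream. The frame dimensions and the 30 fps frame rate are assumptions made for illustration (common 16:9 sizes for each label); they do not come from Google, and the function names are hypothetical.

```python
# Figure quoted in the article: 500 hours of video uploaded per minute.
UPLOAD_HOURS_PER_MINUTE = 500

# ASSUMPTION: common 16:9 frame sizes for each resolution label.
LADDER = {
    "144p": (256, 144),
    "240p": (426, 240),
    "360p": (640, 360),
    "480p": (854, 480),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "2160p": (3840, 2160),
    "4320p": (7680, 4320),
}

ASSUMED_FPS = 30  # ASSUMPTION: real uploads vary in frame rate.


def realtime_factor(hours_per_minute: float) -> float:
    """Minutes of video arriving per minute of wall-clock time."""
    return hours_per_minute * 60


def output_mpix_per_second(width: int, height: int, fps: float) -> float:
    """Pixel rate of one output stream in megapixels per second."""
    return width * height * fps / 1e6


if __name__ == "__main__":
    print(f"Ingest runs at {realtime_factor(UPLOAD_HOURS_PER_MINUTE):,.0f}x real time")
    for label, (w, h) in LADDER.items():
        rate = output_mpix_per_second(w, h, ASSUMED_FPS)
        print(f"{label:>6}: {rate:8.2f} MPix/s per real-time stream")
```

Not every upload is transcoded to every rung of the ladder (a 720p source has no reason to be upscaled to 4320p), so these per-stream rates are best read as rough building blocks, not as a total workload estimate.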
Historically, Google had two options for transcoding/encoding content. The first option was Intel's Visual Computing Accelerator (VCA) that packed three Xeon E3 CPUs with built-in Iris Pro P6300/P580 GT4e integrated graphics cores with leading-edge hardware encoders. The second option was to use software encoding and general-purpose Intel Xeon processors.
Google decided that neither option was power-efficient enough for emerging YouTube workloads – the Visual Computing Accelerator was rather power hungry itself, whereas scaling the number of Xeon CPUs essentially meant increasing the number of servers, which meant additional power and datacenter footprint. As a result, Google decided to go with custom in-house hardware.
Google's first-generation Argos VCU does not replace Intel's central processors completely as the servers still need to run the OS and manage storage drives and network connectivity. To a large degree, Google's Argos VCU resembles a GPU that always needs an accompanying CPU.
Instead of stream processors like we see in GPUs, Google's VCU integrates ten H.264/VP9 encoder engines, several decoder cores, four LPDDR4-3200 memory channels (each with a 32-bit interface), a PCIe interface, a DMA engine, and a small general-purpose core for scheduling purposes. Most of the IP, except the in-house designed encoders/transcoders, was licensed from third parties to cut down on development costs. Each VCU is also equipped with 8GB of usable ECC LPDDR4 memory.
The main idea behind Google's VCU is to put as many high-performance encoders/transcoders into a single piece of silicon as possible (while remaining power efficient) and then scale the number of VCUs separately from the number of servers needed. Google places two VCUs on a board and then installs 10 cards per dual-socket Intel Xeon server, greatly increasing the company's decoding/transcoding performance per rack.
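The per-server totals implied by these figures can be worked out with simple arithmetic. The sketch below is illustrative only: the class and function names are hypothetical, and the peak memory bandwidth is a theoretical number derived from the LPDDR4-3200 channel configuration described above, not a figure published by Google.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VcuSpec:
    """Per-VCU figures as described in the article."""
    encoder_cores: int = 10
    memory_gb: int = 8
    memory_channels: int = 4
    channel_width_bits: int = 32
    transfer_rate_mts: int = 3200  # LPDDR4-3200: 3200 mega-transfers per second

    def peak_bandwidth_gbs(self) -> float:
        """Theoretical peak memory bandwidth in GB/s (derived, not measured)."""
        bytes_per_transfer = self.memory_channels * self.channel_width_bits // 8
        return bytes_per_transfer * self.transfer_rate_mts / 1000


VCUS_PER_CARD = 2      # two VCUs per board
CARDS_PER_SERVER = 10  # ten cards per dual-socket server


def server_totals(spec: VcuSpec) -> dict:
    """Aggregate VCU resources for one fully populated server."""
    vcus = VCUS_PER_CARD * CARDS_PER_SERVER
    return {
        "vcus": vcus,
        "encoder_cores": vcus * spec.encoder_cores,
        "vcu_memory_gb": vcus * spec.memory_gb,
    }


if __name__ == "__main__":
    spec = VcuSpec()
    print(server_totals(spec))
    print(f"Peak bandwidth per VCU: {spec.peak_bandwidth_gbs():.1f} GB/s")
```

Under these figures, one server holds 20 VCUs with 200 encoder cores and 160GB of VCU-attached memory, which is why the VCU count can grow without adding more Xeon hosts.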
Increasing Efficiency Leads to Migration from Xeon
Google says that its VCU-based machines have seen up to 7x (H.264) and up to 33x (VP9) improvements in performance/TCO compute efficiency compared to Intel Skylake-powered server systems. This improvement accounts for the cost of the VCUs (vs. Intel's CPUs) and three years of operational expenses, which makes VCUs an easy choice for video behemoth YouTube.
Offline Two-Pass Single Output (SOT) Throughput in CPU, GPU, and VCU-Equipped Systems
|System||H.264 Throughput (MPix/s)||VP9 Throughput (MPix/s)||H.264 Performance/TCO||VP9 Performance/TCO|
|4x Nvidia T4||2,484||-||1.5x||-|
|8x Google Argos VCUs||5,973||6,122||4.4x||20.8x|
|20x Google Argos VCUs||14,932||15,306||7x||33.3x|
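As a quick sanity check on the table, throughput can be divided by device count to see how it scales from 8 to 20 VCUs. The sketch below uses only the numbers in the table; the dictionary layout and function names are hypothetical.

```python
# Throughput figures (MPix/s) copied from the table above.
TABLE = {
    "4x Nvidia T4": {"devices": 4, "h264": 2484, "vp9": None},
    "8x Google Argos VCUs": {"devices": 8, "h264": 5973, "vp9": 6122},
    "20x Google Argos VCUs": {"devices": 20, "h264": 14932, "vp9": 15306},
}


def per_device(system: str, codec: str) -> float:
    """Throughput per accelerator in MPix/s for the given system and codec."""
    row = TABLE[system]
    if row[codec] is None:
        raise ValueError(f"No {codec} figure listed for {system}")
    return row[codec] / row["devices"]


def scaling_efficiency(codec: str) -> float:
    """Per-VCU throughput at 20 VCUs relative to 8 VCUs (1.0 = perfectly linear)."""
    return per_device("20x Google Argos VCUs", codec) / per_device(
        "8x Google Argos VCUs", codec
    )


if __name__ == "__main__":
    for codec in ("h264", "vp9"):
        print(codec, "per VCU at 8:", per_device("8x Google Argos VCUs", codec))
        print(codec, "per VCU at 20:", per_device("20x Google Argos VCUs", codec))
        print(codec, "scaling efficiency:", round(scaling_efficiency(codec), 4))
```

Per-VCU throughput is essentially identical in both configurations (roughly 747 MPix/s for H.264 and 765 MPix/s for VP9), so the listed results scale almost linearly with the number of VCUs in the server.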
From performance numbers shared by Google, it is evident that a single Argos VCU is barely faster than a 2-way Intel Skylake server in H.264. However, since 20 VCUs can be installed into such a server, the VCU wins from an efficiency perspective. But when it comes to the more demanding VP9 codec, Google's VCU appears to be five times faster than Intel's dual-socket Xeon and therefore offers impressive efficiency advantages.
Since Google has been using its Argos VCUs for several years now, it has clearly replaced many of its Xeon-based YouTube servers with machines running its own silicon. It is extremely hard to estimate how many Xeon systems Google actually replaced, but some analysts believe the technology giant could have swapped anywhere from four million to 33 million Intel CPUs for its own VCUs. Even if the second number is an overestimate, we are still talking about millions of units.
Since Google needs loads of processors for its other services, it is likely that the number of CPUs that the company buys from AMD or Intel is still very high and is not going to decrease any time soon as it will be years before Google's own datacenter-grade system-on-chips (SoCs) will be ready.
It is also noteworthy that to use newer codecs such as AV1 right now, Google still needs general-purpose CPUs even for YouTube, as the Argos does not support the codec. Furthermore, as more efficient codecs emerge (and these tend to be more demanding in terms of compute horsepower), Google will have to continue to use CPUs for initial deployments. Ironically, the advantage of dedicated hardware will only grow in the future.
Google is already working on its second-gen VCU that supports AV1, H.264, and VP9 codecs as it needs to further increase the efficiency of its encoding technologies. It is unclear when the new VCUs will be deployed, but it is clear that the company wants to use its own SoCs instead of general-purpose processors where possible.
Intel Isn't Standing Still
Intel isn't standing still, though. The company's DG1 Xe-LP-based quad-chip SG1 server card can decode up to 28 4Kp60 streams as well as transcode up to 12 simultaneous streams. Essentially, Intel's SG1 does exactly what Google's Argos VCU does: scale video decoding and transcoding performance separately from the server count and thus reduce the number of general-purpose processors required in a data center used for video applications.
With its upcoming single-tile Xe-HP GPU, Intel will offer transcoding of 10 high-quality 4Kp60 streams simultaneously. Keeping in mind that some Xe-HP GPUs will scale to four tiles, and that more than one GPU can be installed per system, Intel's market-leading media decoding and encoding capabilities will only become more solid.
Google has managed to build a remarkable H.264 and VP9-supporting video (trans)coding unit (VCU) that can offer significantly higher efficiency in video encoding/transcoding workloads than Intel's existing CPUs. Furthermore, VCUs enable Google to scale its video encoding/transcoding performance independently from the number of servers.
Yet, Intel already has its Xe-LP GPUs and SG1 cards that offer some serious video decoding and encoding capabilities, too, so Intel will still be successful in datacenters with heavy video streaming workloads. Furthermore, with the emergence of Intel's Xe-HP GPUs, the company promises to solidify its position in this market.
It's easy to hate on Intel but at least they'll sell to anyone.
The tech is amazing though. What process node and foundry are these being made on?
Have the most recent iterations of hardware encoders improved that much that they are on par or better than software based video encoding?
And is it true that the incredible performance of M1 chips in video encoding/decoding tasks is due to some implemented sophisticated hardware accelerators in the iGPU?
FPGA based SmartNICs are also being promoted for fixed operation streaming.
Seriously. Youtube is damned near unwatchable without an ad blocker now.