Netflix has been serving up to 200 Gbps of TLS-encrypted video from a single server since 2020, and the company now aims to double that bandwidth to 400 Gbps. During his presentation at the EuroBSDCon 2021 conference (via HardwareLuxx), Andrew Gallatin, Senior Software Engineer at Netflix, detailed the challenges of pushing the bandwidth envelope on the company's FreeBSD-based servers.
Netflix turned to AMD's EPYC Rome processors to achieve its goal. The company equipped its server with the EPYC 7502P, which wields 32 Zen 2 cores with a 2.5 GHz base clock and 3.35 GHz boost clock. More importantly, the 32-core beast offers up to 128 PCIe 4.0 lanes, good for about 250 GBps of bandwidth or around 2 Tbps in networking units. Netflix paired the EPYC 7502P with 256GB of DDR4-3200 memory, with a total memory bandwidth of up to 150 GBps, or 1.2 Tbps in networking units.
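The conversions above from hardware specs to "networking units" are simple back-of-the-envelope arithmetic, which can be sketched as follows (the ~2 GBps-per-lane throughput figure for PCIe 4.0 is an approximation, and the memory figure is the one quoted in the talk):

```python
# Rough bandwidth arithmetic for the EPYC 7502P server described above.
# Assumption: each PCIe 4.0 lane moves roughly 2 GBps in one direction.
PCIE4_GBPS_PER_LANE = 2.0  # GB/s per lane, approximate

pcie_lanes = 128
pcie_gbytes = pcie_lanes * PCIE4_GBPS_PER_LANE  # ~256 GB/s, "about 250 GBps"
pcie_tbits = pcie_gbytes * 8 / 1000             # bytes -> bits, GB/s -> Tbps

mem_gbytes = 150                                # GB/s, figure from the talk
mem_tbits = mem_gbytes * 8 / 1000               # ~1.2 Tbps in networking units

print(f"PCIe:   ~{pcie_gbytes:.0f} GB/s = ~{pcie_tbits:.1f} Tbps")
print(f"Memory: ~{mem_gbytes} GB/s = ~{mem_tbits:.1f} Tbps")
```

The takeaway of the arithmetic: at 400 Gbps of output, the memory bus (~1.2 Tbps) is far closer to saturation than the PCIe fabric (~2 Tbps), which is why memory bandwidth dominates the discussion that follows.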
For storage, Netflix's AMD-powered server utilizes 18 Western Digital WD SN720 2TB NVMe SSDs. It's also equipped with a pair of Nvidia's Mellanox ConnectX-6 Dx network adapters that communicate through a PCIe 4.0 x16 interface. Initially, Netflix was only getting 240 Gbps out of the server, primarily due to memory bandwidth limitations.
Netflix experimented with different NUMA (Non-Uniform Memory Access) configurations to maximize the bandwidth. AMD's EPYC processors support configurable NUMA nodes per socket: 1, 2, or 4, with the specific processor dictating which modes are available. The EPYC 7502P, the SKU used in Netflix's server, supports all three NUMA modes. According to Gallatin's slides, a single NUMA node configuration delivers up to 240 Gbps, while a setup with four NUMA nodes bumps the figure up to 280 Gbps.
In an attempt to optimize the performance and avoid hardware bottlenecks, Netflix tested offloading the TLS encryption to the Mellanox ConnectX-6 Dx instead of the EPYC 7502P. With a bit of software tinkering and some firmware updates, Netflix managed to squeeze 190 Gbps out of each Mellanox ConnectX-6 Dx adapter, or 380 Gbps with two network adapters. Since the encryption no longer passes through the processor, the offload frees up compute resources and cuts memory bandwidth usage roughly in half. The results showed around 50% processor utilization with four NUMA nodes and around 60% without NUMA.
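The "cuts memory bandwidth by half" claim can be illustrated with a simplified per-byte traffic model (a hypothetical sketch for intuition, not Gallatin's exact accounting): counting each trip a served byte makes across the memory bus shows why encrypting on the NIC rather than the CPU halves the traffic.

```python
# Hypothetical per-byte memory-bus traffic model for serving encrypted video.
# Each step that moves the byte across the memory bus counts as one crossing.

# Software TLS on the CPU:
sw_crossings = (
    1    # SSD DMAs plaintext into RAM
    + 1  # CPU reads plaintext to encrypt it
    + 1  # CPU writes ciphertext back to RAM
    + 1  # NIC DMAs ciphertext out of RAM
)

# Inline TLS offload on the ConnectX-6 Dx:
nic_crossings = (
    1    # SSD DMAs plaintext into RAM
    + 1  # NIC DMAs plaintext out and encrypts on the adapter itself
)

print(sw_crossings, nic_crossings)  # 4 vs 2: memory traffic halves
```

Under this model, the encrypt-in-place round trip through the CPU is exactly the traffic that disappears, which is consistent with the halved memory bandwidth reported above.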
Netflix Server Configurations
| Component | AMD | Intel | Ampere |
| --- | --- | --- | --- |
| Processor | EPYC 7502P (Rome) | Xeon Platinum 8352V (Ice Lake) | Altra Q80-30 |
| Memory | 256GB DDR4-3200 | 256GB DDR4-3200 @ DDR4-2933 | 256GB DDR4-3200 |
| Storage | 18 x WD SN720 | 20 x Kioxia 4TB NVMe (PCIe 4.0) | 16 x WD SN720 |
| Network Adapters | 2 x Mellanox ConnectX-6 Dx | 2 x Mellanox ConnectX-6 Dx | 2 x Mellanox ConnectX-6 Dx |
Netflix evaluated other processor options from Intel and Ampere, but AMD was clearly the superior option. For example, the EPYC 7502P offered 280 Gbps, while the Xeon Platinum 8352V (Ice Lake) and Altra Q80-30 delivered 230 Gbps and 180 Gbps, respectively.
Memory was the bottleneck on the Intel system, since the Xeon Platinum 8352V natively supports DDR4-2933 as opposed to the EPYC 7502P's DDR4-3200. Gallatin expects performance similar to the EPYC 7502P's if the Ice Lake chip were paired with equivalent memory. While Ampere's Altra Q80-30 does support DDR4-3200 memory, the chip is still limited to 180 Gbps.
Nevertheless, the Altra Q80-30 was the closest competitor to the EPYC 7502P with the TLS offload. The system offered 240 Gbps, but Gallatin noted low processor utilization and many output drops, which could be a PCIe-specific problem. After enabling extended tags, the Altra Q80-30 system pumped out 320 Gbps, just 60 Gbps lower than the EPYC 7502P system. Apparently, the Xeon Platinum 8352V system had the PCIe relaxed ordering option locked out, so Gallatin wasn't able to assess the performance of the network adapter.
While 400 Gbps is already impressive, Netflix has an 800 Gbps prototype in testing. Gallatin didn't share its specifications, but he hinted that we may hear more about it next year.