Xilinx One-Ups Intel With PCIe 4.0 Alveo U50 Data Center Card

A day after Intel launched its second-generation Programmable Acceleration Card (PAC) for the data center, Xilinx on Tuesday announced the new Alveo U50 accelerator card with PCIe 4.0 and HBM.

The low-profile Alveo U50 is powered by Xilinx’ 16nm UltraScale+ architecture, not the new Versal architecture that recently started shipping. The announcement comes on the heels of Intel's D5005, its first PAC with the 14nm Stratix 10 FPGA.

However, the Alveo card has some features that the Stratix 10 SX-based D5005 lacks. The U50 carries 8GB of HBM2, as opposed to the D5005's 32GB of DDR4, which gives the Xilinx card 460 GBps of memory bandwidth. Moreover, the U50 supports PCIe 4.0 and Arm's cache-coherent CCIX protocol, both firsts for accelerator cards, making it a good fit for AMD’s upcoming Rome CPUs.

With 100GbE (QSFP28) support, the networking functionality matches Intel’s. In its announcement, Xilinx said the port also supports “advanced applications like NVMe-oF solutions (NVM Express over Fabrics), disaggregated computational storage and specialized financial services applications.” Xilinx’ card also wins on power, with a TDP of 75W compared to Intel’s 215W.

Xilinx sees the card being used for machine learning inference, video transcoding, data analytics, computational storage, electronic trading and financial risk modeling. The company also shared several performance claims, including:

  • 25x lower latency and 10x higher throughput in speech translation compared to Tesla T4
  • 4x higher throughput and 3x lower operational cost compared to the 24-core Intel Xeon Platinum 8260
  • 20x higher compression/decompression throughput and 30% lower cost per node compared to the 22-core Intel Skylake-SP 6152

The Alveo U50 is not Xilinx’ first accelerator card based on the UltraScale+ architecture. The company launched the U200 and U250 in October. Both have networking (two ports each) and TDP specs that more closely match Intel's new offering.

The U50 is sampling now, with general availability to follow this fall.

Photo Credits: Xilinx

1 comment
  • bit_user
    Quote:
    25x lower latency and 10x higher throughput in speech translation compared to Tesla T4

    Nice try, but these benchmarks typically use a batch size of 1, which puts GPUs at an unrealistic disadvantage. I'm also curious if their benchmark used its Tensor cores and if they used any of their integer functionality.

    Using a realistic batch size, I'd be really surprised if it could beat the T4.