Move Over GPUs: Startup's Chip Claims to Do Deep Learning Inference Better

Credit: Habana

Habana Labs, a startup that came out of “stealth mode” this week, announced a custom chip that it says delivers much higher machine learning inference performance than GPUs.

Habana Goya Specifications

According to the startup, its Goya chip is designed from scratch for deep learning inference, unlike GPUs and other chips that have been repurposed for the task. The die comprises eight VLIW Tensor Processing Cores (TPCs), each with its own local memory as well as access to shared memory. External memory is accessed through a DDR4 interface. The processor supports the FP32, INT32, INT16, INT8, UINT32, UINT16 and UINT8 data types.

The Goya chip supports all the major machine learning frameworks, including TensorFlow, MXNet, Caffe2, Microsoft Cognitive Toolkit, PyTorch and the Open Neural Network Exchange (ONNX) format. After a trained neural network model is loaded, the software stack converts it to an internal representation optimized for the Goya architecture.

Models for vision, neural machine translation, sentiment analysis and recommender systems have been executed on the Goya chip, and Habana said that the processor should handle all sorts of inference workloads and application domains.

Goya Performance

Habana says the Goya chip has demonstrated ResNet-50 inference throughput of 15,000 images/second at a batch size of 10 with 1.3 ms latency, while consuming only 100 W. By comparison, Nvidia’s V100 GPU has been shown to achieve 2,657 images/second.

A dual-socket Xeon Platinum 8180 system achieved lower performance still: 1,225 images/second. According to Habana, at a batch size of one, the Goya chip can sustain 8,500 ResNet-50 images/second with 0.27 ms latency.
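A quick sanity check on the quoted batch-size-10 figures (a back-of-the-envelope sketch using only the numbers above): one batch in flight at 1.3 ms would yield roughly 7,700 images/second, so reaching 15,000 images/second implies about two batches overlapping in the pipeline at any time.

```python
# Back-of-the-envelope check on Habana's quoted ResNet-50 numbers.
batch_size = 10
latency_s = 1.3e-3           # 1.3 ms per batch
claimed_throughput = 15_000  # images/second

# Throughput if only one batch were in flight at a time.
single_stream = batch_size / latency_s  # roughly 7,692 images/s

# Implied concurrency: batches overlapping in the pipeline.
batches_in_flight = claimed_throughput / single_stream
print(f"{single_stream:.0f} images/s per stream, "
      f"~{batches_in_flight:.1f} batches in flight")
```

The numbers are therefore internally consistent, provided the chip pipelines around two batches concurrently.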

Credit: Habana
This level of inference performance is attributed to the chip’s architecture, mixed-format quantization, a proprietary graph compiler and software-based memory management.
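Habana has not published details of its quantization scheme, but the general idea behind one standard mixed-format technique, symmetric INT8 quantization, can be sketched in a few lines (all names here are illustrative, not Habana's API):

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: map floats to [-127, 127]
    using a single scale derived from the largest magnitude."""
    scale = max(abs(v) for v in values) / 127.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.003, 0.5]
q, s = quantize_int8(weights)
approx = dequantize(q, s)
# Each recovered value differs from the original by at most scale/2.
```

Running weights through INT8 instead of FP32 cuts memory traffic by 4x and lets the hardware use much denser integer multipliers, at the cost of a bounded rounding error per value.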

Habana also plans to reveal a deep learning training chip, called Gaudi, to pair with its Goya inference processor. Gaudi will use the same VLIW TPC core as Goya and will be software-compatible with it. The 16nm Gaudi chip is expected to start sampling in Q2 2019.

Comments from the forums
  • bit_user
    Something seems wrong with their benchmark, if a V100 only rates 2x as fast as Intel Xeon. I'm skeptical even 56 Xeon cores would be that fast.

    Anyhow, V100 is old news. Turing is yet 2x to 4x faster, still.
  • alextheblue
bit_user said:
    Something seems wrong with their benchmark, if a V100 only rates 2x as fast as Intel Xeon. I'm skeptical even 56 Xeon cores would be that fast.

    Anyhow, V100 is old news. Turing is yet 2x to 4x faster, still.

    I can't help but think this wouldn't match Nvidia's tensor cores if Nvidia built a chip that was basically one big 100W tensor block. But maybe I'm wrong. As of today, though, this Habana design does have the best performance in that power envelope. Of course, that's all just on paper.
  • WINTERLORD
    So my next motherboard will have one of these for real-time ray tracing? Kinda makes me wonder now what AMD will offer in the future.