Intel Habana Gaudi Beats Nvidia's H100 in Visual-Language AI Models: Hugging Face

Gaudi 2
(Image credit: Intel)

A new fine-tuning performance benchmark for BridgeTower, a Vision-Language (VL) AI model, has shown that there's life in the AI acceleration camp beyond Nvidia's green. While Nvidia does dominate the AI acceleration market (through exceptional foresight, a well-thought-out and documented software stack, and pure processing performance), other players are keen to take a piece of the AI market for themselves. And at least for BridgeTower, Intel's Gaudi 2 silicon (designed by Habana Labs, which Intel acquired for roughly $2 billion in 2019) has been shown by Hugging Face to outperform Nvidia's A100 80 GB by a staggering 2.5x - and even to beat Nvidia's prodigy-child H100 by 1.4x.

Vision-Language

Vision-Language (VL) refers to AI models that can process and associate information across the modalities of language and visual representation. VL models are perhaps most commonly associated with image generation: OpenAI's CLIP, for instance, underpins text-to-image systems such as Stable Diffusion XL - a fast-growing market that's mostly being led by Midjourney, Stable Diffusion, and now Ideogram.
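To make that concrete, here's a minimal sketch of a VL model at work, using the publicly available openai/clip-vit-base-patch32 checkpoint through Hugging Face's transformers library (the image path and candidate captions are purely illustrative):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP embeds images and text into a shared space, letting it score how well
# each caption matches the image - the essence of a Vision-Language model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # illustrative local file
captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))  # probability of each caption matching
```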

According to Habana, the momentous speedups are the result of a hardware-accelerated data-loading system - one of the main bottlenecks in AI model fine-tuning, and especially so for VL models. Loading a workload into memory is often a performance bottleneck wherever computing happens, so it's not out of left field that Habana would look to optimize this particular step of the training process.

The main bottleneck relates to how CPUs get hammered with many costly operations such as image decoding and image augmentation (a similar issue to the GPU draw-call debate), which leads the HPU (or Nvidia GPU) to stall while waiting for further data to be processed by the CPU and then sent over to the AI accelerator of choice. This is how the process goes without any hardware acceleration (a minimal sketch of this CPU-bound loop follows the list):

  • Fetch data (e.g. where your JPEG images are stored on disk)
  • The CPU reads encoded images
  • The CPU decodes images
  • The CPU applies image transformations to augment images
  • Images are sent to devices (although this is usually not done by the dataloader itself)
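
In code, that conventional pipeline looks roughly like the PyTorch sketch below. The file paths, batch size, transforms, and device string are illustrative - this is a generic sketch of CPU-side loading, not the benchmark's actual code:

```python
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

# Every step in this transform chain runs on CPU worker processes
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

class JpegDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths  # list of JPEG file paths on disk

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")  # CPU reads and decodes
        return augment(img)                               # CPU augments

paths = ["img_0.jpg", "img_1.jpg"]  # illustrative
loader = DataLoader(JpegDataset(paths), batch_size=64, num_workers=2)

for batch in loader:
    batch = batch.to("cuda")  # or "hpu": fully decoded pixels are shipped to the device
    ...                       # the accelerator stalls whenever the CPU falls behind
```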

And this is the process with Gaudi 2's integrated hardware acceleration, which offloads image decoding and transformation to the accelerator itself:

  • Fetch data
  • The CPU reads encoded images
  • Encoded images are sent to devices
  • Devices decode images
  • Devices apply image transformations to augment images

With the hardware-accelerated method, it becomes clear that the CPU is leveraged far less (freeing up CPU cycles for other tasks within the main fine-tuning process), which should translate into improved throughput.
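
A rough sketch of the accelerated flow is shown below: the dataloader only hands over raw, still-encoded JPEG bytes, and decoding plus augmentation are deferred to the accelerator. On Gaudi 2 that hand-off is done by Habana's media pipeline (exposed through Optimum Habana); the loop body here is just a placeholder showing where it slots in, not Habana's actual API:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class EncodedJpegDataset(Dataset):
    """Yields raw, still-encoded JPEG bytes - no CPU-side decode or augmentation."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:
            # A uint8 tensor of the encoded bitstream - far smaller than decoded pixels
            return torch.frombuffer(bytearray(f.read()), dtype=torch.uint8)

paths = ["img_0.jpg", "img_1.jpg"]  # illustrative
loader = DataLoader(EncodedJpegDataset(paths), batch_size=2,
                    collate_fn=lambda batch: batch)  # keep variable-length byte tensors as a list

for encoded_batch in loader:
    # This is where the accelerator takes over: on Gaudi 2, Habana's media
    # pipeline decodes and augments on the HPU; on CUDA hardware, the closest
    # analogue is nvJPEG-backed decoding.
    ...
```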

Benchmarking Habana's Gaudi 2 by fine-tuning a pre-trained BridgeTower checkpoint with 866M parameters allows us to see the performance gains that hardware-accelerated image loading brings to the table. The workloads were run distributed across 8 devices of each type (Nvidia's A100 80 GB, Nvidia's H100, and Gaudi 2). Results were measured across runs with an increasing number of CPU processes fully dedicated to loading data into memory: the first run loads data within the main training process (dataloader_num_workers=0), while the second and third spawn one and two dedicated loader processes, respectively; a fourth, Gaudi-only run additionally enables Habana's hardware media pipeline (mediapipe_dataloader).
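
Those dedicated loader processes map to the dataloader_num_workers knob in the table below. Here is a minimal sketch of how it would be set through Hugging Face's Trainer API - optimum-habana's GaudiTrainingArguments fills the same role on HPU, and the output path and batch size are illustrative, not the benchmark's exact values:

```python
from transformers import TrainingArguments  # GaudiTrainingArguments (optimum-habana) is the HPU counterpart

args = TrainingArguments(
    output_dir="./bridgetower-finetune",  # illustrative path
    per_device_train_batch_size=64,       # illustrative value
    dataloader_num_workers=2,             # 0 = load data in the main training process,
                                          # N > 0 = spawn N dedicated data-loading processes
)
```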

Dataloading performance across Gaudi 2, Nvidia A100, and Nvidia H100, expressed in samples per second:

Device         | dataloader_num_workers=0 | dataloader_num_workers=1 | dataloader_num_workers=2 | dataloader_num_workers=2 + mediapipe_dataloader
Gaudi 2 HPU    | 601.5 | 747.4 | 768.7 | 847.7
H100 GPU       | 336.5 | 580.1 | 602.1 | N/A
A100 80 GB GPU | 227.5 | 339.7 | 345.4 | N/A

The results are clear: the best-case scenario for Gaudi 2 relative to the competition is the first, where data is loaded within the main training process, with Gaudi 2 besting even Nvidia's H100 by 1.79x, and the A100 by 2.64x. But this is a non-optimized scenario, as Habana itself admits; so perhaps the most revealing results come from the third data point, where two additional processes are spawned to handle data loading outside of the main fine-tuning process. There, Nvidia's products still have to squint to catch Gaudi 2's dust-cloud as it runs into the distance: Gaudi 2 delivers a 1.3x performance advantage over Nvidia's cream-of-the-crop H100, and a 2.23x advantage over the A100 80 GB.

It would be possible to spawn additional processes to handle data loading; but as can be seen from the performance progression, that strategy brings increasingly diminishing returns. On the Nvidia H100, for instance, performance improves by 1.72x when spawning a single dedicated data-loading process, but going from one process to two only brings a further ~4% improvement. Because Habana can pull most data-loading steps onto Gaudi 2 itself, however, the company unlocks an additional 10% performance improvement over its own best CPU-fed score (where data loading and transformations are handled by two dedicated CPU processes).
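
The figures in the last two paragraphs fall straight out of the throughput table; a quick sanity check, with the numbers copied from the table above:

```python
# Samples per second, copied from the table above
gaudi2 = {0: 601.5, 1: 747.4, 2: 768.7, "2+mediapipe": 847.7}
h100   = {0: 336.5, 1: 580.1, 2: 602.1}
a100   = {0: 227.5, 1: 339.7, 2: 345.4}

print(gaudi2[0] / h100[0])                # ~1.79x vs H100, no dedicated loader processes
print(gaudi2[0] / a100[0])                # ~2.64x vs A100, no dedicated loader processes
print(gaudi2[2] / h100[2])                # ~1.28x vs H100, two loader processes
print(gaudi2[2] / a100[2])                # ~2.23x vs A100, two loader processes
print(h100[1] / h100[0])                  # ~1.72x: H100 gain from one loader process
print(h100[2] / h100[1])                  # ~1.04x: one-to-two processes adds only ~4%
print(gaudi2["2+mediapipe"] / gaudi2[2])  # ~1.10x: Gaudi 2's extra ~10% from its media pipeline
```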

There's still a long way to go before any company can claim hegemony in the AI-acceleration space. Nvidia has an incredible product and software stack that has allowed it to gain the first-mover advantage; but we've seen enough races where the underdogs catch up to (and sometimes even surpass) the favorites to know that Intel, AMD and others are all looking to steal Nvidia's thunder.

Francisco Pires
Freelance News Writer

Francisco Pires is a freelance news writer for Tom's Hardware with a soft side for quantum computing.

  • bit_user
    Ugh. Okay, here we go...

    This immediately struck me as weird, because Nvidia added hardware acceleration for JPEG decoding. It's described specifically in reference to the A100, here:
    https://developer.nvidia.com/blog/leveraging-hardware-jpeg-decoder-and-nvjpeg-on-a100/
    There, they compare it to CPU-accelerated and GPU (software)-accelerated decoding:

    As that implies, there are two options for GPU-accelerated decoding:
      • Use generic CUDA code/cores to accelerate the parallel portions of decoding (e.g. dequantization, IDCT, resampling, and colorspace transform).
      • Use the NVJPEG engine, newly added to the A100.
    According to this diagram, the NVJPEG engine even handles Huffman decoding:

    That diagram assumes you want the final image on the CPU, but elsewhere in the page they state the library has the capability for:
    * Input to the library is in the host memory, and the output is in the GPU memory.
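    For anyone curious what that GPU path looks like in practice, here's a rough sketch using Nvidia's DALI library - its "mixed" decoder routes through nvJPEG (and the A100's hardware engine where available). To be clear, this is my own illustration with a placeholder image directory, not anything taken from Hugging Face's repo:

    ```python
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn
    import nvidia.dali.types as types

    @pipeline_def(batch_size=64, num_threads=4, device_id=0)
    def gpu_decode_pipeline(image_dir):
        # The CPU only reads the still-encoded JPEG bytes from disk
        jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True)
        # "mixed" = bitstream parsing on CPU, heavy decoding on the GPU via nvJPEG
        # (and the dedicated hardware decoder on A100); output lives in GPU memory
        images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
        # Augmentation also stays on the GPU
        images = fn.resize(images, resize_x=224, resize_y=224)
        return images, labels

    pipe = gpu_decode_pipeline(image_dir="/path/to/jpegs")  # illustrative path
    pipe.build()
    images_gpu, labels = pipe.run()  # images_gpu is already resident in GPU memory
    ```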
    So, what are we to make of Hugging Face's data? I checked the blog entry cited by the article, and they make absolutely no mention of nvJPEG. I went one step further and searched the linked git repo, also finding no reference to nvJPEG. I wouldn't say that's conclusive, because I don't know enough about how all of its dependencies are provided or exactly where you'd expect to see nvJPEG show up, if it's indeed capable of being used. However, I think I've done enough digging that questions should be raised and answered.

    If their blog post were instead an academic paper, you'd absolutely expect them to mention nvJPEG and either demonstrate that it's being used or explain why not. If they were comparing against nvJPEG, then you'd expect them to point out how superior Habana's solution is to even Nvidia's purpose-built hardware engine. As it stands, this smells fishy. Either the study's authors are not truly disinterested in the outcome, or they are somewhat surprisingly ignorant of and incurious about Nvidia's solution to this problem. Given that they correctly pointed out that it's a bottleneck, it'd be awfully surprising for Nvidia not to have taken notice or done anything to effectively alleviate it.

    Another thought I had is that I don't know how heavyweight that Hugging Face model is. If I were looking to accentuate a bottleneck in JPEG decoding, I'd use a relatively lightweight model that caters well to the other strengths of Habana's hardware. In other words, even if their experiment is properly conducted, their findings might not be applicable to many other models people are using, not to mention newer chips like the H100.
    Reply