Intel demonstrates PyTorch AI optimizations for accelerating large language models on its Arc Alchemist GPUs
Run Llama 2 and other LLMs on your Arc GPU
Intel's Arc Alchemist GPUs can run large language models like Llama 2, thanks to the company's PyTorch extension, as demoed in a recent blog post. The Intel PyTorch Extension, which works on both Windows and Linux, allows LLMs to take advantage of the FP16 performance on Arc GPUs. However, since Intel says you'll need 14GB of VRAM to run Llama 2 on its hardware, you'll probably want an Arc A770 16GB card.
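Intel's post doesn't reproduce its code, but the general pattern with the extension looks roughly like the sketch below. It assumes the publicly available meta-llama/Llama-2-7b-chat-hf checkpoint on Hugging Face and the standard transformers library; this is an illustration of the approach, not Intel's exact demo script.

```python
# Sketch: load a Llama 2 checkpoint, move it to the Arc GPU ("xpu" device),
# and let Intel's PyTorch extension apply its FP16 optimizations.
import torch
import intel_extension_for_pytorch as ipex  # Intel's PyTorch extension
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint; gated behind Meta's license
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

model = model.to("xpu")                             # Arc GPUs are exposed as the "xpu" device
model = ipex.optimize(model, dtype=torch.float16)   # apply the extension's FP16 optimizations
```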
PyTorch is an open-source machine learning framework, originally developed by Meta, that is widely used to build and run LLMs. While it works out of the box, it isn't written by default to take full advantage of every piece of hardware, which is why Intel maintains its own PyTorch extension. The extension is designed to exploit the XMX cores inside Arc GPUs and saw its first release in January 2023. AMD and Nvidia similarly maintain their own PyTorch optimizations for their hardware.
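As a quick sanity check (not something shown in Intel's post), importing the extension registers Arc GPUs with PyTorch as the "xpu" device, which can then be queried much like a CUDA device:

```python
# Verify that PyTorch can see the Arc GPU through Intel's extension.
import torch
import intel_extension_for_pytorch as ipex  # importing this registers the "xpu" backend

if torch.xpu.is_available():
    print(torch.xpu.get_device_name(0))  # e.g. an Arc A770 on a machine with one installed
else:
    print("No XPU device found - check the Arc driver and the extension install")
```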
In its blog post, Intel demonstrates the performance capabilities of the Arc A770 16GB running Llama 2 using the latest update to its PyTorch extension, which came out in December and specifically optimized FP16 performance. FP16, or half-precision floating-point data, trades precision for performance, which is often a good tradeoff for AI workloads.
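The tradeoff is easy to see in plain PyTorch, independent of any Intel-specific code: casting to FP16 halves the memory per value at the cost of some rounding error.

```python
# FP16 vs FP32: half the memory per element, lower precision.
import torch

x32 = torch.randn(1024, 1024)   # FP32: 4 bytes per element
x16 = x32.half()                # FP16: 2 bytes per element

print(x32.element_size(), x16.element_size())   # 4 vs 2 bytes
print((x32 - x16.float()).abs().max())          # small rounding error introduced by the cast
```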
The demo shows Llama 2 and the dialogue-tuned Llama 2-Chat models being asked questions like "can deep learning have such generalization ability like humans do?" In response, the LLM was surprisingly humble, saying deep learning isn't on the same level as human intelligence. However, running LLMs like Llama 2 at FP16 precision requires 14GB of VRAM according to Intel, and we didn't get any numbers on how quickly the model responded to inputs and queries.
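That 14GB figure lines up with a back-of-the-envelope estimate, assuming (the article body doesn't say) the 7-billion-parameter Llama 2 variant: at two bytes per FP16 weight, the weights alone take roughly 14GB before counting activations or the KV cache.

```python
# Rough check of Intel's 14GB figure, assuming the 7B-parameter Llama 2 variant at FP16.
params = 7e9                 # assumed parameter count
bytes_per_param = 2          # FP16 = 2 bytes per weight
print(params * bytes_per_param / 1e9)   # ~14.0 GB for the weights alone
```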
While this demo only showcases FP16 performance, Arc Alchemist also has BF16, INT8, INT4, and INT2 capabilities. Of these other data formats, BF16 is of particular note, as it's often considered even better suited to AI workloads thanks to its wider numerical range: its eight exponent bits match FP32's, whereas FP16 has only five. Optimizing BF16 performance could be high on Intel's list for its next PyTorch extension update.
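The range difference is visible directly in PyTorch's format metadata; this is generic PyTorch, not anything Arc-specific.

```python
# BF16 keeps FP32's eight exponent bits, so its representable range is far wider than FP16's.
import torch

print(torch.finfo(torch.float16).max)    # 65504.0 - FP16's five exponent bits cap its range
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38 - effectively the same range as FP32
print(torch.finfo(torch.float32).max)    # ~3.40e38, for comparison
```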
Matthew Connatser is a freelance writer for Tom's Hardware US. He writes articles about CPUs, GPUs, SSDs, and computers in general.
H4UnT3R
So which model did they use? 13B? Edit - ok, 7B... And tokens per second? I was running a quantized 13B model on a Quadro P5000 at ~16 t/s, so it looks like Intel is still far behind CUDA...
CmdrShepard
I don't think CUDA has anything to do with that; you simply have a faster GPU, even if it's a bit long in the tooth.
Perhaps they can optimize it further, but a third player in the GPU landscape is sorely needed, so I hope they succeed.
H4UnT3R
Me too. I was thinking about Arc because of the price, and about AMD too, but they don't have tools as good and easy to use as Nvidia's, plus the CUDA architecture is roughly 2.5x faster at the same core count than either of the other two manufacturers. Intel has made some oneAPI attempts, and I've been to a few workshops, but it's still far behind Nvidia's tools and developer support.