Breakthrough DL Training Algorithm on Intel Xeon CPU System Outperforms Volta GPU By 3.5x

(Image credit: Intel)

Updated 11:00am PT: Corrected the article to reflect that the tests were conducted with a single V100 GPU. 

Original Article:

Computer scientists from Rice University, in collaboration with Intel Labs, have announced a breakthrough deep learning algorithm – called SLIDE – that trains AI models faster on CPUs than traditional algorithms do on GPUs. For some types of computation, this effectively hands the performance crown for training to CPUs.

In particular, the researchers benchmarked a system with 44 “Xeon-class cores” against a $100,000 system with eight Nvidia Volta V100 GPUs with tensor cores, although they only used one V100 for the tests. The Xeon system completed the task in one hour using SLIDE, compared to 3.5 hours for a single Volta V100 running a TensorFlow implementation. The researchers also noted that the algorithm has room for further optimization, since it is competing against a mature hardware and software platform. For example, it did not yet use Intel's DLBoost acceleration.

The algorithm is based on hashing instead of matrix multiplication-based back-propagation.

SLIDE: A new algorithm for DL training

Since deep learning applications gained momentum in the last several years, Nvidia GPUs have been considered the gold standard for training models – although the trained models themselves often run on CPUs when deployed, a step called inference. Nevertheless, specialized hardware from a number of parties and startups has gone into production. Nvidia, for its part, added specialized tensor cores in the 2017 Volta architecture.

GPUs are favored over CPUs because frameworks such as TensorFlow rely heavily on matrix multiplications, in particular for the deep neural network training technique called back-propagation. This workload is well-suited to GPUs because their high core count lets them perform many of these calculations in parallel. Nvidia’s data center business grew 41% last quarter to almost $1 billion in revenue.
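As a toy illustration (not TensorFlow's or SLIDE's code), the forward pass of a single fully connected layer is just a matrix-vector product; every output neuron's dot product is independent, which is exactly the kind of work a GPU's thousands of cores can chew through in parallel:

#include <cstddef>
#include <vector>

// Toy forward pass of one fully connected layer: out = W * input + bias.
// Each of the out_dim dot products is independent, which is why this kind of
// work maps so well onto a GPU's many parallel cores.
std::vector<float> dense_forward(const std::vector<std::vector<float>>& W,  // out_dim x in_dim
                                 const std::vector<float>& input,           // in_dim
                                 const std::vector<float>& bias) {          // out_dim
    std::vector<float> out(W.size());
    for (std::size_t i = 0; i < W.size(); ++i) {          // one output neuron per row
        float acc = bias[i];
        for (std::size_t j = 0; j < input.size(); ++j)
            acc += W[i][j] * input[j];
        out[i] = acc;
    }
    return out;
}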

This is where Rice's new algorithm comes in, called the sub-linear deep learning engine, or SLIDE. It runs on standard processors without acceleration hardware, and it can outperform GPUs “on industry-scale recommendation datasets with large fully connected architectures,” said Anshumali Shrivastava, an assistant professor in Rice’s Brown School of Engineering who invented SLIDE with graduate students Beidi Chen and Tharun Medini.

Instead of back-propagation, it takes another approach using a technique called hashing that turns neural network training into a search problem – solved with hash tables.

In general, hashing directly maps some input to some output. This mapping is typically done with a relatively simple function, such as the modulo operation. It effectively creates an index of the inputs, called a hash table. The table can be searched very quickly because the hash function (for example, modulo with the number of table entries as the modulus) encodes in which entry of the table the input is located.
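A minimal sketch of such a modulo-based index in C++ (a generic illustration, not the researchers' code; the bucket count and integer keys are assumptions for the example):

#include <cstddef>
#include <vector>

// Minimal modulo-based hash index (generic illustration, not SLIDE's code).
// A key is mapped straight to a bucket, so a lookup is one array access plus
// a short scan of that bucket, rather than a search over all stored entries.
struct ToyHashTable {
    std::size_t num_buckets;
    std::vector<std::vector<int>> buckets;

    explicit ToyHashTable(std::size_t n) : num_buckets(n), buckets(n) {}

    // The hash function: modulo, with the number of buckets as the modulus.
    std::size_t hash(int key) const {
        return static_cast<std::size_t>(key) % num_buckets;
    }

    void insert(int key) { buckets[hash(key)].push_back(key); }

    const std::vector<int>& lookup(int key) const { return buckets[hash(key)]; }
};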

Rice explained the reason for using hashing by referring to the neurons that are actually trained. In simple terms, the output neuron(s) of the neural network will – for example in image recognition – encode what is being recognized in the image. In self-driving cars, this might be features on the road. Full neural networks contain many (layers of) neurons, which is why they are so compute-intensive. This has created opportunities for optimizations, as not all neurons will contribute critically to the output in every scenario:

“You don’t need to train all the neurons on every case,” Medini said. “We thought, ‘If we only want to pick the neurons that are relevant, then it’s a search problem.’ So, algorithmically, the idea was to use locality-sensitive hashing to get away from matrix multiplication.”
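One common locality-sensitive hash family is signed random projections (SimHash), in which vectors pointing in similar directions tend to land in the same bucket. The sketch below is a generic illustration of that idea, not necessarily the exact hash family SLIDE uses:

#include <cstddef>
#include <random>
#include <vector>

// Signed random projections (SimHash): vectors that point in similar
// directions tend to fall into the same bucket. A generic LSH sketch, not
// necessarily the exact hash family SLIDE uses.
struct SimHash {
    std::vector<std::vector<float>> planes;  // K random hyperplanes

    SimHash(std::size_t K, std::size_t dim, unsigned seed = 42)
        : planes(K, std::vector<float>(dim)) {
        std::mt19937 gen(seed);
        std::normal_distribution<float> gauss(0.0f, 1.0f);
        for (auto& plane : planes)
            for (auto& v : plane) v = gauss(gen);
    }

    // K sign bits, packed into an integer bucket id.
    std::size_t hash(const std::vector<float>& x) const {
        std::size_t code = 0;
        for (const auto& plane : planes) {
            float dot = 0.0f;
            for (std::size_t j = 0; j < x.size(); ++j) dot += plane[j] * x[j];
            code = (code << 1) | (dot >= 0.0f ? 1u : 0u);
        }
        return code;
    }
};

In a scheme like this, each neuron's weight vector would be inserted into a table under its code; at training time, hashing a layer's input returns only the neurons whose buckets collide with it, and only those are activated and updated.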

To get away from the matrix multiplications and implement the hashing, the researchers noted that they wrote their algorithm from scratch in C++ instead of using popular frameworks such as TensorFlow. This design likely makes it unsuitable for GPUs.

A key characteristic of SLIDE, the researchers further say, is that it is data parallel. By this they mean that SLIDE can train on all output features (such as all roadway features) simultaneously. “This is a much better utilization of parallelism for CPUs,” one of the researchers said.
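A rough sketch of what data-parallel training on a multi-core CPU can look like, using placeholder types rather than SLIDE's real API (compile with OpenMP enabled, e.g. -fopenmp):

#include <cstddef>
#include <vector>

// Rough sketch of data-parallel CPU training (placeholder types, not SLIDE's
// real API). Independent training examples are processed concurrently, and
// each thread touches only the few neurons its (hash-based) selection returns.
struct Example { std::vector<float> features; int label; };

struct ToyNetwork {
    std::vector<std::vector<float>> weights;  // one weight vector per output neuron

    // Placeholder: in SLIDE this would be a lookup into the LSH tables.
    std::vector<std::size_t> select_active_neurons(const Example& ex) const {
        return { static_cast<std::size_t>(ex.label) % weights.size() };
    }

    void update(const std::vector<std::size_t>& active, const Example& ex, float lr) {
        for (std::size_t n : active)
            for (std::size_t j = 0; j < ex.features.size() && j < weights[n].size(); ++j)
                weights[n][j] += lr * ex.features[j];  // stand-in for a real gradient step
    }
};

void train_batch(ToyNetwork& net, const std::vector<Example>& batch, float lr) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(batch.size()); ++i) {
        auto active = net.select_active_neurons(batch[i]);
        net.update(active, batch[i], lr);  // different examples rarely touch the same neurons (a simplification)
    }
}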

Performance

Nevertheless, the code had some performance issues. The Rice University researchers published their initial results and code in March 2019, and were contacted by Intel Labs shortly after. Intel, like the researchers, had noted that the code resulted in many cache misses, meaning the required data is not found in the CPU cache, which results in a performance hit.

“The flipside, compared to GPU, is that we require a big memory. There is a cache hierarchy in main memory, and if you’re not careful with it you can run into a problem called cache thrashing, where you get a lot of cache misses. Our collaborators from Intel recognized the caching problem. They told us they could work with us to make it train even faster, and they were right. Our results improved by about 50% with their help.”
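As a generic illustration of why access patterns matter so much (not the specific fix Intel's engineers made to SLIDE), consider summing a large row-major matrix: traversing it column by column misses the cache on almost every access, while traversing it row by row streams through contiguous memory.

#include <cstddef>
#include <vector>

// Generic illustration of cache behavior, not the specific fix made to SLIDE.
// The matrix is stored row-major, so the column-by-column loop jumps 'cols'
// floats between consecutive reads and misses the cache on almost every
// access, while the row-by-row loop streams through contiguous memory.
float sum_column_by_column(const std::vector<float>& m, std::size_t rows, std::size_t cols) {
    float s = 0.0f;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            s += m[i * cols + j];  // stride of 'cols' floats: cache-unfriendly
    return s;
}

float sum_row_by_row(const std::vector<float>& m, std::size_t rows, std::size_t cols) {
    float s = 0.0f;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            s += m[i * cols + j];  // unit stride: cache-friendly
    return s;
}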

Comparing a GPU system to a CPU system in a benchmark, the 44-core Xeon system outperformed the V100 by 3.5x:

“[I]n our test case we took a workload that’s perfect for V100, one with more than 100 million parameters in large, fully connected networks that fit in GPU memory. We trained it with the best (software) package out there, Google’s TensorFlow, and it took 3 1/2 hours to train. We then showed that our new algorithm can do the training in one hour, not on GPUs but on a 44-core Xeon-class CPU,” Shrivastava said.

Intel does not have any 44-core CPUs, so the researcher is likely referring either to a 22-core CPU with 44 threads via Hyper-Threading or, perhaps more likely, to a 2P system with two 22-core Xeons. Either way, the performance advantage of the SLIDE algorithm is very large and suggests that it may pave the way for a resurgence in CPU training if it can be commercialized.

Optimizations

The researchers say there are further performance improvements left, as they have “just scratched the surface”. They note that they have not yet used vectorization – such as AVX SIMD instructions – or Intel's DLBoost acceleration, and claim “there are a lot of other tricks we could still use to make this even faster.”
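As a hedged example of the kind of vectorization they mean (a generic illustration, not code from SLIDE), an AVX2/FMA dot product processes eight floats per instruction instead of one; it assumes the length is a multiple of 8 and compilation with -mavx2 -mfma:

#include <cstddef>
#include <immintrin.h>

// AVX2/FMA dot product (generic illustration, not code from SLIDE).
// Assumes n is a multiple of 8; compile with -mavx2 -mfma.
float dot_avx2(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb, 8 lanes at once
    }
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);             // horizontal sum of the 8 lanes
    float sum = 0.0f;
    for (float v : lanes) sum += v;
    return sum;
}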

Intel DLBoost, with int8 instructions, is currently geared towards deep learning inference, but the upcoming Cooper Lake CPUs will gain support for bfloat16. Intel has claimed Cooper Lake will improve training performance by 60%, although that figure obviously refers to conventional training.

Changing the field

From a broader perspective, the researchers note that SLIDE shows that there are other ways to implement deep learning.

“The whole message is, ‘Let’s not be bottlenecked by matrix multiplication and GPU memory,'” Chen said. “Ours may be the first algorithmic approach to beat GPU, but I hope it’s not the last. The field needs new ideas, and that is a big part of what MLSys is about.”

The researchers have not talked about any plans or prospects for commercial adoption of their algorithm, but given Intel's early involvement, it is likely Intel will explore those possibilities. Since Raja Koduri joined Intel, the company has become increasingly vocal about the importance of software for unlocking gains from hardware, as seen in investments such as oneAPI.

The paper (PDF) was presented at the MLSys conference in Austin.

  • TechLurker
    Now they just need to adapt it to AMD EPYC CPUs.
  • AnimeMania
    TechLurker said:
    Now they just need to adapt it to AMD EPYC CPUs.
    I was thinking the same thing, more cores and larger caches.
  • JayNor
    from the paper ... all the info on the CPUs/Threads is there:
    "All the experiments are conducted on a server equipped with two 22-core/44-thread processors (Intel Xeon E5-2699A v4 2.40GHz) and one NVIDIA TeslaV100 Volta 32GB GPU. "

    "As mentioned before, ourmachine has 44 cores, and each core can have 2 threads.However, we disable multithreading and the effective number of threads and cores is the same. Hence, we interchangeably use the words “threads” and “cores” from here on. We benchmark both frameworks with 2, 4, 8, 16, 32, 44 threads."
  • Gomez Addams
    What I think would be interesting is if they were to adapt their algorithm to GPUs. Then it could be even more parallel.

    Yes, I read they need a large memory and that could be prohibitive since GPUs max out at 32GB. Maybe Nvidia will up the ante at the end of this month and announce a GPU with 64GB of RAM, or more.
  • JayNor
    The code is on github, from the link in the paper. The build options enable avx512.

    https://github.com/keroro824/HashingDeepLearning/tree/master/SLIDE
  • ajlogo
    I'm sorry but "Tom" has been played by Intel "research"


    "Updated 11:00am PT: Corrected the article to reflect that the tests were conducted with a single V100 GPU. "
    The original claim was with eight, good catch.

    "The researchers have not talked about any plans or prospects of their algorithm for commercial adoption"

    " this effectively moves the performance crown of fastest chip for training to CPUs."
  • alextheblue
    ajlogo said:
    I'm sorry but "Tom" has been played by Intel "research"


    "Updated 11:00am PT: Corrected the article to reflect that the tests were conducted with a single V100 GPU. "
    The original claim was with eight, good catch.

    "The researchers have not talked about any plans or prospects of their algorithm for commercial adoption"

    " this effectively moves the performance crown of fastest chip for training to CPUs."
    I mean, it's still pretty impressive if two 22 core CPUs are faster than two V100s.
  • zikitariam
    Tensorflow although commercialized by google is one of the slowest, if not the slowest deep learning libraries that exists. So this article is inaccurate and misleading in saying "3.5 times slower than gpu training". Also a 48 core cpu system might not be a single machine. These tests and comparisons need to benchmark fairly using all deep learning libraries (especially pytorch and chainer).