Intel recently published some Xeon Phi benchmarks claiming that its “Many Integrated Core” Phi architecture, based on small Atom CPUs rather than GPUs, delivers significantly better efficiency and performance than GPUs for deep learning. Nvidia has taken issue with this claim and published a post detailing the many reasons it believes Intel’s results are deeply flawed.
GPUs Vs. Everything Else
Whether or not they are the absolute best tool for the task, there’s little debate that GPUs are currently the mainstream way to train deep learning neural networks. That’s because training neural networks relies on massively parallel, lower-precision computation (as low as 8-bit) rather than the high-precision computation CPUs are generally built for. Whether GPUs will one day be replaced by more efficient alternatives for most customers remains to be seen.
Nvidia has not only kept optimizing its GPUs for machine learning over the past few years, but it has also invested significant resources in software that makes it easy for developers to train their neural networks. That’s one of the main reasons researchers usually choose Nvidia over AMD for machine learning. Nvidia says the performance of its software has improved by an order of magnitude from the Kepler era to the Pascal era.
However, GPUs are not the only game in town when it comes to training deep neural networks. With the field booming, all sorts of companies, old and new, are trying to grab a share of the market for deep learning-optimized chips.
There are companies that focus on FPGAs for machine learning, as well as companies building custom deep learning chips, such as Google, CEVA, and Movidius. Then there’s Intel, which wants to compete against GPUs under the Xeon Phi brand by using dozens of small Silvermont-based Atom cores instead.
In its paper, Intel claimed that four Knights Landing Xeon Phi chips were 2.3x faster than “four GPUs.” Intel also claimed that Xeon Phi chips could scale 38 percent better across multiple nodes (up to 128 of them, which Intel says can’t be achieved with GPUs). Intel said systems made of 128 Xeon Phi servers are 50x faster than a single Xeon Phi server, implying that Xeon Phi servers scale rather well.
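For context, Intel’s own numbers can be converted into a parallel-scaling efficiency figure. This is a back-of-the-envelope sketch in Python using only the 50x and 128-server values quoted above; the helper function is ours, not Intel’s methodology.

```python
# Parallel-scaling efficiency: achieved speed-up as a fraction of the
# ideal (linear) speed-up, where N nodes would ideally be N times faster.
def scaling_efficiency(speedup: float, nodes: int) -> float:
    """Return the fraction of ideal linear scaling achieved."""
    return speedup / nodes

# Intel's figure: 128 Xeon Phi servers are 50x faster than a single one.
eff = scaling_efficiency(50, 128)
print(f"{eff:.0%} of ideal linear scaling")  # 39% of ideal linear scaling
```

By this measure, 50x on 128 nodes works out to roughly 39 percent of perfect linear scaling, which is useful context for Nvidia’s counter-claims below.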
Intel also said in its paper that, when using an Intel-optimized version of the Caffe deep learning framework, its Xeon Phi chips are 30x faster than the standard Caffe implementation.
Nvidia’s main argument is that Intel used old data in its benchmarks, which is misleading in a comparison against GPUs, especially because Nvidia’s GPUs saw drastic gains in performance and efficiency when they moved from a 28nm planar process to a 16nm FinFET one. On top of that, Nvidia has spent the past few years optimizing various software frameworks for its GPUs.
Nvidia claims that if Intel had used a more recent implementation of the Caffe AlexNet test, it would have seen that four of Nvidia’s previous-generation Maxwell GPUs were actually 30 percent faster than four of Intel’s Xeon Phi servers.
Regarding Xeon Phi’s “38 percent better scaling,” Nvidia noted that Intel compared its latest Xeon Phi servers, with the latest interconnect technology, against the four-year-old Kepler-based Titan supercomputer. Nvidia added that Baidu has already shown that speech-training workloads, for instance, scale almost linearly across 128 Maxwell GPUs.
Nvidia also believes that, for deep learning, it’s better to have fewer, stronger nodes than many weaker ones anyway. It added that a single one of its latest DGX-1 “supercomputer in a box” systems is slightly faster than 21 Xeon Phi servers, and 5.3x faster than four Xeon Phi servers.
Considering that the OpenAI non-profit only just became the first customer to receive a DGX-1 system, it’s understandable that Intel couldn’t benchmark its Xeon Phi chips against one. However, Maxwell-based systems are hardly obscure by now, so it’s unclear why Intel chose to test its latest Xeon Phi chips against GPUs from a few generations ago running software from 18 months ago.
AI Chip Competition Heating Up (In A Good Way)
Xeon Phi likely still trails GPU systems for deep learning in both performance and software support. However, if Nvidia’s DGX-1 can only barely beat 21 Xeon Phi servers, that also means Xeon Phi chips are quite competitive on price.
A DGX-1 currently costs $129,000, whereas a single Xeon Phi chip costs anywhere from $2,000 to $6,000. Even a system built from 21 of Intel’s highest-end Xeon Phi chips, at roughly $126,000 in silicon alone, still roughly matches the DGX-1 on price.
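The back-of-the-envelope arithmetic behind that comparison looks like this; note that these are the chip list prices quoted above, not complete server costs, so the Intel side is a lower bound.

```python
# Rough price comparison using the figures quoted in the article.
# Chip prices only -- a full Xeon Phi server would cost more once the
# rest of the hardware is added.
DGX1_PRICE = 129_000      # Nvidia DGX-1, USD
PHI_PRICE_LOW = 2_000     # cheapest Knights Landing chip, USD
PHI_PRICE_HIGH = 6_000    # highest-end Knights Landing chip, USD
NUM_SERVERS = 21          # count Nvidia says one DGX-1 slightly beats

low_total = NUM_SERVERS * PHI_PRICE_LOW    # $42,000
high_total = NUM_SERVERS * PHI_PRICE_HIGH  # $126,000
print(f"21 Phi chips: ${low_total:,}-${high_total:,} vs DGX-1 ${DGX1_PRICE:,}")
# 21 Phi chips: $42,000-$126,000 vs DGX-1 $129,000
```

Even at the top of Intel’s price range, 21 chips come in just under the DGX-1’s sticker price, which is what makes the performance-per-dollar question interesting.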
Although the fight between Nvidia and Intel is likely to ramp up significantly over the next few years, it will be even more interesting to see whether ASIC-like chips such as Google’s TPU end up winning the day.
Intel is already using more “general purpose” cores for its Phi coprocessor, and Nvidia still has to think about optimizing its GPUs for gaming. That means the two companies may be unable to follow the more extreme optimization paths of custom deep learning chips. However, software support will also play a big role in the adoption of deep learning chips, and Nvidia arguably has the strongest software support right now.