Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance

Sapphire Rapids (Image credit: Intel)

SPEC says it will no longer publish SPEC CPU 2017 results produced with certain versions of the Intel compiler, citing a targeted optimization aimed at specific benchmarks that essentially amounts to cheating (via ServeTheHome and Phoronix). A note has been added to the more than 2,600 benchmark results published with the offending compiler, effectively invalidating them; most come from machines running 4th Gen Xeon Sapphire Rapids CPUs.

SPEC CPU 2017 is a benchmark suite used mostly for high-end servers, data centers, and workstations/PCs. It tests performance across a range of workloads in a standardized way so that different computers can be compared to each other. Good performance in SPEC CPU 2017 hinges not just on hardware but also on software, and one of the key software-side factors is the compiler: the program that translates source code into the machine code a processor executes, and whose optimizations largely determine how fast that code runs.

The disclaimer now attached to over 2,600 SPEC CPU 2017 results states, "The compiler used for this result was performing a compilation that specifically improves the performance of the 523.xalancbmk_r / 623.xalancbmk_s benchmarks using a priori knowledge." This means the compiler (in this case, Intel's oneAPI DPC++/C++ Compiler) was not optimized for a particular kind of workload, or even for specific applications, but specifically for two SPEC CPU 2017 benchmarks.

Compiler optimization in general is expected and welcome, since more performance is obviously better, but optimizing specifically for benchmarks is controversial and frowned upon. SPEC wants its benchmarks to reflect the real-world performance of hardware and to provide a standardized way to compare different processors. If a compiler optimization improves performance only in a particular benchmark and not in real workloads, the resulting score no longer reflects real-world performance and is meaningful only for that one benchmark.
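To illustrate the distinction, here is a purely hypothetical sketch (in Python, for readability) of what "a priori knowledge" of a benchmark could look like inside a compiler. It is not Intel's actual implementation, which has not been published; the point is that the special-case path fires only when the compiler recognizes the benchmark's own code, so the speedup never transfers to other programs.

```python
# Hypothetical illustration only, not Intel's code: the special-case branch
# triggers on a fingerprint of the benchmark's own source, so ordinary
# XSLT/XML workloads never see the benefit.
import hashlib

# Placeholder fingerprints of hot functions from a known benchmark.
KNOWN_BENCHMARK_FINGERPRINTS = {"0f343b0931126a20f133d67c2b018a3b"}

def fingerprint(source_code: str) -> str:
    """Hash the source so the compiler can recognize code it has seen before."""
    return hashlib.md5(source_code.encode()).hexdigest()

def apply_generic_passes(source_code: str) -> str:
    """Stand-in for the normal, general-purpose optimization pipeline."""
    return f"generic-optimized ({len(source_code)} bytes of input)"

def apply_benchmark_specific_transform(source_code: str) -> str:
    """Stand-in for a transformation tuned to one exact piece of code."""
    return f"benchmark-special-cased ({len(source_code)} bytes of input)"

def optimize(source_code: str) -> str:
    # This branch is what SPEC's run rules prohibit: using a priori knowledge
    # of the benchmark to pick a code path no other program will ever hit.
    if fingerprint(source_code) in KNOWN_BENCHMARK_FINGERPRINTS:
        return apply_benchmark_specific_transform(source_code)
    return apply_generic_passes(source_code)

print(optimize("void transformXml() { /* ordinary application code */ }"))
```

A real-world XSLT application would take the generic path and see none of the benefit, which is why SPEC treats this kind of optimization as invalidating.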

According to Phoronix, the optimization could boost performance in SPECint by 9% overall. The publication also notes that versions 2022.0 through 2023.0 of the Intel oneAPI Compiler are affected, meaning most of the now-invalidated results were run in 2022, largely on Sapphire Rapids CPUs. Results for fifth-gen Xeon Emerald Rapids CPUs are unlikely to have been produced with an affected compiler version, since Emerald Rapids came out after newer, unaffected versions of the compiler were available.

Benchmark-specific optimization has been a hot topic for years. Back in 2003, Nvidia was accused of performing a driver-side optimization to boost the performance of its GPUs in 3DMark 2003. In 2010, Nvidia itself alleged that AMD was cheating in actual games by not enabling certain driver-side settings that would have significantly boosted visual quality at the expense of performance. Accusations these days don't get quite as heated, though SPEC has certainly shamed Intel in this case.

Matthew Connatser

Matthew Connatser is a freelance writer for Tom's Hardware US. He writes articles about CPUs, GPUs, SSDs, and computers in general.

  • PEnns
    "...essentially amounts to cheating"
    WHAT??, Intel cheating?? I am shocked, shocked!!
  • TerryLaze
    What the heck SPEC, so basically what you are saying is that you are using useless benchmarks that don't target any kind of workload, or even specific applications, but then you are salty that somebody optimizes their compiler for it?!
Also, how is SPEC NOT a specific application? It doesn't get any more specific than benchmarks that don't target any kind of workload, or even specific applications.
    The disclaimer that it is now attached to over 2,600 SPEC CPU 2017 results states, "The compiler used for this result was performing a compilation that specifically improves the performance of the 523.xalancbmk_r / 623.xalancbmk_s benchmarks using a priori knowledge." This means the compiler (in this case, Intel's oneAPI DPC++/C++ Compiler) was not optimized for a particular kind of workload, or even for specific applications, but specifically for two SPEC CPU 2017 benchmarks.
  • -Fran-
    I'll just leave this here.

    https://www.cnet.com/science/amd-quits-benchmark-group-implying-intel-bias/
    Regards.
  • bit_user
    523.xalancbmk_r / 623.xalancbmk_s benchmarks using a priori knowledge." This means the compiler (in this case, Intel's oneAPI DPC++/C++ Compiler) was not optimized for a particular kind of workload, or even for specific applications, but specifically for two SPEC CPU 2017 benchmarks.
I think it's basically just one benchmark that's included in two different suites. Xalan is an XSLT processor developed under the umbrella of the Apache Software Foundation.

    As for the _r and _s distinction, these signify rate vs. speed. SPEC explains them as follows:
    "There are many ways to measure computer performance. Among the most common are:
    Time - For example, seconds to complete a workload.
    Throughput - Work completed per unit of time, for example, jobs per hour.
    SPECspeed is a time-based metric; SPECrate is a throughput metric."

    https://www.spec.org/cpu2017/Docs/overview.html#Q15
    They further list several key differences, but these two jumped out at me:
    For speed, 1 copy of each benchmark in a suite is run. For rate, the tester chooses how many concurrent copies to run.
For speed, the tester may choose how many OpenMP threads to use. For rate, OpenMP is disabled.
    I'm a little surprised by the latter point, but I guess it makes sense. What it means is that SPECspeed shouldn't be taken purely as a proxy for single-threaded performance. You really ought to use SPECrate for that, which I think is what I've seen.
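For anyone curious how the two metrics turn into scores: my reading of the SPEC docs (rough sketch below, not the official scoring tools) is that each benchmark gets a ratio versus a reference machine's run time, rate runs additionally get credit for the number of copies, and the suite score is the geometric mean of those ratios.

```python
from math import prod

def specspeed_ratio(ref_seconds: float, measured_seconds: float) -> float:
    # SPECspeed: one copy of the benchmark; ratio = reference time / measured time.
    return ref_seconds / measured_seconds

def specrate_ratio(copies: int, ref_seconds: float, measured_seconds: float) -> float:
    # SPECrate: N concurrent copies; throughput credit scales with the copy count.
    return copies * ref_seconds / measured_seconds

def suite_score(ratios: list[float]) -> float:
    # The overall metric is the geometric mean of the per-benchmark ratios.
    return prod(ratios) ** (1 / len(ratios))

# Toy numbers for illustration only, not real results.
speed = suite_score([specspeed_ratio(1752, 180), specspeed_ratio(1320, 150)])
rate = suite_score([specrate_ratio(64, 1752, 900), specrate_ratio(64, 1320, 700)])
print(f"SPECspeed-style score: {speed:.1f}")
print(f"SPECrate-style score:  {rate:.1f}")
```

The relevance to the xalancbmk story is that because the score is a geometric mean, a targeted boost to a single benchmark's ratio lifts the whole suite number, which is presumably how one optimization can move overall SPECint by roughly 9%.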

Results for fifth-gen Xeon Emerald Rapids CPUs are unlikely to have been produced with an affected compiler version, since Emerald Rapids came out after newer, unaffected versions of the compiler were available.
    I'm not sure how the author concludes this. I don't see anywhere to download previous versions of Intel's DPC++ compiler. The latest is 2024.0.2 and that release is dated Dec. 18th, 2023.

    BTW, I wonder who tipped them off. Did someone just notice those results were suspiciously good and start picking apart the generated code, or did a disgruntled ex-Intel employee maybe drop a dime?
  • bit_user
    TerryLaze said:
    What the heck SPEC,
    Standard Performance Evaluation Corporation
    The System Performance Evaluation Cooperative, now named the Standard Performance Evaluation Corporation (SPEC), was founded in 1988 by a small number of workstation vendors who realized that the marketplace was in desperate need of realistic, standardized performance tests. The key realization was that an ounce of honest data was worth more than a pound of marketing hype.

    SPEC publishes several hundred different performance results each quarter spanning a variety of system performance disciplines.

    The goal of SPEC is to ensure that the marketplace has a fair and useful set of metrics to differentiate candidate systems. The path chosen is an attempt to balance requiring strict compliance and allowing vendors to demonstrate their advantages. The belief is that a good test that is reasonable to utilize will lead to a greater availability of results in the marketplace.

    SPEC is a non-profit organization that establishes, maintains and endorses standardized benchmarks and tools to evaluate performance for the newest generation of computing systems. Its membership comprises more than 120 leading computer hardware and software vendors, educational institutions, research organizations, and government agencies worldwide.

    https://www.spec.org/spec/
    One neat thing about SPECbench is that you actually get it in the form of source code that you can compile and run just about anywhere. For years, Anandtech even managed to run it on iPhone and Android phone SoCs. This allowed them to compare performance and efficiency relative to desktop x86 and other types of CPUs.

    As far as I'm aware, GeekBench is one of the only other modern, cross-platform benchmarks. However, unlike SPECbench, it's basically a black box. This makes it a ripe target for allegations of bias towards one kind of CPU or platform vs. others.

    TerryLaze said:
    so basically what you are saying is that you are using useless benchmarks that don't target any kind of workload, or even specific applications
No, SPECbench is composed of real-world, industry-standard applications.

    TerryLaze said:
    then you are salty that somebody optimizes their compiler for it?!
    Yes. The article explains that the benchmark suite is intended to be predictive of how a given system will perform on certain workloads. If a vendor does highly-targeted compiler optimizations for the benchmark, those don't carry over to similar workloads and thus invalidate the benchmark. That undermines the whole point of SPECbench, which is why they need to take a hard line on this sort of activity.
  • JamesJones44
Let's be honest, who believes any of the benchmarks released by any host company? Not a day goes by where an independent benchmark doesn't look different from what Apple, Intel, AMD, Nvidia, Micron, etc. stated in their benchmarks.
  • bit_user
    JamesJones44 said:
Let's be honest, who believes any of the benchmarks released by any host company? Not a day goes by where an independent benchmark doesn't look different from what Apple, Intel, AMD, Nvidia, Micron, etc. stated in their benchmarks.
    That's not what this is about.

    SPEC gets submissions for an entire system. As such, they're usually submitted by OEMs and integrators. There's a natural tendency to use the compiler suite provided by the CPU maker, since those have all of the latest and greatest optimizations and tuning for the specific CPU model. That's where the trouble started.

SPEC has various rules governing the way systems are supposed to be benchmarked in order to be eligible for submission. It's a little like the Guinness Book of World Records, or perhaps certain athletics bodies and their rules concerning official world records.
  • punkncat
    What do you mean the new car I just purchased doesn't really get 40 MPG in real world conditions?

    AGAST!

    Why does this even qualify as news? This isn't anything novel or unheard of. Whispers about it going on for years now. If anyone is surprised, they are also naive...and I got some beachfront property to sell you...
  • TerryLaze
    bit_user said:
    Yes. The article explains that the benchmark suite is intended to be predictive of how a given system will perform on certain workloads. If a vendor does highly-targeted compiler optimizations for the benchmark, those don't carry over to similar workloads and thus invalidate the benchmark. That undermines the whole point of SPECbench, which is why they need to take a hard line on this sort of activity.
    I don't get the distinction...
If it's predictive of possible compiler optimizations and Intel actually did those compiler optimizations, then what's the issue?!

    It can't be both ways,
either these particular benches are useless, or Intel optimized the compiler towards whatever the predicted use case was.