Date: 2014-09-08
More Benchmark Results
Next up are our application-specific benchmarks, including c-ray 1.1, NAMD, NPB, p7zip, redis, and OpenSSL. At some point in the future, optimizations for Haswell-EP's advanced instructions may find their way into these titles. But for now, the performance we're reporting represents the current state of affairs. Specifically, AVX 2.0 would likely have a major impact on the results.
c-ray 1.1
Linux-Bench actually runs three different c-ray tests. The first is dubbed "easy", and is great for showing performance differences between Atom processors and desktop CPUs. We excluded that measurement because all three platforms finish it in under one second. Instead, we are using the much tougher command cat sphfract | ./c-ray-mt -t $threads -s $resolution -r 8 to demonstrate differences between these platforms.
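As a rough illustration of how the two variables in that command get filled in, here is a minimal Python sketch that builds the same pipeline for the two resolutions discussed in this section. The helper name c_ray_command and the default of using every logical CPU for the thread count are assumptions made for the example, not details taken from the Linux-Bench script itself.

```python
import os
import shlex


def c_ray_command(threads: int, width: int, height: int,
                  scene: str = "sphfract") -> str:
    """Return the shell pipeline quoted in the article for one run.

    `threads` fills $threads and WIDTHxHEIGHT fills $resolution.
    """
    resolution = f"{width}x{height}"
    return (
        f"cat {shlex.quote(scene)} | ./c-ray-mt "
        f"-t {threads} -s {resolution} -r 8"
    )


if __name__ == "__main__":
    # Assumption: one render thread per logical CPU on the test machine.
    threads = os.cpu_count() or 1
    for width, height in [(1920, 1200), (3840, 2160)]:
        print(c_ray_command(threads, width, height))
```

The printed lines can be pasted into a shell from the directory that holds c-ray-mt and the sphfract scene file.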
Ray tracing generally scales well with both CPU frequency and core count; we see both trends in action as the Xeon E5-2600 v3s pull ahead.
While the 1920x1200 test responds readily to more execution resources, the 3840x2160 benchmark doesn't. Some of that may be due to the -2690 v3's 300 MHz per core advantage. Still, the scaling of the Xeon E5-2690 from one generation to the next is made obvious.
NAMD
Our NAMD tests use molecular modeling to tax these server platforms. For anyone involved in projects like Folding@Home, these are the types of workloads that fully utilize multi-threaded processors.
Haswell-EP has little trouble showing off its strengths.
The first-gen Xeon E5 and v2 results aren't what most folks would expect. However, Ivy Bridge-EP had a nasty habit of getting aggressive on power-saving, dropping all cores to lower P-states when demand dropped. That may be what's happening here. In contrast, the Xeon E5-2600 v3s control this on a per-core basis, so the impact of turning cores on and off isn't reflected as painfully in the performance benchmarks.
NPB
For a test with "Parallel Benchmark" in its name, we're expecting Haswell-EP's high core counts to yield big performance numbers.
Whereas we see relatively pedestrian improvements going from the first-gen Xeon E5-2690 to the Haswell-EP-based variant, Intel's -2699 v3 finishes way ahead of the other CPUs. Since this was repeatable, I'm hypothesizing that the problem being solved fits into the big die's 45 MB L3 cache.
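The cache hypothesis can be made concrete with a small sketch. Only the 45 MB figure comes from this article; the other per-socket L3 sizes are Intel's published specifications, and the 40 MB working set is a purely hypothetical number chosen to show the effect, since the real working-set size of the NPB run was not measured here.

```python
MIB = 1024 * 1024

# Per-socket L3 cache in MiB. Only the 45 MB entry is stated in the article;
# the other three are Intel's published specs, listed for comparison.
L3_CACHE_MIB = {
    "Xeon E5-2690": 20,
    "Xeon E5-2690 v2": 25,
    "Xeon E5-2690 v3": 30,
    "Xeon E5-2699 v3": 45,
}


def cpus_that_fit(working_set_bytes: int) -> list[str]:
    """Return the CPUs whose L3 cache can hold the whole working set."""
    return [
        name
        for name, size_mib in L3_CACHE_MIB.items()
        if working_set_bytes <= size_mib * MIB
    ]


if __name__ == "__main__":
    # Hypothetical working set of 40 MiB: too big for the -2690 parts.
    print(cpus_that_fit(40 * MIB))
```

If the working set did land between 30 MB and 45 MB, only the -2699 v3 would keep it entirely on-die, which would be consistent with the jump seen in the chart. That remains a guess until the working set is actually profiled.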
P7zip
p7zip is a standard compression benchmark. Generally, these types of algorithms are able to take advantage of many threads. I'd guess that the Haswell-EP parts are able to overcome small frequency deficits to finish with a lead, thanks to their IPC throughput advantage and core count.
There is a linear-looking performance improvement stepping between each generation of Intel's Xeon E5-2690. The Xeon E5-2699 v3 again shows off what extra cores can do in a workload able to utilize them, posting an approximately 2x increase over the first-gen Xeon E5-2690.
Redis
Redis is an in-memory application, so core count has less of an overall impact.
As I expected, the results fall much closer to each other, looking a lot like our STREAM results. Still, the configuration with one 16 GB DDR4 DIMM per channel does pull ahead.
OpenSSL
Again, OpenSSL is widely used, so this is perhaps one of the most applicable benchmarks for Web servers. Some companies are pushing for broader use of SSL to keep data encrypted, making the metric particularly important.
The Haswell-EP-based parts scale well. In particular, the Xeon E5-2699 v3 shows a greater than 2x performance improvement over Intel's once-top-of-the-line Xeon E5-2690.
Actually we should be trying to move away from traditional serial-styled processing and move towards parallel processing. Each core can handle only one task at a time and can only utilize its own resources.
This is unlike a GPU, where many processors utilize the same resources and perform multiple tasks at the same time. The problem is that this type of architecture is not supported at all in CPUs and Nvidia is looking for people to learn to program for parallel styled architectures.
But this lineup of CPUs is clearly a marvel of engineering and hard work. Glad to see the server industry will truly start to benefit from the low power and finely-tuned abilities of Haswell, along with the recently introduced DDR4, which is optimized for low power usage as well. This, combined with flash-based storage (aka SSDs), which also has a lower power drain than the average HDD, will slash server power bills and save companies literally billions of dollars. Technology is amazing, isn't it?
However, with multiple cores, now we can have better AI and other "off-screen" items that don't necessarily always depend upon the user's direct input. There's still a lot of work to be done there, though.
I think all of the major server vendors are going to suck up all of the major memory manufacturers' DDR4 capacity for a while before the prices go down.
Whether it helps or hinders will ultimately depend on the VM admin. What most VM admins don't realize is that HT can actually end up degrading performance in virtual environments unless the VM admin takes specific steps to use HT properly (and most do not). A lot of companies will tell you to turn off HT to increase performance because they've dealt with a lot of VM admins who don't set things up properly. Many VM admins over-allocate, which is part of the reason HT can degrade performance, but there are other settings that have to be configured in the hypervisor so that the guest VMs get the resources they need.
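To put a number on the over-allocation point, here is a minimal sketch that compares allocated vCPUs against physical cores and against the logical CPUs a hypervisor sees with HT enabled. The function name and the example host (two 18-core Xeon E5-2699 v3 sockets, 72 vCPUs handed out) are assumptions for illustration, not a recommendation for any particular hypervisor.

```python
def overcommit_ratios(vcpus_allocated: int, physical_cores: int,
                      hyper_threading: bool = True) -> dict[str, float]:
    """Return vCPU-to-core ratios for a single host.

    With HT on, each physical core shows up as two logical CPUs, so the
    ratio per logical CPU looks half as bad as the ratio per physical core.
    """
    logical_cpus = physical_cores * (2 if hyper_threading else 1)
    return {
        "per_physical_core": vcpus_allocated / physical_cores,
        "per_logical_cpu": vcpus_allocated / logical_cpus,
    }


if __name__ == "__main__":
    # Hypothetical host: 2 sockets x 18 cores, 72 vCPUs allocated to guests.
    print(overcommit_ratios(vcpus_allocated=72, physical_cores=36))
```

In that example the host looks exactly subscribed at 1.0 vCPU per logical CPU, yet every physical core is backing two vCPUs. That gap is one way the over-allocation described above can go unnoticed.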
In many cases, trying to convert algorithms to threads is simply more trouble than it is worth.
A game is made of sound, logic and graphics. You may dedicate these 3 processes to a number of cores, but they remain 3. As you split the load, some of the logic must keep track of who did what and where. Logic deals mainly with FPU units, while graphics deals with integers. GPUs are great integer number crunchers. They have to be fed by the CPU, so an extra core has to manage data across different memories, and this is where we start failing. Keeping everything in one spot, with the same resources, reduces the need to transfer data. Implementing a whole processor with GPU, FPU, x86 and sound processor all in one package with on-board memory makes for the ultimate gaming processor. As long as we render scenes with triangles we will keep using the legacy stuff. When the time comes to render scenes by pixel we will need a fraction of today's performance, half of the texture memory (just scale the highest quality) and half of the model memory. Epic is already working on that.
Great points. One minor complication is that the NVIDIA GeForce Titan used in the Haswell-E review would not have fit in the 1U servers (let alone be cooled well by them). Onboard Matrox G200eW graphics are too much of a bottleneck for the standard test suite.
On the other hand, this platform is going to be used primarily in servers. Although there are some really nice workstation options coming, we did not have access in time for testing.
One plus is that you can run the tests directly on your own machine by booting to an Ubuntu 14.04 LTS LiveCD and issuing three commands. There is a video and the three simple commands here: http://linux-bench.com/howto.html. That should give you a rough idea of your system's performance compared to the test systems.
Hopefully we will get some workstation appropriate platforms in the near future where we can run the standard set of TH tests. Thanks for your feedback since it is certainly on the radar.