Intel Xeon E5-2600 V3 Review: Haswell-EP Redefines Fast
1. Xeon E5-2600 v3 Platform Introduction

Today marks the launch of Intel's Xeon E5-2600 v3 processor family, based on the Haswell-EP design. We knew this day was coming, since the company already introduced its Haswell-E-based Core i7s. Of course, the Xeon family is Intel's mainstream server/workstation processor family, and the E5-2600 series is perhaps the highest volume line-up in the Xeon portfolio. It is also responsible for forcing AMD's Opteron 4000 and 6000 CPUs into relative submission. Now, the competition is refocusing efforts on low-end ARM-based processors.

The dual-socket server market is absolutely huge. So, any major technology refresh in the segment triggers billions of dollars in upgrade purchases. HP already announced its new ProLiant Generation 9 servers and other vendors will roll out their own implementations starting today. Most server systems have a field life of three to five years. It follows, then, that Haswell-EP-based processors will replace platforms built on Nehalem-EP, Westmere-EP, and Sandy Bridge-EP. And unlike most desktop PCs, every dual-socket server can easily cost many thousands of dollars.

As you are undoubtedly aware, there are three distinct lines under the Xeon banner. These E5s represent Intel's mid-range platform. The E3s more closely align with mainstream desktop core configurations, while the E7 tier is higher-end, scaling up to eight processors, many terabytes of system memory, and enabling RAS features for mission-critical applications. The E5 is a utility player of sorts, handling everything from heavily virtualized workloads to bare metal HPC applications. The "2" in the part number lets us know that we're looking at single- and dual-socket-capable parts. The "6" immediately following loses some of its meaning this time around. Previously, Sandy Bridge-EP- and Ivy Bridge-EP-based processors were also available as Xeon E5-2400s, which weren't as fully-featured. There is no Xeon E5-2400 v3 this time around, though. As of now, the E5s are 2600-series chips.

With Sandy Bridge-EP (Xeon E5-2600), we saw as many as eight cores manufactured using a 32 nm process. Ivy Bridge-EP (Xeon E5-2600 v2) benefited from a process shrink to 22 nm, enabling core counts as high as 12. Haswell-EP (Xeon E5-2600 v3) is being productized in configurations as wide as 18 cores. Each generation follows the same core design and incorporates much of the technology that we see in the aligned consumer segment. That means, with Haswell-EP, voltage regulation circuitry moves on-package instead of residing on the motherboard. Another major change (already seen on the desktop) is Haswell-EP's LGA 2011-3 interface, which is not compatible with Sandy Bridge-EP, Ivy Bridge-EP, or the new Ivy Bridge-EX's 2011-pin socket. The new interface facilitates DDR4 memory compatibility, delivering lower power, more density, and higher data rates than previous generations.

Here is a quick overview of the different models in the Intel Xeon E5-2600 v3 generation:

Clearly, the number of SKUs is massive. Intel tells us that three dies are used to create all of these different CPU models. Remember, many of the systems Haswell-EP will replace currently employ Westmere-EP, which allowed up to two sockets with six cores each. Common DDR3 data rates were 1066 and 1333 MT/s. Updating to Xeon E5-2600 v3 makes it possible to put two to three times as many cores into the same form factor and likely reduce power consumption at the same time.

Spanning four to 18 cores and up to 3.6 GHz base clock rates, Intel is enabling CPU models that are optimized for many different markets. Thermal design power ratings range from 55 to 145 W on the server side, and as high as 160 W for the Xeon E5-2687W v3 workstation part. That includes the fully integrated voltage regulator (FIVR) also seen on Intel's desktop-class Haswell processors.

One other note: this is the preliminary planned SKU composition. We know Intel is customizing processors for EMC, NetApp, and other large customers requiring specific feature sets. Those are generally not listed as public SKUs, though.

2. Meet Intel's Grantley Platform

In addition to its new Haswell-EP-based CPUs, there's a lot more to Intel's Grantley platform. We were given the following quick reference guide, which covers the basics:

There are a number of evolutionary changes to account for, but perhaps the biggest is Grantley's memory support. Four generations of server platforms dating back to Nehalem-EP utilized DDR3 RAM, and we've seen efforts to further tweak that standard for lower power use or greater density. Registered DDR4 DIMMs successfully achieve those improvements, additionally increasing throughput per channel.

Servers are often loaded up with RAM to handle more VMs or even to expand the space available for in-memory storage applications like memcached or redis. This typically requires more DIMMs per memory channel, imposing penalties on the peak data rate you're able to hit. DDR4 is designed to accommodate more DIMMs in a configuration without the performance penalty suffered by DDR3. And because it operates on a lower input voltage than even DDR3L, energy efficiency is built into the spec.
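If you want to check what a given server is actually doing on this front, the SMBIOS tables expose per-DIMM details. A minimal sketch, assuming a Linux host with dmidecode available and root access:

    # Dump per-DIMM information from the SMBIOS tables (slot, size, rated and configured speed)
    sudo dmidecode -t memory | grep -E "Locator|Size|Speed"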

Of course, memory support is a product of the CPU's integrated memory controller. But not all system functions are built into Intel's processors yet. You still need a platform controller hub for a lot of the peripheral connectivity and I/O. The Wellsburg PCH, much like the already-reviewed X99 Express, exposes 10 SATA 6Gb/s ports. That's a significant upgrade over the Xeon E5-2600 v1 and v2 platforms, where the focus was on adding optional SAS connectivity. Intel is clearly taking a different tack to coincide with the introduction of its NVMe-based SSDs. We're plenty happy with expanded SATA support, which is great for low-cost SSDs and traditional mechanical disks. High-performance storage is moving to the PCIe bus.

Other features include six USB 3.0 ports and eight second-gen USB connectors, useful for faster KVM cart access and accelerated boot from an internal VMware ESXi USB key installation. Several of the platforms we've seen in the lab are USB 3.0-only, in fact. That's a significant change from previous generations limited to USB 2.0.

The CPUs still enable 40 PCI Express 3.0 lanes, divisible into a number of different link configurations. This is a common feature on processors in the -EP range. With faster networking in this generation and a renewed focus on PCIe-based flash storage, all of that connectivity should go to good use.

Later in this article, I'll cover how power consumption and distribution change with Haswell-EP. The key is that, again, voltage regulation is on-package, and P-state control is more granular. As we saw on the desktop, this results in low idle power use. But unlike Haswell in its mainstream form, Haswell-EP packs up to 4.5x as many execution cores and more than five times the last-level cache. And in dual-CPU arrays, the effects of power savings are doubled per machine.
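You can watch that per-core behavior from Linux's cpufreq interface. A quick sketch, assuming a stock Ubuntu installation with the linux-tools package providing cpupower:

    # Print the instantaneous frequency of every logical CPU (values in kHz)
    grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
    # Summarize the active governor and the hardware frequency limits
    cpupower frequency-info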

At least in my opinion, the most exciting platform change involves networking, including the 40 GbE Fortville controller...

3. Fortville: 40 GbE Ethernet For The Masses

Alongside the Grantley platform, Intel is introducing a new generation of Ethernet network adapters code-named Fortville. The controller is 40 Gb-capable, and it completely changes the game with regards to Intel's networking capabilities.

Given the massive performance increases enabled by today's systems, low-latency and high-bandwidth networking is essential. Grantley is built to consolidate an even greater number of virtual machines on a single server. Technologies like VMware vSAN utilize local storage to create distributed SANs for these virtual machines. Even trends like software-defined networking can benefit from higher port counts and greater networking performance. Fortville is Intel's solution.

There are three main configurations of the Fortville adapter: 2 x 40 GbE, 1 x 40 GbE, and 2 x 10 GbE. Compare that to the previous generation of Spring Fountain-based X520 adapters capable of up to 2 x 10 GbE. Potential bandwidth, then, goes up from 20 Gb/s to 80 Gb/s at the top of the Fortville X710 family.

While a 4x bandwidth increase sounds awesome, that ceiling is currently not possible to hit. Servers generally expose eight-lane PCIe slots, and right now the third-gen standard gives you a little less than 8 GB/s of throughput through one of them. Furthermore, there is always some overhead involved. So, hooked up to the Cray-Gnodal GS0018 (18 x 40 GbE) switch in the lab, we are seeing between 50 and 55 Gb/s of peak bandwidth. We don't have enough data to publish formal numbers. However, there is definitely a bottleneck in play.
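The arithmetic behind that ceiling is straightforward. A back-of-the-envelope sketch, counting only the 128b/130b encoding and ignoring further protocol overhead:

    # PCIe 3.0 x8: 8 lanes x 8 GT/s, 128b/130b encoding
    awk 'BEGIN { gbps = 8 * 8 * 128 / 130; printf "%.1f Gb/s (about %.1f GB/s) per x8 slot\n", gbps, gbps / 8 }'

Subtract descriptor, interrupt, and driver overhead from that roughly 63 Gb/s figure and the 50 to 55 Gb/s we observed looks about right.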

Still, in most deployments, the two QSFP ports will attach to different switches for failover. And there's plenty of headroom in the card to drive essentially a full 40 Gb connection, plus a solid amount through the second port.

Another aspect of 40 GbE-based products like the XL710 is that each QSFP port can utilize QSFP-to-4x SFP+ breakout cables. This allows for each XL710 card with dual QSFP ports to connect to up to 8x SFP+ 10 GbE devices. In theory, you can place eight of these dual-port cards in a server and then use that machine with 64 10 GbE network connections. The reasons not to are pretty obvious, but it's at least technically possible.

The other side of the Fortville story, aside from the controller's huge performance and density improvements, is power consumption. Fortville uses less power than the previous-gen X520 10 GbE adapters, both at idle and under load. The X520s had an 8.6 W TDP, while the XL710 generation is rated at 7 W. In a theoretical efficiency metric, Fortville delivers more than 3.5 times the throughput/watt compared to the previous generation. That is a massive leap forward. Consistent with this, Fortville is rated at 3.6 W typical power consumption using two 40 GbE links, so 7 W TDP represents a lot of headroom.

At the end of the day, Intel's Fortville-based adapters should enable more bandwidth, lower latency, and higher port density, all at reduced power consumption.

4. How We Tested

Today's tests involve typical 1U server platforms. Supermicro sent along a new 1U SuperServer configured with two Intel Xeon E5-2690 v3 processors and 16 x 8 GB DDR4-2133 DIMMs from Samsung. We had a similar 1U Supermicro platform and pairs of Intel Xeon E5-2690 v1 and v2 processors to create a direct comparison. The Xeon E5-2690s are generally considered the higher end of what ends up becoming mainstream. For example, companies like Amazon use the E5-2670 v1 and v2 quite extensively in their AWS EC2 compute platforms. The -2690 generally offers the same core count, just at a higher clock rate.

Intel also sent along a 2U "Wildcat Pass" server platform that was configured with two Xeon E5-2699 v3 samples and 8 x 16 GB registered DDR4 modules (one DIMM per channel), plus two Intel SSD DC S3500 drives. The E5-2699 v3 is a massive processor. It wields a full 18 cores capable of addressing 36 threads through Hyper-Threading. Forty-five megabytes of shared L3 cache work out to 2.5 MB per core, and the whole configuration fits into a 145 W TDP.
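A quick way to sanity-check that topology from the operating system is to ask the kernel what it enumerated. A sketch, assuming any recent Linux distribution with util-linux installed:

    # Sockets, cores, threads, and cache sizes as the kernel sees them
    lscpu | grep -E 'Model name|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|L3 cache'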

Naturally, this is going to represent a lower-volume, high-dollar server. But it's going to illustrate the full potential of Haswell-EP, too. We're using the Wildcat Pass server as our control for Intel's newest architecture.

Meanwhile, a Lenovo RD640 2U server operates as our control for Sandy Bridge-EP and Ivy Bridge-EP. It leverages 8 x 16 GB of registered DDR3 memory, totaling 128 GB. We dropped those SSD DC S3500s in there, too.

As we make our comparisons, keep a few points in mind. First, at the time of testing, DDR4 RDIMM pricing is absolutely obscene. Street prices are several times higher per gigabyte than DDR3. This will come down over time as manufacturing ramps up. But prohibitive expense did affect our ability to configure the servers with more than 128 GB.

We are focusing today's review on processor performance and power consumption. As a result, we are using the two SSD DC S3500s with 240 GB each in a RAID 1 array. We did have a stack of trusty SanDisk Lightning 400 GB SLC SSDs available. But neither of our test platforms came with SAS connectivity. Although there are plenty of add-in controllers that would have done the job, there is clearly a market shift happening away from such configurations. Sticking with SATA-based SSDs kept the storage subsystem's power consumption relatively low, while at the same time leaning on a fairly common arrangement in servers reliant on shared network storage.

Bear in mind also that we're using 1U and 2U enclosures, each with a single server inside. The Xeon E5 series is often found in high-density configurations with multiple nodes per 1U, 2U, or 4U chassis. For instance, the venerable Dell C6100, based on Nehalem-EP and Westmere-EP, was extremely popular with large Web 2.0 outfits like Facebook and Twitter. Many of those platforms have been replaced by OpenCompute versions, but we expect many non-traditional designs to be popular with the E5-2600 v3 generation, especially given its power characteristics.

5. Supermicro SYS-6018R-WTR

The main Xeon E5-2600 v3-based platform we've been using for testing is Supermicro's 1U SYS-6018R-WTR. It's largely an evolution of the SYS-6017R-WRF we already had in the lab. Although this is a 1U format, the server has redundancy built-in, and packs the space with features.

First, as a 1U chassis, there are only so many options for front-mounted storage. This particular chassis exposes four 3.5" front hot-swap bays. In our test system, bays one and two are populated with the Intel SSDs. There are standard LED indicators, as well as power and reset buttons. The rest of the chassis' face serves as a large vent, pulling air in and over the power-hungry components inside.

Redundant fans are responsible for this task. Essentially, two fans are spliced together. If one fails, the other continues to operate. In datacenters, emergency remote hands to replace a fan can cost $100. So, minimizing the need for urgent replacements saves money. Furthermore, with the redundant design, if a fan does fail, the system continues to receive some cooling.

In the same vein of redundancy, there are two 700 W 80 PLUS Platinum-rated PSUs in the rear of the chassis. Low-cost 1U designs, often sporting single Xeon E3s, are usually sold with one power supply to reduce costs. Higher-end servers like this one are meant to be fed with A+B power, and thus have the ability to weather a failure along one of the power delivery routes. That is to say each power supply is fully capable of keeping the server running independently.

The back of the chassis exposes fairly standard I/O. There are two built-in Ethernet ports and one IPMI/KVM-over-IP port for remote management. If you've ever experienced a serious failure from a remote location and used KVM-over-IP to troubleshoot, you already know that this feature is awesome. Supermicro also enables four USB ports and a VGA output. Interestingly, this server does not have a dedicated serial port. You can always attach a USB-to-serial adapter if it's really needed.

Inside the server there are two expansion bays. For testing, we used one of the PCIe riser's slots for Supermicro's Fortville-based dual 40 GbE adapter.

Plastic guides ensure that air flows through the 1U heat sinks, RAM, and expansion cards in a focused fashion.

Our system came equipped with sixteen 8 GB Samsung DDR4-2133 ECC RDIMMs (eight per processor). When we received the test unit, these were very hard to purchase on the open market. You can see that the server has four memory slots on either side of each processor, totaling 16.

6. Linux-Bench Components And Test Setup

While Windows Server is still a popular platform, many Xeon E5-based servers are also going to run some flavor of Linux as an operating system. Server-oriented hardware also has relatively poor graphics capabilities. Most are able to render a single 2D terminal with a decent amount of latency, and that's about it. As a result, we are using a variety of Linux-based benchmarks to test the Xeon E5-2600 v3.

If you've manually configured and run benchmarks under Linux, then you know they can be an exercise in frustration. For this review, we are using a "simple" test script to automate running a few common Linux benchmarks. It's free and can be found at linux-bench.com or on GitHub. There is also a new Docker.io version of the script on GitHub, so it can be run using perhaps the hottest technology of the year. The bonus of using this type of script is that it is freely available to test on your own server.

It is designed to run off of a standard Ubuntu 14.04 LTS LiveCD using only three commands. As a result, it can even be run remotely, via KVM-over-IP, on servers with no local disks installed. We did run with a local LiveCD image, booting into a CLI environment before each iteration to ensure no artifacts were left over from previous runs.
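The exact commands are published on linux-bench.com and in the GitHub repository; the outline below is only a hypothetical sketch of the flow from a freshly booted LiveCD, and the download URL is a placeholder rather than the real one:

    # 1. Get curl onto the live environment
    sudo apt-get update && sudo apt-get install -y curl
    # 2. Fetch the Linux-Bench script (substitute the URL listed on linux-bench.com)
    curl -sL <linux-bench-script-url> -o linux-bench.sh
    # 3. Run it; the script installs its own dependencies and runs each benchmark in turn
    sudo bash linux-bench.sh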

The Linux-Bench script itself does little other than install dependencies and run benchmarks. As a disclaimer, I am one of the community contributors to the script. However, I do not help maintain any of the individual benchmarks.

UnixBench 5.1.3

The byte-unixbench project can be found on Google Code here. However, it has roots dating back to 1983. It is an extremely popular suite that has a number of component tests like Dhrystone, Whetstone, and shell scripts. Specifically, we are interested in the CPU tests, so metrics for 2D/3D GPU and storage are excluded. Also, since these systems have many processing cores, we utilized the high CPU count patch.
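For reference, the CPU-focused passes look roughly like this, assuming the UnixBench 5.1.3 sources are already unpacked, patched for high CPU counts, and built with make:

    # Single-threaded pass of the CPU-oriented tests
    ./Run -c 1 dhry2reg whetstone-double
    # One copy per logical CPU for the multi-threaded pass
    ./Run -c "$(nproc)" dhry2reg whetstone-double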

c-ray 1.1

c-ray 1.1 is a popular and simple ray-tracing benchmark for Linux systems written by John Tsiombikas. It is designed so that, on most systems, it should not need to access RAM and therefore is highly sensitive to processor performance. You can find archived results, including those from SGI systems, here.

STREAM

STREAM is perhaps the seminal memory bandwidth application. The benchmark was created and is maintained by Dr. John D. McCalpin. More information can be found here.
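STREAM is distributed as a single C file; the usual approach is to compile it with OpenMP and an array several times larger than the last-level cache, so the measurement reflects DRAM rather than L3. A sketch, with an array size we picked arbitrarily for these 128 GB machines:

    # Build with OpenMP; 200M elements per array (~4.8 GB across the three arrays) comfortably exceeds 45 MB of L3
    gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=200000000 stream.c -o stream
    # One thread per logical CPU
    OMP_NUM_THREADS="$(nproc)" ./stream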

OpenSSL

OpenSSL caused a stir with the now famous "Heartbleed" bug earlier in 2014. This is the technology that secures much of the Internet's data traffic, and is a common server application.
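OpenSSL ships its own throughput tester, which is what most Linux benchmark suites wrap. A minimal example, assuming the distribution's stock openssl binary:

    # Spawn one worker per core and measure RSA signing plus bulk cipher/hash throughput
    openssl speed -multi "$(nproc)" rsa2048 aes-256-cbc sha256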

HardInfo

HardInfo is a simple benchmark that's popular in Linux-based environments. It is well known perhaps because the benchmark is installed by default on many Ubuntu desktop systems.
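HardInfo is primarily a GUI tool, but it can also emit its results as a report from the command line. A sketch assuming the Ubuntu hardinfo package; the flag names are worth verifying against hardinfo --help on your system:

    # Run only the benchmark module and print a report
    hardinfo -m benchmark.so -r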

NAMD

NAMD is a molecular modeling benchmark. It was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. More information can be found here.
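The pre-built multicore NAMD binaries take a +p flag for the number of worker threads, and the ApoA1 input deck is the de facto standard benchmark. A sketch, with paths that will vary depending on where you unpack the archives:

    # Run the ApoA1 benchmark across all logical CPUs
    ./namd2 +p"$(nproc)" apoa1/apoa1.namd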

NPB

NPB, or the NAS Parallel Benchmarks, is a set of computational fluid dynamics applications originally intended to benchmark parallel supercomputers for NASA. We are using only one node for our testing, though today's multiprocessor systems in some ways mirror parallel computers from many years ago. You'll find more information on NASA's site, here.
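For a single node, the OpenMP build of NPB is the relevant one. A sketch assuming the NPB 3.x sources, with the BT kernel and class C problem size chosen purely as examples; you also need a config/make.def adapted from the supplied template:

    # Build the BT kernel at class C, then run it with one thread per logical CPU
    make bt CLASS=C
    OMP_NUM_THREADS="$(nproc)" ./bin/bt.C.x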

p7zip

7-Zip is a popular open source compression application. Servers compress data for storage purposes and also before transmitting it over the network, making this an extremely common tool and workload.
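p7zip includes a built-in LZMA benchmark, which is what we report here. Running it is a one-liner, assuming the p7zip-full package is installed:

    # Built-in compression/decompression benchmark; -mmt pins the thread count
    7z b -mmt"$(nproc)"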

Redis

redis is a popular new Web technology to help online applications scale. This is an in-memory key value store, making it memory bandwidth- and CPU performance-bound. It's an emerging technology with a strong developer base.
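redis also ships with its own load generator, which is the usual way to exercise it. A sketch against a locally running redis-server with default settings:

    # One million SET and GET requests from 50 parallel clients
    redis-benchmark -t set,get -n 1000000 -c 50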

Sysbench CPU

Sysbench is another venerable benchmarking application. It is extremely easy to use and, for this test, we are only focusing on CPU performance.
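The prime-number search is sysbench's built-in CPU test. A sketch using the 0.4.x-era syntax current at the time of writing (newer releases switched to "sysbench cpu run"):

    # Single-threaded, then one thread per logical CPU, searching for primes up to 20,000
    sysbench --test=cpu --cpu-max-prime=20000 --num-threads=1 run
    sysbench --test=cpu --cpu-max-prime=20000 --num-threads="$(nproc)" run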

You can easily replicate these tests by downloading and running them individually. The Linux-Bench script's parameters have been profiled across more than 100 different systems, from low-end Atoms up to quad-socket Xeon and Opteron servers, thanks to an active community participating and posting edits on GitHub.

7. Benchmark Results

For our first set of benchmarks, we are going to look at the most common suites we ran, including UnixBench (both in single and multi-threaded modes), HardInfo, sysbench, and STREAM.

UnixBench 5.1.3

One way that Intel keeps thermals manageable on the more complex Haswell-EP-based CPUs is scaling back clock rate. For example, the Xeon E5-2699 v3 operates at just 2.3 GHz, which is 300 MHz less than the -2690 v3. Single-threaded performance is still highly relevant in server workloads though, which is why Turbo Boost technology exists. A great example of this is Minecraft, which went from an obscure title to a phenomenon. The game server was bottlenecked by single-threaded performance, compelling many admins to use Xeon E3s in a quest for higher frequencies.

In our first UnixBench Whetstone/Dhrystone run, we ran the test in single-threaded mode.

Single-threaded Whetstone is relatively consistent between the three processors, despite a 700 MHz difference between the base clock rates of Intel's Xeon E5-2690 v2 and -2699 v3.

Single-threaded Dhrystone is a different story; the Xeon E5-2690 v1 pulls ahead by almost 10%. Despite the scaling of this chart, however, the results are really fairly close, even if we'd typically expect the architectural improvements rolled into Haswell to convey significant advantages over Sandy Bridge.

We can turn to the multi-threaded results to see more notable changes.

As we might expect, the threaded results illustrate that adding cores helps scale performance in workloads properly optimized for multi-core designs. The Xeon E5-2699 v3 puts up a greater-than-2x performance improvement versus the -2690 v1, which was top-of-the-line in its day.

We clearly see the evolution of Intel's Xeon E5-2690 line-up from its first iteration to the v3 version. The other standout is the Xeon E5-2699 v3, which shows that 18 cores and 36 threads per processor deliver huge gains in a parallelized task, particularly compared to the once-fastest Xeon E5-2690.

HardInfo

This is certainly less dramatic than our Whetstone and Dhrystone results, but there is still solid scaling.

Our next tests are the Fibonacci sequence calculation and FPU FFT module.

Higher core counts again benefit the v3 processors.

In all three metrics, we see linear improvements from one generation to the next, as the Xeon E5-2699 v3 pulls ahead. Intel's original Xeon E5-2690 was a 2.9 GHz part, and the -2690 v2 stepped up to 3 GHz, so the fact that lower-frequency v3s maintain a lead is telling.

Sysbench CPU

Searching for prime numbers is a math problem that can be parallelized easily. As a result, it scales well with additional cores.

The Haswell-EP parts are on par with Sandy Bridge-EP and Ivy Bridge-EP when it comes to single-threaded performance. Of course, we know from the growing core counts that Intel is putting its emphasis on extra execution resources, rather than burning TDP on peak clock rates. So, maintaining the status quo there was likely deemed acceptable. But load down all available cores and it's easy to see where Haswell-EP has its greatest impact.

STREAM

We did make one adjustment to the test configurations before running these tests. After noticing that our Xeon E5-2699 v3-based control server and the Supermicro-sourced boxes were scoring similarly, we decided to create a little side experiment, giving the -2699 v3 four 16 GB DDR4 DIMMs per processor. The Xeon E5-2690 v3 received eight 8 GB RDIMMs per processor to match the first-gen and v2 platforms.

The results show both the impacts of adding more memory and the nice scaling we get moving from 1600 MT/s DDR3L to 2133 MT/s DDR4. There is clearly a performance benefit attributable to the new standard; it's not just about power consumption.

8. More Benchmark Results

Next up are our application-specific benchmarks, including c-ray 1.1, NAMD, NPB, p7zip, redis, and OpenSSL. At some point in the future, optimizations for Haswell-EP's advanced instructions may find their way into these titles. But for now, the performance we're reporting represents the current state of affairs. Specifically, AVX 2.0 would likely have a major impact on the results.

c-ray 1.1

Linux-Bench actually runs three different c-ray tests. The first is dubbed "easy", and is great for showing performance differences between Atom processors and desktop CPUs. We excluded that measurement because all three platforms finish it in under one second. Instead, we are using the much tougher command cat sphfract | ./c-ray-mt -t $threads -s $resolution -r 8 to demonstrate differences between these platforms.

Ray tracing generally scales well with both CPU frequency and core count; we see both trends in action as the Xeon E5-2600 v3s pull ahead.

While the 1920x1200 test responds readily to more execution resources, the 3840x2160 benchmark doesn't. Some of that may be due to the -2690 v3's 300 MHz per core advantage. Still, the scaling of the Xeon E5-2690 from one generation to the next is made obvious.

NAMD

Our NAMD tests use molecular modeling to tax these server platforms. For anyone involved in projects like Folding@Home, these are the types of workloads that fully utilize multi-threaded processors.

Haswell-EP has little trouble showing off its strengths.

The first-gen Xeon E5 and v2 results aren't what most folks would expect. However, Ivy Bridge-EP had a nasty habit of getting aggressive on power-saving, dropping all cores to lower P-states when demand dropped. That may be what's happening here. In contrast, the Xeon E5-2600 v3s control this on a per-core basis, so the impact of turning cores on and off isn't reflected as painfully in the performance benchmarks.

NPB

For a test with "Parallel Benchmark" in its name, we're expecting Haswell-EP's high core counts to yield big performance numbers. 

Whereas we see relatively pedestrian improvements going from the first-gen Xeon E5-2690 to the Haswell-EP-based variant, Intel's -2699 v3 finishes way ahead of the other CPUs. Since this was repeatable, I'm hypothesizing that the problem being solved fits into the big die's 45 MB L3 cache.

P7zip

p7zip is a standard compression benchmark. Generally, these types of algorithms are able to take advantage of many threads. I'd guess that the Haswell-EP parts are able to overcome small frequency deficits to finish with a lead, thanks to their IPC throughput advantage and core count.

There is a linear-looking performance improvement stepping between each generation of Intel's Xeon E5-2690. The Xeon E5-2699 v3 again shows off what extra cores can do in a workload able to utilize them, posting an approximately 2x increase over the first-gen Xeon E5-2690.

Redis

Redis is an in-memory application, so core count has less of an overall impact.

As I expected, the results fall much closer to each other, looking a lot like our STREAM results. Still, the configuration with one 16 GB DDR4 DIMM per channel does pull ahead.

OpenSSL

Again, OpenSSL is widely used, so this is perhaps one of the most applicable benchmarks for Web servers. Some companies are pushing for broader use of SSL to keep data encrypted, making the metric particularly important.

The Haswell-EP-based parts scale well. In particular, the Xeon E5-2699 v3 shows a greater than 2x performance improvement over Intel's once-top-of-the-line Xeon E5-2690.

9. Power Consumption Results

Intel has this cadence where its latest architectures roll out on the mobile/desktop side, and are then followed in the high-end workstation/server space. We saw the benefits of 22 nm manufacturing first from Ivy Bridge, and then with Ivy Bridge-EP. Then, Haswell integrated the platform's voltage regulation circuitry for tighter control of power on-package. Needless to say, we were expecting a notable improvement from Haswell-EP to follow some of the gains already measured using desktop-class offerings, and we got it.

Haswell-EP incorporates a number of technologies that we anticipated would better the power consumption story. First, it moves much of the power delivery circuitry that sat on the motherboard in Intel's first- and second-gen Xeon E5 platforms on-package. Haswell-EP can also control P-states on a per-core basis, allowing cores to be granularly spun up and down as demand dictates. Intel claims up to 36%-lower power consumption from its Per-Core P-States (PCPS).

Then there's DDR4 memory, which, in addition to increasing data rates, also employs a lower input voltage. In most desktops, sub-1 W savings per module doesn't mean much. In a server, however, where you might have eight DIMMs per CPU and multiple processors per node, hacking away at total platform power a few watts at a time adds up quickly.

For companies looking to upgrade from three-generation-old Westmere-EP processors, there is another major difference. The new Wellsburg PCH (Intel C610 series) runs extremely cool and can control 10 SATA-based drives using standard ports. Back when Westmere was modern, the only way to get lots of PCIe connectivity in a server for expanded storage was adding a second IOH36 chip. That component required a decent heat sink and plenty of airflow to cope with its heat. When two of them were in a system, they became a significant cooling consideration. Since Haswell-EP employs 40 lanes of PCIe 3.0 on-die, and is mated to a C610 PCH manufactured on a newer process, you get big platform power savings compared to pre-Sandy Bridge-EP systems. Servers sporting Westmere-EP started coming off of lease en masse in early 2014, so they're the ones most likely to be replaced by Haswell-EP.

To generate some hard data, we took our 1U Supermicro test bed and allowed it to idle with no PCIe expansion cards installed (only the on-board networking controllers were active). The results were awesome:

Remember, our Haswell-EP-based server sports two 135 W Xeon E5-2690 v3s and 16 eight-gigabyte memory modules. It also uses redundant cooling, which is great in a datacenter environment, but not particularly power-friendly. Even still, the takeaway is that the Haswell-EP-based system's idle power consumption is extremely low.

Then we fired up three instances of well-threaded tests using c-ray 1.1, sysbench CPU (prime solver), and STREAM concurrently. The results were interesting; mainly, the Xeon E5-2699 v3 drew quite a bit more power than we were expecting. Granted, in many of our performance benchmarks, those same CPUs deliver greater than 2x improvements over the first-gen Xeon E5-2690. That's what will trigger consolidation of older machines into fewer Xeon E5-2600 v3-based boxes during this refresh cycle.
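For anyone wanting to replicate the load portion of that test, the idea is simply to run all three workloads at once while logging wall power on an external meter. A hypothetical sketch, reusing the binaries and inputs described earlier in this review:

    # Launch STREAM, c-ray, and sysbench concurrently, then wait for all three to finish
    OMP_NUM_THREADS="$(nproc)" ./stream > /dev/null &
    cat sphfract | ./c-ray-mt -t "$(nproc)" -s 3840x2160 -r 8 > /dev/null &
    sysbench --test=cpu --cpu-max-prime=20000 --num-threads="$(nproc)" run > /dev/null &
    wait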

10. Haswell-EP Evolves The Server And Workstation

The night Chris Angelini was writing his review of the Core i7-4770K for Tom's Hardware and I was doing the same, we both reached similar conclusions: Haswell on the desktop is not a big deal. But Haswell-EP is a completely different story. Intel uses its advanced manufacturing to enable more cores, more cache, and a redesigned memory controller able to support DDR4. All of that comes together to yield a big step up compared to Ivy Bridge-EP. When you consider that these CPUs replace parts in servers with four to eight cores, the potential gains are substantial. Delivering twice the performance in a similar form factor makes it easy for any business to at least consider consolidating their hardware infrastructure.

When it comes to power consumption, we already know that Haswell was designed to service the mobile space. This has some favorable implications in the server world too. Of course, the difference is that Haswell-EP-based CPUs are much larger (and multiplied in a dual-socket configuration), so all gains are amplified.

In terms of performance per core, unless your software is optimized to exploit AVX 2.0, Haswell's biggest benefits come from the architecture's inherent IPC tweaks. Where Haswell-EP really shines is its higher core counts that help scale performance accordingly in well-parallelized workloads.

DDR4 memory support is perhaps the most next-generation aspect of Intel's new Xeon E5-2600 v3 processors. In time, we will likely see higher data rates, increased density, and potentially lower-power versions of the standard. Unlike DDR3, DDR4 is still supply-constrained, so new servers are going to be priced higher until memory vendors catch up. Right now, the market is split. Most consumer devices are tied to DDR3; Haswell-E/EP is the first design pushing DDR4. That'll change slowly. But for now, there are quantifiable power and performance benefits to justify the eventual adoption of what currently appears to be a ridiculously expensive technology.

Reflecting on the press day that Intel hosted to introduce Haswell-EP, higher core counts, DDR4, and advanced ISA support were the most obvious platform changes. But the company's Fortville adapters are arguably even more exciting to me. The doors opened by a low-power controller capable of two 40 GbE interfaces or eight 10 Gb links cannot be ignored. I have been using Mellanox ConnectX-3 VPI adapters for quite some time in 40 Gb Ethernet mode. But the power consumption benefits of Intel's technology compelled me to go out and buy a new 40 Gb Ethernet switch.

Truly, this is the march of progress. More IPC throughput, a greater number of cores, more memory, and beefier I/O to exploit the platform's bolstered data handling capabilities translate to further consolidation of workloads. Intel is clearly driving towards a software-defined vision and takes a major step toward that goal with its Xeon E5-2600 v3 introduction. Then again, the way Intel presents its strategy addresses a more complete datacenter solution. Much like HP, Intel no longer pitches the Xeon as a new, faster processor on its own (even if it is). Instead, the company has a holistic goal for driving compute, storage, and networking performance over the next few years. Haswell-EP is the showcase for that.