IBM, Stone Ridge Technology, Nvidia Break Supercomputing Record

It's no secret that GPUs are inherently better suited than CPUs to complex parallel workloads. IBM's latest collaborative effort with Stone Ridge Technology and Nvidia shone a light on the efficiency and performance gains GPUs bring to the reservoir simulations used in oil and gas exploration. The industry operates on the cutting edge of computing due to its massive data sets and complex simulations, so it is fairly common for companies to conduct technology demonstrations using these taxing workloads.

The effort began with 30 IBM Power S822LC for HPC (Minsky) servers outfitted with 60 IBM POWER8 processors (two per server) and 120 Nvidia Tesla P100 GPUs (four per server). The servers employed Nvidia's NVLink technology for both CPU-to-GPU and peer-to-peer GPU communication and used InfiniBand EDR networking.

The companies conducted a 100-billion-cell engineering simulation on GPUs using Stone Ridge Technology's ultra-scalable ECHELON petroleum reservoir simulator. The simulation modeled 45 years of oil production in a mere 92 minutes, easily breaking the previous record of 20 hours. The time savings are impressive, but they pale in comparison to the hardware savings. 

ExxonMobil set the previous 100-billion-cell record in January 2017 with an impressive effort of its own--the company employed 716,800 processor cores spread among 22,400 nodes on NCSA's Blue Waters supercomputer (a Cray XE6). That setup requires half a football field of floor space, whereas the IBM-powered systems fit within two racks that occupy roughly half of a ping-pong table. The GPU-powered servers also required only one-tenth the power of the Cray machine.

The entire IBM cluster weighs in at roughly $1.25 million to $2 million, depending upon the memory, networking, and storage configuration, whereas the Exxon system would cost hundreds of millions of dollars. As such, IBM claims it offers faster performance in this particular simulation at 1/100th the cost.
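As a rough sanity check on that 1/100th figure (the Blue Waters price below is an assumed order-of-magnitude stand-in for "hundreds of millions," not a published number):

```python
# Back-of-the-envelope check of IBM's "1/100th the cost" claim.
# The Cray figure is an assumed order-of-magnitude estimate; the
# IBM price range is quoted above.
ibm_cluster_costs = (1.25e6, 2.0e6)   # USD, depending on configuration
cray_cost_estimate = 200e6            # USD, illustrative assumption

for cost in ibm_cluster_costs:
    print(f"${cost / 1e6:.2f}M cluster -> {cray_cost_estimate / cost:.0f}x cheaper")
# Prints 160x and 100x, consistent with the claim.
```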

Such massive simulations are rarely used in the field, but the demonstration does highlight the performance advantages of GPUs over CPUs for this class of simulation. Memory bandwidth is a limiting factor in many simulations, so the P100's memory throughput is a key advantage over the CPU-only nodes used in the ExxonMobil tests. From Stone Ridge Technology's blog post outlining the achievement:

“On a chip to chip comparison between the state of the art NVIDIA P100 and the state of the art Intel Xeon, the P100 deliver 9 times more memory bandwidth. Not only that, but each IBM Minsky node includes 4 P100’s to deliver a whopping 2.88 TB/s of bandwidth that can address models up to 32 million cells. By comparison two Xeon’s in a standard server node offer about 160GB/s (See Figure 3). To just match the memory bandwidth of a single IBM Minsky GPU node one would need 18 standard Intel CPU nodes. The two Xeon chips in each node would likely have at least 10 cores each and thus the system would have about 360 cores.”
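The arithmetic in the quote is easy to check. Here is a minimal sketch, with the per-GPU bandwidth inferred from the quoted 2.88 TB/s node total:

```python
# Verify Stone Ridge's bandwidth arithmetic from the quote above.
p100_bw_gbps = 720                  # per Tesla P100 (2.88 TB/s / 4 GPUs)
minsky_node_bw = 4 * p100_bw_gbps   # 2880 GB/s per Minsky node

xeon_node_bw = 160                  # GB/s for a dual-Xeon node, per the quote
cores_per_xeon_node = 2 * 10        # two 10-core Xeons, per the quote

nodes_needed = minsky_node_bw / xeon_node_bw
print(f"{nodes_needed:.0f} Xeon nodes to match one Minsky node")   # 18
print(f"{nodes_needed * cores_per_xeon_node:.0f} CPU cores total") # 360
```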

Xeons are definitely at a memory throughput disadvantage, but it would be interesting to see how a Knights Landing (KNL) cluster would stack up. With up to 500 GB/s of throughput from its on-package MCDRAM (stacked memory co-developed with Micron and derived from HMC rather than HBM), KNL easily beats the Xeon's memory throughput. Intel also claims KNL offers up to 8x the performance per watt of Nvidia's GPUs; that comparison was made against previous-generation Nvidia products, but a gap that large likely wasn't closed in a single generation.

IBM believes its Power Systems servers paired with Nvidia GPUs can help other fields, such as computational fluid dynamics, structural mechanics, and climate modeling, reduce the hardware and cost required for complex simulations. A simulation of this magnitude is hardly realistic for most oil and gas companies, but Stone Ridge Technology has also conducted 32-million-cell simulations on a single Minsky node, which could bring an impressive mix of cost and performance to bear for smaller operators.

Comments from the forums
  • bit_user
    Quote:
    Such massive simulations are rarely used in the field, but the demonstration does highlight the performance advantages of GPUs over CPUs for this class of simulation.

    Interestingly, Blue Waters still has about 10x - 20x the raw compute performance of the new system. It's also pretty old, employing Kepler-era GK110 GPUs.

    https://bluewaters.ncsa.illinois.edu/hardware-summary

    So, this really seems like a story about higher-bandwidth, lower-latency connectivity, rather than simply raw compute power. Anyway, progress is good.

    Quote:
    it would be interesting to see how a Knights Landing (KNL) cluster would stack up.
    A Xeon Phi x200 is probably about half as fast as a GP100 for tasks well suited to the latter. It has about half the bandwidth and half the compute. Where it shines is when your task isn't so GPU-friendly, or when it's legacy code that you can't afford to rewrite in CUDA (or whose source code you don't have access to).
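    Rough peak-spec figures bear that out (the numbers below are my assumptions, approximating a Xeon Phi 7250 and a Tesla P100; exact values vary by SKU):

    ```python
    # Approximate peak specs, assumed for illustration (vary by SKU).
    knl  = {"mem_bw_gbps": 490, "fp64_tflops": 3.0}   # Xeon Phi 7250 (MCDRAM)
    p100 = {"mem_bw_gbps": 720, "fp64_tflops": 5.3}   # Tesla P100 (NVLink)

    for metric in knl:
        print(f"{metric}: KNL/P100 = {knl[metric] / p100[metric]:.2f}")
    # Both ratios land near 0.6 on paper, i.e., roughly half to two-thirds
    # of a GP100 before accounting for achievable efficiency.
    ```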

    Regarding connectivity, the x200 Xeon Phis are at a significant disadvantage. They have only one (optional) 100 Gbps Omni-Path link, whereas the GP100s each have four NVLink ports.
  • Matt1685
    Exxon and Stone Ridge Technology say they simulated a billion-cell model, not a 100-billion-cell model.

    As far as Intel's claims about KNL go, take them with a grain of salt. Firstly, they were probably making a comparison against Kepler processors, which is two generations from Pascal, not one; the P100 probably has 4 times the energy efficiency of the processor they used for the comparison. Secondly, any comparison of performance depends highly on the application. Thirdly, KNL uses vector processors which I think might be more difficult in practice to use at high efficiency than NVIDIA's scalar ALUs. Fourthly, KNL is restricted to a one-processor-per-node architecture, which makes a lot of applications harder to scale when compared to Minsky's 2 CPU + 4 GPU per node topology; there would also be much less high-speed memory and much lower aggregate memory bandwidth per node. BIT_USER has already pointed out some of these things.
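    If both of those estimates hold, the implied remaining gap is small (a sketch, assuming Intel's 8x figure was measured against Kepler and using the ~4x Kepler-to-Pascal estimate above):

    ```python
    # Assumed figures: Intel's 8x perf/W claim measured against a Kepler-era
    # GPU, and a ~4x Kepler-to-Pascal perf/W improvement (the estimate above).
    intel_claim_vs_kepler = 8.0
    kepler_to_pascal_gain = 4.0
    print(intel_claim_vs_kepler / kepler_to_pascal_gain)  # ~2x vs. a P100, at best
    ```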

    In reply to BIT_USER: the way it sounds, Exxon used only the XE CPU nodes for the simulation, not the XK GPU nodes. Each XE node consists of two AMD Opteron 6276 processors, which came out in November 2011. NCSA counts each 6276 as having 8 cores, but each CPU can run 16 threads concurrently, and Exxon seems to be counting threads to get its 716,800 processors. The XE nodes have 7.1 PF of peak compute according to NCSA; the P100s in the Stone Ridge simulation have about 1.2 PF. So, the increased performance has a lot to do with the increased memory bandwidth available, and perhaps with the topology (fewer, stronger nodes) and the more modern InfiniBand EDR interconnect.
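    A quick check of that math, using only the figures cited above:

    ```python
    # Exxon's processor count: threads, not cores.
    xe_nodes = 22_400
    cpus_per_node = 2
    threads_per_cpu = 16                 # Opteron 6276: 16 concurrent threads
    print(xe_nodes * cpus_per_node * threads_per_cpu)   # 716800

    # Peak-compute comparison: Blue Waters' XE nodes vs. the 120 P100s.
    print(f"{7.1 / 1.2:.1f}x")           # ~5.9x more peak compute for Blue Waters
    # Yet the GPU cluster ran ~13x faster (92 min vs. 20 hours), which points
    # to memory bandwidth and interconnect rather than raw FLOPS.
    ```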

    Also, as far as using KNL when you can't afford to rewrite using CUDA: NVIDIA GPUs are generally used in traditional HPC using OpenACC or OpenMP. They don't need to use CUDA. Used that way, they still seem to hold up well against Xeon Phis, both in raw performance and in the time it takes to optimize code, as HPC customers are still using GPUs. Stone Ridge, however, surely has optimized CUDA code.

    Blue Waters was built in 2012, so the supercomputer is 4-5 years old. The Stone Ridge Technology simulation is still very impressive, but the age of Blue Waters should be kept in mind when considering the power efficiency, cost, and performance comparisons.
  • bit_user
    Matt1685 said:
    Thirdly, KNL uses vector processors which I think might be more difficult in practice to use at high efficiency than NVIDIA's scalar ALUs.
    I was with you until this. Pascal's "Streaming Multiprocessor" cores each have 4x 16-lane (fp32) SIMD units. All GPUs rely on SIMD engines to get the big compute throughput numbers they deliver. KNL features only two such pipelines per core.
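    To put chip-level numbers on that (the SM and core counts below are the commonly published configurations for a Tesla P100's GP100 and a Xeon Phi 7250, used here as assumptions):

    ```python
    # fp32 SIMD lanes per chip, from commonly published configurations.
    gp100_lanes = 56 * 4 * 16   # 56 SMs x 4 SIMD units x 16 lanes = 3584
    knl_lanes   = 68 * 2 * 16   # 68 cores x 2 AVX-512 VPUs x 16 lanes = 2176
    print(gp100_lanes, knl_lanes, round(knl_lanes / gp100_lanes, 2))   # 0.61
    ```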

    Aside from getting good SIMD throughput, KNL (AKA Xeon Phi x200) cores are no more difficult to program than any other sort of x86 core. That said, its "cluster on a chip" mode is an acknowledgement of the challenges posed by offering cache coherency at such a scale.

    Matt1685 said:
    So, the increased performance has a lot to do with the increased memory bandwidth available, and perhaps with the topology (fewer, stronger nodes) and the more modern InfiniBand EDR interconnect.
    Right. That's what I was getting at.

    Matt1685 said:
    NVIDIA GPUs are generally used in traditional HPC using OpenACC or OpenMP. They don't need to use CUDA.
    This still assumes even recompilation is an option. For most HPC code, this is a reasonable assumption. But the KNL Xeon Phi isn't strictly for HPC.

    Matt1685 said:
    Blue Waters was built in 2012, so the supercomputer is 4-5 years old. The Stone Ridge Technology simulation is still very impressive, but the age of Blue Waters should be kept in mind when considering the power efficiency, cost, and performance comparisons.
    Right. I was also alluding to that.

    Progress is good, but it's important to look beyond the top line numbers.

    Anyway, the next place where HMC/HBM2 should have a sizeable impact is mobile SoCs. We should anticipate a boost in both raw performance and power efficiency. And when high-speed in-package memory eventually makes its way into desktop CPUs from Intel and AMD, it will remove a bottleneck that has been holding back iGPUs for a while.