SuperComputing 2017
The annual SuperComputing 2017 conference finds the brightest minds in HPC (High Performance Computing) descending from all around the globe to share the latest in bleeding-edge supercomputing technology. The six-day show is packed with technical breakout sessions, research paper presentations, tutorials, panels, workshops, and the annual student cluster competition.
Of course, the conference also features an expansive show floor brimming with the latest in high-performance computing technology. We traveled to Denver, Colorado to take in the sights and sounds of the show and found an overwhelming number of exciting products and demonstrations. Some of them are simply mind-blowing. Let's take a look.
LiquidMIPS Immersion Cooling Nvidia GTX Titan X
HPC is rapidly transitioning to AI-centric workloads that are a natural fit for expensive purpose-built GPUs, but it's common to find standard desktop GPUs used in high-performance systems. LiquidMIPS designs custom two-phase Fluorinert immersion cooling systems for HPC deployments. The company displayed an Nvidia GeForce GTX Titan X overclocked to 2,000 MHz (120% of TDP) running at a steady 57C.
Supercomputer cooling is a challenge that vendors tackle in a number of ways. Immersion cooling affords several advantages, primarily in thermal dissipation and system density. Simply put, using various forms of immersion cooling can increase cooling capacity and allow engineers to pack more power-hungry components into a smaller space. Better yet, it often provides significant cost savings over air-cooled designs.
LiquidMIPS' system is far too large to sit on your desk, but we found some other immersion cooling systems that might be more suitable for that task.
The Nvidia Volta Traveling Roadshow
The rise of AI is revolutionizing the HPC industry. Nvidia's GPUs are among the preferred solutions for supercomputers, largely due to the company's decade-long investment in CUDA, its widely supported parallel computing and programming model.
The company had a series of "Volta Tour" green placards peppered across the show floor at each Nvidia-powered demo. The Tesla Volta GV100 GPUs were displayed in so many booths, and with so many companies, that we lost count. We've picked a few of the most interesting systems, but there were far too many Volta demos to cover them all.
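CUDA kernels are normally written in C/C++, but for a quick flavor of the programming model Nvidia has invested in, here's a minimal SAXPY kernel expressed through Numba's CUDA bindings in Python. This is purely our illustration of the general idea, not Nvidia sample code:

```python
# Minimal illustration of the CUDA programming model via Numba's CUDA bindings.
# Requires an Nvidia GPU, the CUDA toolkit, and `pip install numba numpy`.
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)              # global thread index
    if i < out.size:              # guard against out-of-range threads
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = np.ones(n, dtype=np.float32)
y = np.arange(n, dtype=np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks, threads_per_block](np.float32(2.0), x, y, out)  # Numba handles the host/device copies
print(out[:3])                    # [2. 3. 4.]
```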
Easy PEZY Triple-Volta GV100 Blade
PEZY Supercomputing may not be a household name, but the Japanese company's immersion-cooled systems are a disruptive force in the industry. Its ZettaScaler-2.2 HPC system (Xeon-D) is the fourth fastest in the world, with 19 PetaFLOPS of computing power.
But being the absolute fastest isn't really PEZY's goal. The company holds the top three spots on the Green500, the list of the most power-efficient supercomputers in the world. The company's Xeon-D powered models offer up to 17.009 GigaFLOPS per watt, an efficiency made possible largely by immersion cooling with 3M's Fluorinert Electronic Liquid.
PEZY had its new ZettaScaler-2.4 solution on display at the show. The system begins with a single "sub-brick" that pairs three Nvidia Tesla V100-SXM2 GPUs with an Intel Skylake-F Xeon processor. The GPUs communicate over the speedy NVLink 2.0 interface, which offers up to 150GB/s of throughput, and the system communicates via Intel's 100Gb/s Omni-Path networking.
Now, let's see how they cool them.
PEZY Immersion Cooling Nvidia Volta GV100s
PEZY outfits the GPUs and processor with standard heatsinks and submerges them in an immersion cooling tank. The company inserts six "sub-bricks" into slotted canisters to create a full "brick." Sixteen bricks, wielding 288 Tesla V100s, can fit into a single tank. The bricks communicate with each other over four 48-port Omni-Path switches. A single tank includes 4.6TB of memory.
A single tank generates a peak of 2.16 double-precision PetaFLOPS and 34.56 half-precision PetaFLOPS. The hot-swappable modules can easily be removed for service while the rest of the system chugs along. Some PEZY-powered supercomputers use up to 26 tanks in a single room, meaning the company can pack 7,488 Tesla V100s and 2,496 Intel Xeons into a comparatively small space.
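The density figures are easier to appreciate as plain arithmetic. Here's a quick back-of-the-envelope check (our own tally; the per-GPU peak rates of roughly 7.5 FP64 and 120 FP16 TeraFLOPS are implied by the quoted totals rather than taken from a spec sheet):

```python
# Back-of-the-envelope check of the ZettaScaler-2.4 tank figures quoted above.
SUB_BRICKS_PER_BRICK = 6
GPUS_PER_SUB_BRICK = 3
BRICKS_PER_TANK = 16
TANKS_PER_ROOM = 26

gpus_per_tank = BRICKS_PER_TANK * SUB_BRICKS_PER_BRICK * GPUS_PER_SUB_BRICK  # 288
xeons_per_tank = BRICKS_PER_TANK * SUB_BRICKS_PER_BRICK                      # 96 (one Xeon per sub-brick)

print(gpus_per_tank, gpus_per_tank * TANKS_PER_ROOM)    # 288 GPUs per tank, 7,488 per 26-tank room
print(xeons_per_tank * TANKS_PER_ROOM)                  # 2,496 Xeons per room

# Peak throughput per tank, using the per-GPU rates implied by PEZY's totals
print(gpus_per_tank * 7.5 / 1000, "FP64 PetaFLOPS")     # ~2.16
print(gpus_per_tank * 120 / 1000, "FP16 PetaFLOPS")     # ~34.56
```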
We'll revisit one of PEZY's more exotic solutions (yeah, they've got more) a bit later.
AMD's EPYC Breaks Cover
AMD began discussing its revolutionary EPYC processors at Supercomputing 2016, but this year the silicon finally arrived to challenge Intel's dominance in the data center. Several other contenders, such as Qualcomm's Centriq and Cavium's ThunderX2 ARM processors, are also coming to the forefront, but AMD holds the advantage of the x86 instruction set architecture (ISA). That means its gear is plug-and-play with most common applications and operating systems, which should speed adoption.
AMD's EPYC also has other advantages, such as copious core counts at lower per-core pricing than Intel's Xeons. It also offers 128 PCIe 3.0 lanes on every model. The strong connectivity options play particularly well in the single-socket server space. In AMD's busy booth, we found a perfect example of the types of solutions EPYC enables.
This single-socket server wields 24 PCIe 3.0 x4 connectors for 2.5" NVMe SSDs. The SSDs connect directly to the processor, which eliminates the need for HBAs and thereby reduces cost. Even though the NVMe SSDs consume 96 PCIe 3.0 lanes, the system still has an additional 32 lanes available for other add-in cards, such as high-performance networking adapters. That's simply an advantage that Intel cannot match in a single-socket solution.
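The lane budget behind that claim works out as follows (our own tally of the figures above):

```python
# PCIe 3.0 lane budget for the single-socket EPYC NVMe server described above.
TOTAL_LANES = 128          # exposed by a single EPYC socket
NVME_DRIVES = 24
LANES_PER_DRIVE = 4        # PCIe 3.0 x4 per 2.5" NVMe SSD

used_by_storage = NVME_DRIVES * LANES_PER_DRIVE   # 96 lanes, no HBAs required
remaining = TOTAL_LANES - used_by_storage          # 32 lanes left for NICs and other add-in cards
print(used_by_storage, remaining)                  # 96, 32
```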
We saw EPYC servers throughout the show floor, including from the likes of Tyan and Gigabyte. Things are looking very positive for AMD's EPYC, and we expect to see many more systems next year, such as HPE's recently announced ProLiant DL385 Gen 10 systems. Dell EMC is also bringing EPYC PowerEdge servers to market this year.
Trust Your Radeon Instincts
AMD recently began shipping its three-pronged Radeon Instinct lineup for HPC applications. Nvidia's Volta GPUs may have stolen the Supercomputing show, but they have also been on the market much longer. We found AMD's Radeon Instinct MI25 in Mellanox's booth. The Vega-powered MI25 slots in for compute-intensive training workloads, while the Polaris-based MI6 and Fiji-based MI8 handle less-intense workloads, such as inference.
AMD also announced the release of a new version of ROCm, an open-source set of programming tools, at the show. Version 1.7 supports multi-GPU deployments and adds support for the popular TensorFlow and Caffe machine learning frameworks.
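From the user's side, a ROCm-backed TensorFlow build should behave like any other TensorFlow install. A quick (hypothetical) way to confirm the framework actually sees the Instinct card might look like this:

```python
# Hypothetical sanity check that a ROCm-backed TensorFlow build can see the Instinct GPU.
# device_lib ships with stock TensorFlow and simply enumerates available compute devices.
from tensorflow.python.client import device_lib

gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
print(gpus)  # expect something like ['/device:GPU:0'] when the MI25 is usable
```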
AMD Radeon Instincts In A BOXX
Stacking up MI25s in a single chassis just became easier with AMD's release of ROCm 1.7, and here we see a solution from BOXX that incorporates eight Radeon Instinct MI25s into a dual-socket server. It's not a coincidence that AMD's EPYC processors expose 128 PCIe 3.0 lanes to the host. AMD is the only vendor that manufactures both discrete GPUs and x86 data center processors, which the company feels will give it an advantage in tightly integrated solutions.
BOXX also recently added an Apexx 4 6301 model with an EPYC 16-core/32-thread processor to its workstation lineup.
Here Comes The ThunderX2
Red Hat recently announced that it has finally released Red Hat Enterprise Linux (RHEL) for ARM after seven long years of development. The new flavor of Linux supports 64-bit ARMv8-A server-class SoCs, which paves the way for broader industry adoption. As such, it wasn't surprising to find HPE's new Apollo 70 system in Red Hat's booth.
Cavium's 14nm FinFET ThunderX2 processor offers up to 54 cores running at 3.0GHz, so a single dual-socket node can wield 108 cores of compute power. Recent benchmarks of the 32C ThunderX2 SoCs came from the team building the "Isambard" supercomputer at the University of Bristol in the UK. The team used its 8-node cluster to run performance comparisons to Intel's Broadwell and Skylake processors. The single-socket ThunderX2 system proved to be faster in OpenFOAM and NEMO tests, among several others.
Gigabyte had its newly-anointed ThunderX2 R181 servers on display, while Bull/Atos and Penguin also displayed their latest ThunderX2 wares. Cray also demoed its XC50 supercomputer, which will be available later this year. We can expect to hear of more systems from leading-edge OEMs in the near future.
IBM's Power9 To The Summit Supercomputer Rescue
The battle for the leading position in the Top500 list of supercomputers is pitched, and Oak Ridge National Laboratory's new Summit supercomputer, projected to be the fastest in the world, should rocket the U.S. back into the lead over China. IBM's Power Systems AC922 nodes serve as the backbone of Summit's bleeding-edge architecture. We took a close look inside the Summit server node and found two Power9 processors paired with six Nvidia GV100s. These high-performance components communicate over both the PCIe 3.0 interface and NVLink 2.0, which provides up to 100GB/s of throughput between the devices.
Summit will offer ~200 PetaFLOPS, which should easily displace China's (current) chart-topping 93-PetaFLOPS Sunway TaihuLight system. We took a deep look inside the chassis in a recent article, so head there for more detail.
Qualcomm's Centriq 2400 Makes An Entrance
Qualcomm's Centriq 2400 lays claim to the title of the industry's first 10nm server processor, though Qualcomm's 10nm process is similar in density to Intel's 14nm node. Like the Cavium ThunderX2, the Centriq processors benefit significantly from Red Hat's recently added support for the ARM architecture. For perspective, every single supercomputer on the Top500 list now runs on Linux.
The high-end Centriq 2460 comes bearing 48 "Falkor" 64-bit ARM v8-compliant cores garnished with 60MB of L3 cache. The processor die measures 398mm2 and wields 18 billion transistors.
The chip operates at a base frequency of 2.2 GHz and boosts up to 2.6 GHz. Surprisingly, the 120W processor retails for a mere $1,995. Qualcomm also has 46- and 40-core models on offer for $1,373 and $888, respectively. A quick glance at Intel's recommended pricing for Xeon processors drives home the value. Prepare to spend $10,000 or more for comparable processors.
Performance-per-watt and performance-per-dollar are key selling points of ARM-based processors. Qualcomm has lined up a bevy of partners to push the platform out to market, and considering the recent Cloudflare benchmark results, the processors are exceptionally competitive against Intel's finest.
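The per-core value is easy to quantify from the list prices above. Note that the specific model numbers for the 46- and 40-core parts are our assumption, and the $10,000, 28-core Xeon is just a representative comparison point:

```python
# Rough price-per-core math from the Centriq list prices quoted above.
# The 2452/2434 model numbers for the 46- and 40-core parts are assumed for illustration.
centriq = {"Centriq 2460 (48 cores)": (1995, 48),
           "Centriq 2452 (46 cores)": (1373, 46),
           "Centriq 2434 (40 cores)": (888, 40)}

for model, (price, cores) in centriq.items():
    print(f"{model}: ${price / cores:.0f} per core")   # roughly $42, $30, and $22 per core

# A hypothetical $10,000, 28-core Xeon for comparison: ~$357 per core
print(f"Xeon example: ${10_000 / 28:.0f} per core")
```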
D-Wave's Mindbending Quantum Computing
Quantum computing may be the next wave in supercomputing, and a number of companies, including industry behemoths like Intel, Google, and IBM, are drawing the battle lines. Those three industry icons are focusing on quantum logic gates, which are more accessible to programmers and mathematicians but are unfortunately harder to manufacture.
D-Wave's latest quantum processor can search a solution space that is larger than the number of particles in the observable universe--and it accomplishes the feat in 1 microsecond. The company has embraced annealing, which has more robust error correction capabilities. D-Wave announced at the show that its 2000Q system now supports reverse annealing, which can solve some problems up to 150X faster than the standard annealing approach.
D-Wave's quantum chips are built on a standard CMOS process with an overlay of superconducting materials. The company then cools the processor to 400X colder than interstellar space to enable processing. D-Wave's 2,000-Qubit 2000Q supercomputer comes with a hefty $15 million price tag, though the company also offers a cloud-based subscription service.
PEZY's 16,384-thread SC2 Processor
It's hard not to be enamored with PEZY's ultra-dense immersion-cooled supercomputers, but the PEZY-SC2 Many-Core processor is truly a standout achievement.
The 2,048-core chip (MIMD) operates at 1.0 GHz within a 130W TDP envelope. The chip features 56MB of cache memory and supports SMT8, so each core executes eight threads in parallel. That means a single PEZY-SC2 processor brings 16,384 threads to bear. It also features the PCIe Gen 4.0 interface.
Among other highlights, the chip sports a quad-channel memory interface that supports up to 128GB of memory with 153GB/s of bandwidth. That may seem comparatively small, but PEZY's approach relies upon deploying these processors en masse on a spectacular scale.
The 620mm2 die, built on TSMC's 16nm process, offers 8.192 TeraFLOPS of single-precision and 4.1 TeraFLOPS of double-precision compute.
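The peak figures follow directly from the core count and clock; working backwards, each core must retire four single-precision FLOPs per cycle (our inference from the published totals, not a PEZY spec):

```python
# Reconstructing the PEZY-SC2 peak figures from its published core count and clock.
CORES = 2048
THREADS_PER_CORE = 8              # SMT8
CLOCK_GHZ = 1.0
SP_FLOPS_PER_CORE_PER_CYCLE = 4   # inferred from the quoted 8.192 TFLOPS peak

print(CORES * THREADS_PER_CORE)                                     # 16,384 threads per chip
print(CORES * CLOCK_GHZ * SP_FLOPS_PER_CORE_PER_CYCLE / 1000)       # 8.192 single-precision TFLOPS
print(CORES * CLOCK_GHZ * SP_FLOPS_PER_CORE_PER_CYCLE / 2 / 1000)   # ~4.1 double-precision TFLOPS
```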
Let's see how PEZY integrates these into a larger supercomputer.
The PEZY-SC2 Brick
PEZY uses its custom SC2 processors, which serve as accelerators, to create modules that are then assembled into bricks, much like we saw with the company's V100 blades. The PEZY-SC2 brick consists of 32 modules, each with its own allotment of DRAM, and one 16-core Xeon-D processor is deployed per eight PEZY-SC2 processors. The brick also includes slots for four 100Gb/s network interface cards.
Now the completed brick is off to the immersion tank.
PEZY-SC2 Brick Immersion Cooling
A single ZettaScaler immersion tank can house 16 bricks, which works out to 512 PEZY-SC2 processors. That means each tank brings 1,048,576 cores and 8,388,608 threads to bear. Those are mind-bending numbers, sure, but perhaps the most impressive part is how small the immersion tank is. That's performance density at its finest.
The Gyoukou supercomputer employs 10,000 PEZY-SC2 chips spread over 20 immersion tanks, but the company "only" enabled 1,984 cores per chip, which equates to 19,840,000 cores. Each core wields eight threads, which works out to more than 158 million threads. That system also includes 1,250 sixteen-core Xeon-D processors, adding another 20,000 Intel cores. In aggregate, that's the most cores ever packed into a single supercomputer.
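The tank- and system-level totals are the same multiplication carried further (a quick tally of the figures above):

```python
# Core and thread totals for a full ZettaScaler SC2 tank and for Gyoukou, from the figures above.
CHIPS_PER_TANK = 16 * 32           # 16 bricks x 32 modules = 512 PEZY-SC2 chips
CORES_PER_CHIP = 2048
THREADS_PER_CORE = 8

print(CHIPS_PER_TANK * CORES_PER_CHIP)                      # 1,048,576 cores per tank
print(CHIPS_PER_TANK * CORES_PER_CHIP * THREADS_PER_CORE)   # 8,388,608 threads per tank

# Gyoukou: 10,000 chips with 1,984 cores enabled each, plus 1,250 sixteen-core Xeon-D hosts
gyoukou_cores = 10_000 * 1_984
print(gyoukou_cores, gyoukou_cores * THREADS_PER_CORE)      # 19,840,000 cores, 158,720,000 threads
print(1_250 * 16)                                           # 20,000 Intel cores
```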
The Gyoukou system is ranked fifth on the Green500 list for power efficiency and fourth on the Top500 list for performance. It delivers up to 19,135 TFLOPS and 14.173 GFLOPS-per-watt.
Incredible numbers aside, the company also has its PEZY-SC3 chips in development, with 8,192 cores and 65,536 threads per chip. That means the core and thread counts of its next-generation ZettaScaler systems will be, well, beyond astounding. We can't wait.
The Vector Processor
NEC is the last company in the world to manufacture vector processors for supercomputers. NEC's Vector Engine full-height, full-length PCIe boards serve as the key components of the SX-Aurora TSUBASA supercomputer.
NEC claimed that its 16nm FinFET (TSMC) Vector core is the most powerful single core in HPC, providing up to 307 GigaFLOPS per 1.6 GHz core. Eight cores come together inside the processor, and each core is fed by 150GB/s of memory throughput. That bleeding-edge throughput comes from six HBM2 packages (48GB) that flank the processor, a record number of HBM2 packages paired with a single processor. In aggregate, they provide a record 1.2 TB/s of throughput to the Vector processor.
All told, the Vector Engine delivers 2.45 TeraFLOPS of peak performance and works with the easy-to-use x86/Linux ecosystem. Of course, the larger supercomputer is just as interesting.
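The headline Vector Engine numbers are simply the per-core figures multiplied out (our arithmetic, assuming 8GB per HBM2 package to reach the quoted 48GB):

```python
# Aggregating the per-core Vector Engine specs quoted above.
CORES = 8
GFLOPS_PER_CORE = 307        # at 1.6 GHz
GBPS_PER_CORE = 150          # memory bandwidth feeding each core
HBM2_PACKAGES = 6
GB_PER_PACKAGE = 8           # assumed, to reach the quoted 48GB total

print(CORES * GFLOPS_PER_CORE / 1000)   # ~2.45 TFLOPS per Vector Engine
print(CORES * GBPS_PER_CORE / 1000)     # 1.2 TB/s aggregate HBM2 bandwidth
print(HBM2_PACKAGES * GB_PER_PACKAGE)   # 48GB of on-package memory
```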
The Vector Engine A300-8 Chassis
NEC's SX-Aurora TSUBASA supercomputer employs eight A300-8 chassis (pictured above) that host eight Vector Engines each. Each rack in the system provides up to 157 TeraFLOPS and 76 TB/s of memory bandwidth while consuming 30kW. NEC's previous-generation SX-ACE model required 10 racks and 170kW of power to achieve the same feat.
What's It Take To Immersion Cool 20 Nvidia Teslas?
We love to watch the pretty bubbles generated by immersion cooling, but there are hidden infrastructure requirements. Allied Control's 8U 2-Phase Immersion Tank, which we'll see in action on the next slide, offers 48kW of cooling capacity, but it requires a sophisticated set of pumps, radiators, and piping to accomplish the feat.
In a data center environment, the large coolers are placed on the outside of the building. In fact, Allied Control powers the world's largest immersion cooling system. The BitFury data center in the Republic of Georgia, Eastern Europe, processes more than five times the workload of the Hong Kong Stock Exchange (HKSE), but accomplishes the feat in 10% of the floor space and uses a mere 7.4% of the exchange's electricity budget for cooling (48MW for HKSE, 1.2MW for BitFury).
So saving money is what it's all about. Let's see the system in action with 20 of Nvidia's finest.
Allied Control Immersion Cooling 20 Tesla GPUs
The Allied Control demo consisted of 20 immersion-cooled Nvidia Tesla GPUs and a dual-Xeon E5 server. The demo system draws 48kW, but the cooling system supports up to 60kW.
The demo uses custom ASIC hardware and a special motherboard. Allied Control replaced the standard GPU coolers with 1mm copper plates, allowing it to place the GPUs a mere 2.5mm apart. That's much slimmer than the normal dual-slot coolers we see on this class of GPU.
Allied Control Immersion Cooling Six Nvidia GeForce 1070s
So, we've seen the big version, but what if we could fit it on a desk? The Allied Control demo in Gigabyte's booth consisted of a single-socket Xeon-W system paired with six Nvidia GeForce 1070s. The GeForce GPUs were running under full load, yet stayed below 66C.
Although the immersion cooling tank itself sits on top of the platform, there was some additional equipment underneath, sealed off inside the cabinet. In any case, we're sure most enthusiasts wouldn't have a problem sacrificing some space for this much bling on their desktop.
Anything You Can Do, OSS Can Do Bigger
One Stop Systems (OSS) holds a special place in our hearts. We put 32 3.2TB Fusion ioMemory SSDs head-to-head against 30 of Intel's NVMe DC P3700 SSDs in the FAS 200 chassis last year. That same core design, albeit with some enhancements in power delivery, can house up to 16 Nvidia Volta V100s in a single chassis. These robust systems are used for mobile data centers, such as for the military, along with supercomputing applications.
Altogether, 16 Volta GPUs wield 336 billion transistors and 81,920 CUDA cores, making this one densely packed solution for heavy parallel compute tasks. If your interests lean more toward heavy storage requirements, the company also offers a wide range of densely packed PCIe storage enclosures.
SCINet Network Operations Center At SuperComputing 2017
The network infrastructure required to run a massive tradeshow may not seem that exciting, but Supercomputing 17's SCinet network is an exception. The collection of eight racks consisted of $66 million in hardware and software from 31 contributors. The network provided 3.63 Tb/s of WAN bandwidth to multiple locations around the world and required 60 miles of fiber, 2,570 fiber patches, and 280 wireless access points. The final SCinet network was the culmination of a year of planning and two weeks of assembly by 180 volunteers. Tearing down is easier than building up, though: the volunteers were able to disassemble the entire network in one day.
The Stratix 10 Development Board
Intel announced its Stratix 10 FPGAs back in October 2016. The new FPGA comes with a quad-core ARM Cortex-A53 processor, though we imagine that will be swapped out for Intel cores in future models. The FPGA is also available with embedded HBM2 in the package.
We spotted a development board at the Reflex CES booth. Reflex CES designs custom embedded solutions and is working on a Stratix 10 GX/SX design that features three onboard DDR4 banks. It currently has three models shipping.
The beefier Intel Stratix 10 MX models come with HBM2 in the package, so Intel can mix and match components. That's a key advantage of Intel's new EMIB (Embedded Multi-Die Interconnect Bridge). This technology (deep dive here) made its debut on the Stratix 10 line of FPGAs and also serves as one of the key technologies enabling Intel's surprising use of AMD's graphics on its next-gen H-Series mobile processors.
The development board obviously requires quite a bit of cooling, hence the sizeable passive heatsink on the device. Like most server components, it relies on the high rate of linear airflow inside a server to stay comfortably within its normal operating temperature.
Intel's Arria 10 Solution
Speaking of Intel's FPGAs, the company announced last month that it would bring its Programmable Acceleration Card to market with the Arria 10 GX FPGA. The company had the card on display in its booth, but it will eventually make its way into OEM systems from Dell EMC, among others. The company said the card will become available in the first half of 2018.
The Arria 10 FPGA features 1.15 million logic elements. The card connects to the host via a PCIe 3.0 x8 connection and comes with 8GB of onboard DDR4 memory and an undisclosed amount of flash storage. It also features an integrated 40Gb/s "capable" QSFP+ interface for networking.
Xilinx FPGA And EPYC Server
Xilinx had several demos of FPGA-enabled workloads and systems in action. Of course, the single-socket EPYC server paired with four VU9P Virtex UltraScale+ FPGAs caught our eye. This system serves as a perfect example of the capabilities borne of EPYC's burly PCIe connectivity. Memory capacity and bandwidth are also limiting factors for many workloads, so EPYC's 145GB/s of memory bandwidth, eight DDR4 channels, and 2TB of memory capacity in a single-socket server make it a good fit for a diverse range of workloads.
We took a deeper look at the system in our "Xilinx Pairs AMD's EPYC With Four FPGAs" article.
Xilinx Facial Detection Demo
AI workloads are spreading out to the edge for many purposes, such as facial detection and encoding in surveillance systems. Xilinx had an interesting demo that showed a working system in action.
The Mini Data Center
If there's one thing we love, it's mini systems. Packing a ton of compute in a small space is the ultimate goal for all forms of computing, so the DOME hot water-cooled microDataCenter stood out from the pack.
The joint IBM/DOME project has created a series of hot-swappable cards that serve many purposes. You can pick from ARM Cortex-A72, Kintex-7 FPGAs, PowerPC-64, Xeon-Ds, or Xavier GPU cards (several pictured at the front of the display) to mix-and-match compute capabilities in a truly miniature data center (the mouse in the upper right serves as a good sense of scale). Here we see 32 cards crunching away with various processors. There are also dedicated NVMe storage cards available, but networking is built into the host PCB.
Water and power flow through both sides of the enclosure. The group claimed that two of these mini racks, which drop into a standard 2U server rack, can offer as much performance as 64 2U nodes. The result? 50% less energy consumption in 90% less space. We've grossly oversimplified the project, but look to these pages for more coverage soon.
The 750-Node Raspberry Pi Cluster
The Los Alamos National Laboratory's HPC Division has deployed 750-node Raspberry Pi clusters for development purposes. The lab is using five racks of BitScope Cluster Modules, with 150 Raspberry Pi units each, to build a massively parallel cluster featuring 750 four-core processors for a total of 3,000 cores. The ARM-based system sips a mere 4,000W under full load. Unfortunately, the lab did not have a working demo of its cluster at the show, but we've got a much smaller Raspberry Pi cluster demonstration on the following page.
University of Edinburgh Homemade Raspberry Pi Cluster
The University of Edinburgh's homemade Raspberry Pi cluster isn't as expansive as the 3,000-core behemoth on the preceding slide, but it serves as a good illustration of how 18 low-power (not to mention cheap) Raspberry Pis can exploit parallel computing to perform complex operations, in this case designing and testing an aircraft wing.
The university built the Wee Archie Raspberry Pi cluster to teach the basics of supercomputing and open sourced the design here.
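Teaching clusters like this one typically introduce parallel computing through message passing. As a rough illustration of the idea (our own sketch using the mpi4py library, not the university's course material), here's a minimal MPI program that splits a trivial workload across the Pi nodes:

```python
# Minimal MPI example of the sort a teaching cluster like Wee Archie might run.
# Assumes mpi4py is installed on each node; launch with e.g. `mpirun -n 72 python pi_sum.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID
size = comm.Get_size()   # total number of processes across the cluster

# Each rank sums its own slice of a large range, then the results are reduced to rank 0.
chunk = 10_000_000
local_sum = sum(range(rank * chunk, (rank + 1) * chunk))
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print(f"{size} ranks computed a total of {total}")
```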
Nvidia DGX-1
Nvidia's DGX-1 systems, which we've covered several times, are available to the public, but Jensen Huang's company is also using them as the building block for an update to its SaturnV supercomputer. The company claimed it should land within the top ten on the Top500 supercomputer list, and perhaps even break into the top five.
The SaturnV supercomputer will be powered by 660 Nvidia DGX-1 nodes (more detail on the nodes here), spread over five rows. Each node houses eight Tesla V100s for a total of 5,280 GPUs. That powers up to 660 PetaFLOPS of FP16 performance and a peak of 40 PetaFLOPS of FP64. The company plans to use SaturnV for its own autonomous vehicle development programs.
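Those totals check out against per-GPU rates of roughly 125 FP16 Tensor TFLOPS and 7.5 FP64 TFLOPS, in line with Nvidia's published V100 peaks (our arithmetic):

```python
# Sanity check of the SaturnV totals quoted above, using approximate V100 peak rates.
NODES = 660
GPUS_PER_NODE = 8
FP16_TFLOPS_PER_GPU = 125      # Tensor Core peak
FP64_TFLOPS_PER_GPU = 7.5

gpus = NODES * GPUS_PER_NODE                 # 5,280 Tesla V100s
print(gpus)
print(gpus * FP16_TFLOPS_PER_GPU / 1000)     # ~660 PetaFLOPS FP16
print(gpus * FP64_TFLOPS_PER_GPU / 1000)     # ~40 PetaFLOPS FP64
```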
Nvidia Workstation
And finally, Nvidia's DGX Station. This good-looking box comes with four water-cooled Tesla V100s and can churn out 480 TeraFLOPS of FP16 performance. The system features 64GB of GPU memory and a total of 20,480 CUDA cores. Throwing in 256GB of DDR4 memory and three 1.92TB SSDs in RAID 0 sweetens the deal, but only for those brave-hearted folks who actually trust RAID 0 to store anything. In any case, Nvidia includes a separate 1.92TB SSD for the operating system, and dual 10Gb LAN rounds out the system.
Your power bill might shoot up--the system consumes 1,500W under full load--but it won't seem as bad after you pay $69,000 for this stunner.