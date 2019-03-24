The History of DOE Nuclear Supercomputers: From MANIAC to AMD's EPYC Milan

History of DOE Nuclear Supercomputers

The United States Department of Energy (DOE) has been at the forefront of supercomputing for seventy years and has largely paved the way for the entire supercomputing industry. To celebrate that legacy, the agency had several displays at its booth at the Supercomputing 2018 tradeshow.

Although the agency was only officially established in 1977, it sprang from the US government's Manhattan Project, which was originally managed by the Army Corps of Engineers. That project introduced the world to the first atomic bomb during the waning days of World War II. The agency is still responsible for designing, building, and testing nuclear weapons, but it also steers the country's energy research and development programs.

The DOE's work began back in 1945 (before the age of transistors) with the first machine used to study the feasibility of a nuclear weapon, the famed vacuum tube-equipped ENIAC (Electronic Numerical Integrator and Computer). That work progressed to custom and vector processors, like the famous Cray-1, in the years spanning 1960 to 1989.

From 1992 to 2005, the agency's supercomputers evolved to single-core processors and symmetric multiprocessing (SMP), while 2005 also rang in the first supercomputers with multi-core / SMP processors. But now we're in the age of accelerators, otherwise known as GPUs, that have taken the supercomputing world by storm.

The agency currently hosts the Summit and Sierra supercomputers that rank #1 and #2 in the world, respectively. And now, after 70 years, the agency is on the cusp of deploying the world's first exascale-class supercomputers. These exascale machines will perform upwards of one quintillion (a billion billion) calculations per second, a 1,000x improvement over the petascale supercomputers that were the previous benchmark for mind-bending compute power.

All told, the far-flung agency manages 17 National Labs that include such well-known names as the Los Alamos National Laboratory, Oak Ridge National Laboratory, and Sandia National Laboratories, among many others. These facilities house the most advanced supercomputers on the planet, but they all start with simple components.

The MANIAC

The ENIAC (Electronic Numerical Integrator and Computer) supercomputer gets all the attention in the history books, but lest we forget its brother with a much cooler name, the DOE had part of its MANIAC on display.

This module served as just one part of the MANIAC (Mathematical Analyzer, Numerical Integrator, and Computer) supercomputer that was built between 1949 and 1952 at the Los Alamos National Laboratory.

As a pre-transistor computer, this machine wielded 2,850 vacuum tubes and 1,040 diodes in the arithmetic unit alone. All told, it featured 5,190 vacuum tubes and 3,050 semiconductor diodes. This 1,000-pound computer generated precise calculations of the thermonuclear process and was one of the first computers based on the von Neumann architecture.

The CDC 6600

In 1966, the Control Data Corp. (CDC) 6600 was the fastest supercomputer on the planet with up to three megaFLOPS of performance. That's three million floating point operations per second. This freon-cooled supercomputer consisted of four cabinets with a single central CPU constructed of multiple PCBs. It also had ten peripheral processors that handled input and output operations. Seymour Cray, the father of the famous Cray supercomputers, designed the machine.

The Cray-1

The Cray-1 is perhaps one of the best-known supercomputers of all time, partially because of its C-shaped design and leather seating that earned it the distinguished moniker of "the world's most expensive love seat." This Cray-1 system, the first off the assembly line, made an appearance at the Supercomputing 2018 tradeshow. It was originally installed at the Los Alamos National Laboratory in 1976.

Unbeknownst to the casual observer, the speedy supercomputer was designed in a semi-circle to reduce the length of the wires connecting the individual components, which helped reduce latency and improve performance. The posh seating around the tower concealed the power delivery subsystem that fed the 115 kW machine. Overall, the machine weighed in at a beastly 10,500 pounds and consumed 41 square feet of floor space. The single freon-cooled processor sped along at 80 MHz and addressed 8.39 megabytes of memory spread across 16 banks. Storage weighed in at a mere 303 MB. This supercomputer delivered 160 megaFLOPS of performance, all for a mere $7.9 million.

Strap SANDAC On A Missile

What do you do if you need a supercomputer for missile navigation and guidance? You design and build one.

The Sandia National Labs Airborne Computer (SANDAC) supplied all the power of the Cray-1 supercomputer on the preceding page. But highlighting the speed of innovation during that time period, it packed all of that horsepower into a device the size of a shoebox that weighed less than 20 pounds. That's a big decrease in size from the 41-square-foot, 10,500 lb. Cray-1, but SANDAC arrived in 1985, nine years after the Cray machine.

SANDAC consisted of up to 15 custom-built MIMD processors from Motorola and offered up to 225 Million Instructions Per Second (MIPS) of performance. This was one of the first parallel computers, and the first parallel computer to fly on a missile. Like any supercomputer, it had its own memory and included modules for input and output operations. Durability was a key tenet of the design: it was built to ride in intercontinental ballistic missiles, after all. As such, the unit was designed to survive intense vibration, acceleration, and temperatures up to 190 degrees Fahrenheit.

The Thinking Machines

Thinking Machines Corporation's (TMC) Connection Machine CM-5 debuted in 1991. This board comes from a CM-5 machine, which wielded 1,024 SPARC RISC processors from Sun Microsystems workstations. This was one of the early supercomputers that paved the way for the massively parallel architectures found in modern supercomputers. These supercomputers were based on the doctoral research of Danny Hillis at MIT during his search for alternatives to the tried-and-true von Neumann architecture.

Pictured above is a single circuit board from the supercomputer. A CM-5 supercomputer sat atop the TOP500 list in 1993, meaning it was the fastest supercomputer in the world, with 131 gigaFLOPS of performance. CM-5 systems continued to dominate the top ten rankings for several years after that.

The Intel Paragon Node

Intel delivered its Paragon General-Purpose (GP) Node to the Oak Ridge National Laboratory in 1992. Each node carried two i860 XP processors (RISC), one dedicated solely to applications and the other to message processing. All told, the system featured 2,048 i860 processors spread across 1,024 of these nodes. Each node carried a whopping 16 MB of memory and a B-NIC interface that connected to mesh routers on the backplane. The entire system pushed out peak performance of 143.4 gigaFLOPS, landing it at the top of the TOP500 list in June 1994.

The Paragon XP/S node pictured here leveraged the now-ubiquitous x86 instruction set and was used for code development.

Intel Breaks the teraFLOPS Barrier with ASCI Red

The teraFLOPS (one trillion floating point operations per second) barrier loomed large for scientists as supercomputers evolved, but Intel and Sandia National Labs finally smashed that barrier in 1996.

The ASCI Red system consisted of 104 cabinets that consumed 1,600 square feet of floor space. Seventy-six of the cabinets were dedicated to housing 9,298 Pentium Pro processors clocked at a whopping 200 MHz (later upgraded to Pentium II Xeons running at 333 MHz).

The system had 1.2 TB of RAM dispersed throughout, along with eight cabinets dedicated to switching and 20 cabinets dedicated to disk storage. The storage subsystem delivered up to 1 GB/s of throughput courtesy of the Parallel File System (PFS) that was a key component to smashing the teraFLOPS barrier. The entire system consumed up to 850 kW of power.

The system was only 75 percent complete when it first broke the teraFLOPS barrier (as measured by LINPACK), but it later reached up to 3.1 teraFLOPS in 1999 after memory and processor upgrades. The system led the TOP500 from 1997 to 2000. This was the first system built under the Accelerated Strategic Computing Initiative (ASCI), a program designed to maintain the US nuclear arsenal after the halt of nuclear testing in 1992.

The Cray T3E-900 Compute Module

Here we see two modules from the "MCurie" Cray T3E-900 system that was installed at the Lawrence Berkeley National Laboratory in 1997. The MCurie system wielded 512 DEC Alpha 21164A microprocessors running at 450 MHz, though Cray's design scaled up to 2,176 processors. Each compute module housed up to 2 GB of DRAM and a six-way interconnect router.

For purposes of record keeping, floating point performance is typically measured with LINPACK. However, this largely synthetic test doesn't require data movement or factor in other system-level considerations, thus supplying only a rough estimate of peak compute power. A team from the Oak Ridge National Laboratory won the 1998 Gordon Bell prize for reaching 657 gigaFLOPS of performance while running actual applications. Later, the same team used a T3E at Cray's factory to reach 1.02 teraFLOPS of performance, marking the first time in history that a computer exceeded one teraFLOPS of sustained performance during a scientific application.

IBM Powers Its Way In

An IBM "Eagle" RISC System (RS)/6000 SP supercomputer with the Power3 architecture was installed at the Oak Ridge National Laboratory in 1999. This system had 124 processors that pushed out 99.3 gigaFLOPS, though it was later upgraded to a total of 704 processors. Here we can see a close-up shot of a Power3 processor that harnesses the power of 15 million transistors on a 270 mm² die. IBM itself fabbed the processor on its CMOS-6S2 process (roughly equivalent to 250 nm). The processor blasted along at 200 MHz, though its Power3-II successor brought that up to 450 MHz.

ASCI White

The ASCI White supercomputer was built from 512 of the IBM RS/6000 systems (previous slide) tied together. The system featured 16 IBM Power3 processors per node for a total of 8,192 processors running at 375 MHz, along with 6 TB of memory and 160 TB of disk-based storage. The entire system comprised three parts: the 512-node White, the 28-node Ice, and the 68-node Frost.

The 100-ton system was dedicated in 2001 and cost $110 million. The ASCI White supercomputer alone needed 3 MW of power, but another 3 MW was dedicated just to cooling the beast. The machine topped out at 12.3 teraFLOPS and was the fastest supercomputer on the TOP500 list from November 2000 to June 2002.
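
The processor count and peak figure quoted for ASCI White can be cross-checked with a little arithmetic. The Python sketch below is illustrative only: the node count, processors per node, and clock speed come from the text, while the figure of four floating point operations per cycle per Power3 processor (two fused multiply-add units) is an assumption the article does not state.

```python
# Back-of-the-envelope check of the ASCI White figures quoted in the article.
NODES = 512            # from the text
CPUS_PER_NODE = 16     # from the text
CLOCK_HZ = 375e6       # 375 MHz, from the text
FLOPS_PER_CYCLE = 4    # ASSUMPTION: two fused multiply-add units per Power3


def total_processors(nodes: int, cpus_per_node: int) -> int:
    """Processors in the whole machine."""
    return nodes * cpus_per_node


def peak_teraflops(processors: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: every processor retiring its maximum every cycle."""
    return processors * clock_hz * flops_per_cycle / 1e12


processors = total_processors(NODES, CPUS_PER_NODE)
peak = peak_teraflops(processors, CLOCK_HZ, FLOPS_PER_CYCLE)
print(f"{processors:,} processors, {peak:.1f} teraFLOPS peak")
# prints "8,192 processors, 12.3 teraFLOPS peak"
```

Under that assumption, the result lines up with the 12.3 teraFLOPS quoted above, which suggests the article's number is a theoretical peak rather than a measured LINPACK score.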

Big Blue's Blue Gene/L

IBM designed the Blue Gene/L to study protein folding and gene development. It pushed out a sustained 478.2 teraFLOPS of computing power, thus earning the top spot on the TOP500 list from November 2004 to June 2008.

The supercomputer featured compute cards with a single ASIC and DRAM memory chips affixed. Each ASIC housed two PowerPC 440 embedded processors that supplied up to 5.6 gigaFLOPS of performance per compute card. IBM's decision to use relatively slow embedded processors was born of a desire to rein in power consumption while exploiting the benefits of a massively parallel design.

Above you can see 16 of these compute cards slotted into a single node. The two additional cards handle I/O operations. This dense design allowed for up to 1,024 compute nodes to fit into a single 19-inch server rack. Blue Gene/L could scale up to 65,536 total nodes.

Cray's Red Storm

Popular supercomputer-builder Cray developed "Red Storm" in 2005. Modern supercomputing is all about data movement, and this design is most notable for its 3D mesh networking topology and its custom SeaStar chips that combined both a router and NIC onto the same silicon. The design used 10,880 off-the-shelf AMD Opteron single-core CPUs clocked at 2.0 GHz.

The original 140-cabinet system consumed 3,000 square feet of floor space and peaked at 36.19 teraFLOPS. As time progressed, Red Storm scaled up via faster 2.4 GHz dual-core AMD Opterons and scaled out with another row of cabinets, thus totaling 26,000 processing cores in a single machine. That yielded a peak of 101.4 teraFLOPS of performance, but a subsequent upgrade to quad-core Opterons and 2 GB of memory per core brought peak performance up to 204.2 teraFLOPS.

The Red Storm (Cray XT3) held a spot in the top 10 of the TOP500 list from 2006 to 2008.

The Hopper

Cray's XE6 "Hopper" came with 12,768 AMD Magny-Cours chips. Each of those chips came with 12 cores, making this a 153,216-core supercomputer. Hopper was the first petaFLOPS supercomputer in the DOE Office of Science's arsenal.

The system had 6,384 of the nodes pictured above, with each node featuring two of the AMD Magny-Cours processors and 64 GB of DDR3 SDRAM memory. Up to 3,072 cores fit into a single cabinet. Unlike the Red Storm, this machine moved from the SeaStar networking chips to new Gemini router ASICs.

Hopper was listed in fifth place in the November 2010 TOP500 list with a peak speed of 1.05 petaFLOPS.

Blue Gene/Q Sequoia

IBM crams 1,572,864 cores into its 96-rack Blue Gene/Q "Sequoia" system. This system has 1.6 petabytes of memory spread across 98,304 compute nodes. The system uses IBM's 16-core A2 processors that run at 1.6 GHz. Each A2 core runs four threads simultaneously, meaning this is an SMT4 processor.
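
The Sequoia figures above hang together, and a few more can be derived from them. The short Python sketch below is only a cross-check: every input comes from the text, except that it assumes "1.6 petabytes" uses decimal units, which the article does not specify.

```python
# Derived figures for Sequoia, using only the numbers quoted in the article.
NODES = 98_304
CORES_PER_NODE = 16      # one 16-core A2 processor per compute node
THREADS_PER_CORE = 4     # SMT4
RACKS = 96
MEMORY_BYTES = 1.6e15    # ASSUMPTION: 1.6 petabytes in decimal units

total_cores = NODES * CORES_PER_NODE
hardware_threads = total_cores * THREADS_PER_CORE
nodes_per_rack = NODES // RACKS
memory_per_node_gb = MEMORY_BYTES / NODES / 1e9

print(f"cores:            {total_cores:,}")
print(f"hardware threads: {hardware_threads:,}")
print(f"nodes per rack:   {nodes_per_rack:,}")
print(f"memory per node:  about {memory_per_node_gb:.0f} GB")
```

The core count reproduces the 1,572,864 quoted above, and the per-rack figure works out to exactly 1,024 nodes.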

As we can see from the copper tubing snaking throughout the ASC Sequoia node, this system relies heavily upon water cooling. The Blue Gene/Q spent time on the Green500 list, meaning it was one of the most power-efficient supercomputers at the time. Even though the system is orders of magnitude faster than the Blue Gene/L we covered on a preceding slide, it is 17 times more power efficient.

In 2013, the system set a record of 504 billion events processed per second, easily beating the earlier record of 12.2 billion events per second. To put that in perspective, the Lawrence Livermore National Laboratory, which houses Sequoia, equates one hour of the supercomputer’s performance to all 6.7 billion people on earth using calculators and working 24 hours per day, 365 days per year, for 320 years.

Blue Gene/Q took the crown as the fastest supercomputer in the world in 2012 with 17.17 petaFLOPS of performance.

Reaching the Summit - The World's Fastest Supercomputer

The age of accelerators is upon us. Or, as we call them, GPUs. Most of the new compute power added to today's supercomputers comes from GPUs, while the CPUs have now fallen to more of a host role.

Here we see an Nvidia Volta GV100 GPU that's used in the Summit supercomputer at Oak Ridge. This 4,600-node supercomputer helped the United States retake the supercomputing lead from China last year, and it is currently the fastest supercomputer in the world. It comes with 2,282,544 compute cores and pushes up to 187,659 teraFLOPS.

We've covered the Summit supercomputer node in a separate deep dive, so we won't go too far into the details here. Just know that it comes equipped with 27,000 Nvidia Volta GPUs and 9,000 IBM Power9 CPUs tied together at the node level via native NVLink connections. It also has other leading-edge tech, like PCIe 4.0 and persistent memory.

Just What Is This?

We aren't sure what this is, but it could be a prototype node based on Intel's now-canceled Knights Hill. That 10nm processor was destined to appear in the DOE's Aurora supercomputer, but Intel canceled the product with little fanfare. The DOE later announced that Aurora would feature an "advanced architecture" for machine learning but hasn't provided specifics.

Much like Intel's Xeon Scalable processors, Knights Mill featured an integrated Omni-Path connection that interfaced with cabling attached directly to the chip, hence the need for an open-ended socket like we see on this unnamed node. We inquired with DOE representatives about the mystery node, but to no avail. This could simply be a socket LGA 3647 system built to house Intel's Xeon Scalable processors, which can also sport integrated Omni-Path controllers and six-channel memory, but perhaps it is something more.

Astra - The First Petascale ARM Supercomputer

The DOE has almost finished building its new Astra supercomputer, which is the first petascale-class ARM-based supercomputer. This supercomputer features 36 racks of compute that come with 18 quad-node HPE Apollo 70 chassis per rack. That equates to 2,592 compute nodes in total.

Each node has two 28-core Cavium ThunderX2 ARM SoCs running at 2.0 GHz for a total of 5,184 CPUs and 145,152 cores. Each socket supports eight memory channels, for a total of 128 GB of memory per node. That leads to an aggregate capacity of 332 TB and bandwidth of 885 TB/s.
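
The Astra totals follow from multiplying out the rack, chassis, node, and socket counts. The Python sketch below walks through that arithmetic. The counts and the 128 GB per node come from the text; the per-channel bandwidth is an assumption (DDR4-2666 at 8 bytes per transfer), chosen because it lands close to the quoted aggregate, and the article does not state the memory speed.

```python
# Reproducing the Astra totals from the per-rack and per-node figures.
RACKS = 36
CHASSIS_PER_RACK = 18
NODES_PER_CHASSIS = 4            # "quad-node" HPE Apollo 70 chassis
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 28
MEMORY_PER_NODE_GB = 128
CHANNELS_PER_SOCKET = 8
# ASSUMPTION: DDR4-2666, i.e. 2,666 million transfers/s at 8 bytes each.
CHANNEL_BANDWIDTH_GBS = 2666e6 * 8 / 1e9

nodes = RACKS * CHASSIS_PER_RACK * NODES_PER_CHASSIS
cpus = nodes * SOCKETS_PER_NODE
cores = cpus * CORES_PER_SOCKET
memory_tb = nodes * MEMORY_PER_NODE_GB / 1000
bandwidth_tbs = cpus * CHANNELS_PER_SOCKET * CHANNEL_BANDWIDTH_GBS / 1000

print(f"nodes: {nodes:,}  CPUs: {cpus:,}  cores: {cores:,}")
print(f"memory: roughly {memory_tb:.0f} TB")
print(f"aggregate bandwidth: roughly {bandwidth_tbs:.0f} TB/s")
```

The node, CPU, and core counts match the article exactly, and the memory total rounds to the quoted 332 TB. The bandwidth figure depends entirely on the assumed memory speed, so treat it as a plausibility check rather than a confirmation.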

The incomplete system has already broken the 1.5 petaFLOPS barrier, and upon completion, it is projected to hit 2.3 petaFLOPS or more. Other notable inclusions include the 244 GB/s all-flash storage system that eliminates the need for costly and complex burst buffers that absorb outgoing data traffic.

As expected from an ARM-based system, low power consumption is a key consideration. The 36 compute racks consume 1.2 MW, but only 12 fan coils are needed to cool them thanks to the system's advanced liquid-based Thermosyphon hybrid cooler.

AMD's Milan to Power Perlmutter

It didn't take long for us to home in on Cray's Shasta supercomputer. This new system comes packing AMD's unreleased EPYC Milan processors. Those are AMD's next-next generation data center processors. The new supercomputer will also use Nvidia's "Volta-Next" GPUs, with the two combining to make an exascale-class machine that will be one of the fastest supercomputers in the world.

The Department of Energy's forthcoming Perlmutter supercomputer will be built with a mixture of both CPU and GPU nodes, with the CPU node pictured here. This watercooled chassis houses eight AMD Milan CPUs. We see four copper waterblocks that cover the Milan processors, while four more processors are mounted inverted on the PCBs between the DIMM slots. This system is designed for the ultimate in performance density, so all the DIMMs are also watercooled.

Intel to Power Aurora, First Exascale Supercomputer

Intel and the U.S. Department of Energy (DOE) announced that Aurora, the world's first supercomputer capable of sustained exascale computing, would be delivered to the Argonne National Laboratory in 2021. Surprisingly, the disclosure includes news that Intel's not-yet-released Xe graphics architecture will be a key component of the new system, along with Intel's Optane Persistent DIMMs and a future generation of Xeon processors.

Intel and partner Cray will build the system, which can perform an unmatched quintillion operations per second (sustained). That's a billion billion operations, or one million times faster than today's high-end desktop PCs. The new system comprises 200 of Cray's Shasta systems and its "Slingshot" networking fabric.
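
To put those scale words side by side, the Python sketch below spells out the powers of ten involved. It is purely illustrative: the prefixes are standard SI values, the "one million times" ratio comes from the paragraph above, and the Summit figure is the 187,659 teraFLOPS quoted earlier in this article.

```python
# Scale comparison using standard SI prefixes and figures quoted in the article.
TERA = 10**12
PETA = 10**15
EXA = 10**18   # one quintillion, i.e. a billion billion

# Exascale versus petascale: the 1,000x step between machine classes.
exa_over_peta = EXA // PETA

# "One million times faster than today's high-end desktop PCs" implies
# a desktop delivering about this many operations per second.
implied_desktop_ops = EXA // 1_000_000

# Summit's quoted peak, for comparison with a one-exaFLOPS machine.
summit_peak_ops = 187_659 * TERA
exa_over_summit = EXA / summit_peak_ops

print(f"exa / peta: {exa_over_peta:,}x")
print(f"implied desktop: {implied_desktop_ops // TERA} teraFLOPS")
print(f"exa / Summit peak: about {exa_over_summit:.1f}x")
```

Read this way, the article's desktop comparison assumes a PC in the one-teraFLOPS range, and an exascale system would be roughly five times Summit's quoted peak.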

More importantly, the system leverages Intel's Xe graphics architecture. In its announcement, Intel said Xe will be used for compute functions, meaning it will be primarily used for AI computing. Aurora also comes armed with "a future generation" of Intel's Optane DC Persistent Memory using 3D XPoint that can be addressed as either storage or memory. This marks the first known implementation of Intel's new memory in a supercomputer-class system.

The DOE told us that the new system will be "stood up" in early 2021 and will be fully online at exascale compute capacity before the end of that year.

The Future is Exascale

The Department of Energy has a robust pipeline of future supercomputers in the works for several of its sites. Here's a list with links to the next-gen supercomputers.

Paul Alcorn

Paul Alcorn is a Senior Editor for Tom's Hardware US. He writes news and reviews on CPUs, storage and enterprise hardware.

  • alextheblue
    Quote:
    IBM itself fabbed the processor on its CMOS-6S2 process (roughly equivalent to 25nm).

    No. First off it was a hybrid process (0.25 µm feature sizes and 0.35 µm metal layers). Second, and more importantly, that's 250 nm. Its successor was built on a 0.22 µm process which is why they were able to boost the clocks so much.
  • PaulAlcorn
    Thanks for catching the typo, fixed!
  • stdragon
    For those that are unaware, these computers are used primarily due to post Nuclear-Test-Ban Treaties. It's to ensure the reliability of existing stockpiles of warheads as they age. For that, you need lots of number crunching.

    Basically, making sure our spears are strong and the arrowheads remain sharp. All without actually throwing a single one.
  • matheo25
    Sooo... My 2012 i5-3570k pushes out about as much (109GFLOPS) as the world's best computer of 1994 (143.4GFLOPS). I remember my 386 back then and it was not impressive at all, I cannot even begin to comprehend the power of these new supercomputers. Thanks for the article!
  • BillV523
    it looks like an ordinary mobo for a pc
  • BillV523
    seems can't edit last post, I meant of the photo with the sign under it asking
    "what is it?"
