23 Years Of Supercomputer Evolution

23 Years Of Supercomputer Innovation

Update, 6/29/16, 12:25pm PT: Added information on the Sunway TaihuLight, which is now the undisputed fastest supercomputer in the world.

June 1993: CM-5/1024

The TOP500 ranking of supercomputers was first published in June 1993. At that time, the most powerful computer in the world was a CM-5 manufactured by Thinking Machines and located at Los Alamos National Laboratory, which the University of California managed for the US Department of Energy.

The CM-5 / 1024 was composed of 1024 SuperSPARC processors operating at 32 MHz. The theoretical computational power of this system was 131 GFlops, but it reached less than half of that (59.7 GFlops) under the Linpack benchmark used to determine the TOP500 rankings. The CM-5 also served another purpose in 1993, when it was chosen by Steven Spielberg's production team to "embody" the brain of the control room in the film Jurassic Park (the five black towers with red lights).

June 1994: XP/S 140

In June 1994, the CM-5 was dethroned by the Intel XP/S 140 Paragon. This supercomputer, purchased by Sandia National Laboratories in New Mexico, employed 3,680 Intel i860 XP processors. The i860 was one of the few Intel-made chips to implement a RISC instruction set. It was innovative for its time, incorporating a 32-bit arithmetic unit and a 64-bit floating point unit (FPU). Each processor had access to 32 32-bit registers, which could also be used as 16 64-bit registers or 8 128-bit registers. The FPU's instruction set also included SIMD-type instructions that laid the foundation for the future MMX instruction set used in Intel's Pentium line of products.

Each i860 XP processor, designed to run at 40 to 50 MHz, delivered a peak of 0.05 GFlops of computational power. The theoretical power of the XP/S 140 was 184 GFlops, and in practice it reached 143.4 GFlops in Linpack.
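
The gap between theoretical and measured performance quoted for these early machines is simple arithmetic. The following is a minimal Python sketch that uses only figures from the text above; the function names are illustrative and do not come from any TOP500 tooling.

```python
def theoretical_peak_gflops(processors: int, gflops_per_processor: float) -> float:
    """Theoretical peak: every processor delivering its maximum at once."""
    return processors * gflops_per_processor


def linpack_efficiency(linpack_gflops: float, peak_gflops: float) -> float:
    """Fraction of the theoretical peak actually delivered under Linpack."""
    return linpack_gflops / peak_gflops


# XP/S 140: 3,680 i860 XP processors at 0.05 GFlops each.
xps_peak = theoretical_peak_gflops(3680, 0.05)
xps_eff = linpack_efficiency(143.4, xps_peak)

# CM-5/1024: the article quotes the theoretical peak directly (131 GFlops).
cm5_eff = linpack_efficiency(59.7, 131.0)

print(f"XP/S 140 peak: {xps_peak:.0f} GFlops, efficiency {xps_eff:.0%}")
print(f"CM-5/1024 efficiency: {cm5_eff:.0%}")
```

On these numbers the Paragon delivered roughly 78 percent of its theoretical peak, against about 46 percent for the CM-5.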

November 1994: Japan Takes The Win

In November 1994, Japan replaced the US on top of the TOP500 with the Numerical Wind Tunnel, a supercomputer manufactured by Fujitsu for the National Aerospace Laboratory of Japan.

This machine marked a change in tactics from the world's previous most powerful supercomputers, in that it drew its power from only 140 processors, which were vector rather than scalar designs. Each processor was composed of 121 individual chips, arranged in an 11 x 11 matrix, and each chip had a dedicated function. Each processor also contained four independent pipelines and was capable of handling two multiply-add instructions per cycle. A single "processor" consumed 3,000 W and required water cooling.

Running at 105 MHz, these processors were particularly well suited to simulating fluid flow. Each CPU delivered a theoretical 1.7 GFlops of computational power. This added up to 238 GFlops of theoretical processing power, making the Numerical Wind Tunnel the first computer to break the 200 GFlops barrier, although its performance in Linpack was lower (124 GFlops, then 170 GFlops, and finally 192 GFlops).
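
One way to see where the 1.7 GFlops per processor could come from is to multiply the clock speed by the number of floating-point operations completed per cycle. The sketch below rests on two assumptions that the article does not state: that the two multiply-add instructions per cycle apply to each of the four pipelines, and that one multiply-add counts as two floating-point operations.

```python
CLOCK_HZ = 105e6         # clock speed quoted in the article
PIPELINES = 4            # independent pipelines per processor
MADDS_PER_PIPELINE = 2   # assumption: two multiply-adds per cycle in each pipeline
FLOPS_PER_MADD = 2       # assumption: a multiply-add counts as two operations
PROCESSORS = 140

flops_per_cycle = PIPELINES * MADDS_PER_PIPELINE * FLOPS_PER_MADD
per_processor_gflops = CLOCK_HZ * flops_per_cycle / 1e9
system_gflops = PROCESSORS * per_processor_gflops

print(f"{flops_per_cycle} flops per cycle")
print(f"{per_processor_gflops:.2f} GFlops per processor")
print(f"{system_gflops:.1f} GFlops for the full system")
```

Under these assumptions each processor lands at 1.68 GFlops, which rounds to the quoted 1.7 GFlops. The 238 GFlops system figure follows from the rounded value (140 x 1.7); the unrounded product is slightly lower.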

June 1996: Hitachi Beats Fujitsu

In June 1996, Japan strengthened its standing in the TOP500 by introducing the SR2201 / 1024, a supercomputer built by Hitachi for the University of Tokyo. This new system surpassed Fujitsu's Numerical Wind Tunnel, giving Japan the top two spots of the TOP500 and dropping the U.S. into third.

Unlike the Numerical Wind Tunnel, this system reverted to the use of scalar processors, and utilized the HARP-1E CPU based on the PA-RISC 1.1 architecture. The SR2201 / 1024 contained a total of 1024 of these CPUs clocked at 150 MHz, each theoretically capable of 300 MFlops, giving the SR2201 / 1024 a combined theoretical peak of roughly 300 GFlops. The HARP-1E also introduced a mechanism called Pseudo Vector Processing to preload data directly into CPU registers without going through the cache. Thanks to this feature, among other things, the performance of the SR2201 / 1024 was exceptional for its time. Under Linpack, the SR2201 / 1024 reached 232.4 GFlops, 72% of its theoretical power.

June 1997: The Teraflop Threshold Is Vanquished

To take back technological leadership from Japan, the United States launched the Accelerated Strategic Computing Initiative (ASCI) in 1992. The first successful project of this program was the ASCI Red, a supercomputer built by Intel for Sandia National Laboratories, the same facility that owned the Intel XP/S 140. The ASCI Red impressed people around the world, as it was the first computer in history to cross the teraflop barrier.

With its 7,264 Pentium Pro processors operating at 200 MHz, it possessed a theoretical 1.453 TFlops of computational power and generated 1.068 TFlops under Linpack. The ASCI Red was one of the first supercomputers to use mass-produced components, and thanks to its modular and scalable architecture, it stayed listed in the TOP500 for 8 years.

June 1998: ASCI Red 1.1

In June 1998, ASCI Red was expanded to incorporate an additional 1888 Pentium Pro processors. Although it took the lead on the TOP500 in 1997, at that time it was only 75 percent complete. Now finished, with 9152 Pentium Pro CPUs clocked at 200 MHz, the system was theoretically capable of 1830 GFlops, and managed to reach 1338 GFlops under Linpack.

June 1999: ASCI Red 2.0

In 1999, Intel updated the ASCI Red by replacing the older Pentium Pro processors with Pentium II OverDrive CPUs, which used the Socket 8 interface. In addition to a refined architecture and a higher clock speed (333 MHz on the Pentium II OverDrive versus 200 MHz on the Pentium Pro), Intel took this opportunity to increase the number of CPUs from 9152 to 9472. These improvements multiplied the theoretical computational power of the ASCI Red by a factor of 1.7, pushing it past 3.1 TFlops. In practice, the system achieved only about two-thirds of its theoretical performance, topping out at 2.121 TFlops.

June 2000: ASCI Red 2.1

After its rise to the top, the ASCI Red would stay at the head of the TOP500 for another three years. Eventually the system would see another increase in CPU count, climbing to a total of 9,632 processors. Its theoretical performance would top out at 3.207 TFlops, and under the Linpack test it would ultimately achieve 2.379 TFlops of computational power. In its final configuration, the ASCI Red occupied an area of 230 square meters and consumed 850 kilowatts of power, not including the energy required for cooling. The ASCI Red would remain in operation and on the TOP500 as one of the world's fastest supercomputers until it was retired in 2005, then decommissioned in 2006.
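
The theoretical figures quoted for each ASCI Red configuration can be reproduced from processor count and clock speed alone. This sketch assumes one floating-point operation per processor per cycle, a value inferred from the quoted peaks rather than stated in the article; the `peak_tflops` helper is hypothetical.

```python
# (TOP500 list, processor count, clock in MHz), all taken from the article.
CONFIGS = [
    ("June 1997", 7264, 200),
    ("June 1998", 9152, 200),
    ("June 1999", 9472, 333),
    ("June 2000", 9632, 333),
]

FLOPS_PER_CYCLE = 1  # assumption inferred from the quoted theoretical peaks


def peak_tflops(processors: int, clock_mhz: float) -> float:
    """Theoretical peak in TFlops for a system of identical processors."""
    return processors * clock_mhz * 1e6 * FLOPS_PER_CYCLE / 1e12


peaks = {label: peak_tflops(n, mhz) for label, n, mhz in CONFIGS}
for label, tflops in peaks.items():
    print(f"{label}: {tflops:.3f} TFlops")

# Growth in theoretical peak from the 1998 to the 1999 configuration.
upgrade_factor = peaks["June 1999"] / peaks["June 1998"]
print(f"1998 to 1999 upgrade factor: {upgrade_factor:.2f}")
```

With that assumption the four configurations come out at about 1.453, 1.830, 3.154 and 3.207 TFlops, in line with the figures given for each list.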

June 2001: ASCI White

Eventually the ASCI Red was dethroned by a supercomputer specifically designed to replace it: the ASCI White. This new supercomputer was installed at Lawrence Livermore National Laboratory. The system became operational at half strength in November 2000, and was completed in June 2001.

Unlike the ASCI Red, which was built by Intel, the ASCI White was IBM's chance to shine. ASCI White derived its power from 8192 IBM Power3 processors clocked at 375 MHz. ASCI White represented a new trend among supercomputers by adopting a cluster architecture, which is a collection of individual nodes connected together to work as a single system. Today, clustering is used by 85 percent of the supercomputers listed on the TOP500.

ASCI White was made up of 512 RS/6000 SP servers, each containing 16 CPUs. Each CPU was capable of 1.5 GFlops of processing power, which made ASCI White theoretically capable of reaching 12.3 TFlops. Its real-world performance was considerably lower, only reaching 7.2 TFlops under Linpack (7.3 TFlops from 2003).

ASCI White required 3,000 kW of power to operate, with an additional 3,000 kW consumed by the cooling system.

June 2002: Earth Simulator

In June 2002, the TOP500 was shaken up by the Earth Simulator. Built at the Earth Simulator Center in Yokohama, the Earth Simulator ran circles around ASCI Red and ASCI White. The system managed to achieve 87.5 percent of its theoretical performance, landing at 35.86 TFlops in Linpack, roughly five times more than ASCI White was capable of. The Earth Simulator, dedicated to climate simulations, was constructed from specially designed NEC processors, each containing a 4-way superscalar unit and a vector unit. The system components were clocked at either 500 MHz or 1 GHz. Each CPU was capable of 8 GFlops of theoretical processing power and consumed 140 W. The Earth Simulator was organized into 640 nodes with 8 processors each, with each node consuming 10 kilowatts of power.
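
The Earth Simulator's headline numbers follow from its node layout. The sketch below works through them using only figures quoted in the article; the power total covers the nodes alone, since the article gives no figure for cooling or interconnect, and the variable names are illustrative.

```python
NODES = 640
CPUS_PER_NODE = 8
GFLOPS_PER_CPU = 8.0
NODE_POWER_KW = 10.0
LINPACK_TFLOPS = 35.86
ASCI_WHITE_LINPACK_TFLOPS = 7.2  # ASCI White's Linpack result, for comparison

total_cpus = NODES * CPUS_PER_NODE
peak_tflops = total_cpus * GFLOPS_PER_CPU / 1000
efficiency = LINPACK_TFLOPS / peak_tflops
total_node_power_kw = NODES * NODE_POWER_KW  # nodes only, cooling not included
speedup_over_asci_white = LINPACK_TFLOPS / ASCI_WHITE_LINPACK_TFLOPS

print(f"{total_cpus} CPUs, {peak_tflops:.2f} TFlops theoretical peak")
print(f"Linpack efficiency: {efficiency:.1%}")
print(f"Node power draw: {total_node_power_kw:.0f} kW")
print(f"Speedup over ASCI White: {speedup_over_asci_white:.2f}x")
```

The 5,120 CPUs give a theoretical peak of 40.96 TFlops, which is where the 87.5 percent efficiency and the "roughly five times" comparison with ASCI White come from.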

Michael Justin Allen Sexton is a Contributing Writer for Tom's Hardware US. He covers hardware component news, specializing in CPUs and motherboards.

Comments from the forums

adamovera

Archived comments are found here: http://www.tomshardware.com/forum/id-2844227/years-supercomputer-evolution.html
Reply
bit_user

SW26010 is basically a rip-off of the Cell architecture that everyone hated programming so much that it never had a successor.

It might get fast Linpack benchies, but I don't know how much else will run fast on it. I'd be surprised if they didn't have a whole team of programmers just to optimize Linpack for it.

I suspect Sunway TaihuLight was done mostly for bragging rights, as opposed to maximizing usable performance. On the bright side, I'm glad they put emphasis on power savings and efficiency.
Reply
bit_user

BTW, nowhere in the article you linked does it support your claim that:
the U.S. government restricted the sale of server-grade Intel processors in China in an attempt to give the U.S. time to build a new supercomputer capable of surpassing the Tianhe-2.

Reply
bit_user

aldaia only downvotes because I'm right. If I'm wrong, prove it.

Look, we all know China will eventually dominate all things. I'm just saying this thing doesn't pwn quite as the top line numbers would suggest. It's a lot of progress, nonetheless.

BTW, China's progress would be more impressive, if it weren't tainted by the untold amounts of industrial espionage. That makes it seem like they can only get ahead by cheating, even though I don't believe that's true.

And if they want to avoid future embargoes by the US, EU, and others, I'd recommend against such things as massive DDOS attacks on sites like github.

Reply
g00ey

18201362 said:
SW26010 is basically a rip-off of the Cell architecture that everyone hated programming so much that it never had a successor.

It might get fast Linpack benchies, but I don't know how much else will run fast on it. I'd be surprised if they didn't have a whole team of programmers just to optimize Linpack for it.

I suspect Sunway TaihuLight was done mostly for bragging rights, as opposed to maximizing usable performance. On the bright side, I'm glad they put emphasis on power savings and efficiency.

I'd assume that what "everyone" hated was having to maintain software (i.e. games) for very different architectures. Maintaining a game for PS3, Xbox 360 and PC, which all have their own architecture, apparently is more of a hurdle than if they were all Intel-based or whatever architecture you have. At least the Xbox 360 had DirectX...

In the heyday of PowerPC, developers liked it better than the Intel architecture, particularly assembler developers. Today, it may not have that "fancy" stuff such as AVX, SSE etc., but it probably is quite capable for computations. Benchmarks should be able to give some indications...
Reply
bit_user

18208196 said:
I'd assume that what "everyone" hated was to have to maintain software (i.e. games) for very different architectures.
I'm curious how you got from developers hating Cell programming, to game companies preferring not to support different platforms, and why they would then single out the PS3 for criticism, when this problem was hardly new. This requires several conceptual leaps from my original statement, as well as assuming I have even more ignorance of the matter than you've demonstrated. I could say I'm insulted, but really I'm just annoyed.

No, you're way off base. Cell was painful to program, because the real horsepower is in the vector cores (so-called SPEs), but they don't have random access to memory. Instead, they have only a tiny bit of scratch pad RAM, and must DMA everything back and forth from main memory (or their neighbors). This means virtually all software for it must be effectively written from scratch and tuned to queue up work efficiently, so that the vector cores don't waste loads of time doing nothing while data is being copied around. Worse yet, many algorithms inherently depend on random access and perform poorly on such an architecture.

In terms of programming difficulty, the gap in complexity between it and multi-threaded programming is at least as big as that separating single-threaded and multi-threaded programming. And that assumes you're starting from a blank slate - not trying to port existing software to it. I think it's safe to say it's even harder than GPU programming, once you account for performance tuning.

Architectures like this are good at DSP, dense linear algebra, and not a whole lot else. The main reason they were able to make it work in a games console is because most game engines really aren't that different from each other and share common, underlying libraries. And as game engines and libraries became better tuned for it, the quality of PS3 games improved noticeably. But HPC is a different beast, which is probably why IBM never tried to follow it with any successors.

18208196 said:
In the heydays of PowerPC, developers liked it better than the Intel architecture, particularly assembler developers. Today, it may not have that "fancy" stuff such as AVX, SSE etc but it probably is quite capable for computations. Benchmarks should be able to give some indications...
I'm not even sure what you're talking about, but I'd just point out that both Cell and the Xbox 360's CPU were derived from PowerPC. And PPC did have AltiVec, which had some advantages over MMX & SSE.

Reply
alidan

ah the Cell... listening to devs talk about it and an MIT lecture, the main problems with it were these:
1) sony refused to give out proper documentation, they wanted their games to get progressively better as the console aged, performance wise and graphically, so what better way than to kneecap devs
2) from what i understand about the architecture, and im not going to say this right, you had one core devoted to the os/drm, then you had the rest devoted to games and one core disabled on each to keep yields (something from the early days of the ps3), then you had to program the games while thinking of what core the crap executed on. all in all, a nightmare to work with.

if a game was made ps3 first it would port fairly well across consoles, but most games were made xbox first, and porting to ps3 was a nightmare.
Reply
bit_user

18224039 said:
2) from what i understand about the architecture, and im not going to say this right, you had one core devoted to the os/drm, then you had the rest devoted to games and one core disabled on each to keep yields (something from the early day of the ps3)
It's slightly annoying that they took a 8 + 1 core CPU and turned it into a 6 + 1 core CPU, but I doubt anyone was too bothered about that.

I think PS4 launched with only 4 of the 8 cores available for games (or maybe it was 5/8?). Recently, they unlocked one more. I wonder how many the PS4 Neo will allow.

18224039 said:
you had to program the games while thinking of what core the crap executed on, all in all, a nightmare to work with.

if a game was made ps3 first it would port fairly good across consoles, but most games were made xbox first, and porting to ps3 was a nightmare.
This is part of what I was saying. Again, the reason why it mattered which core was that the memory model was so restrictive. Each SPE plays in its own sandbox, and has to schedule any copies to/from other cores or main memory. Most multi-core CPUs don't work this way, as it's too much burden to place on software, with the biggest problem being that it prevents one from using any libraries that weren't written to work this way.

Now, if you write your software that way, you can port it to Xbox 360/PC/etc. by simply taking the code that'd run on the SPEs and putting it in a normal userspace thread. The DMA operations can be replaced with memcpy's (and, with a bit more care, you could even avoid some copying).

Putting it in more abstract terms, the Cell strictly enforces a high degree of data locality. Taking code written under that constraint and porting it to a less constrained architecture is easy. Going the other way is hard.

Reply
alidan

as for turning cores off, that's largely to do with yields. later on in the console's cycle they unlocked cores, so some systems that weren't bad chips got slightly better performance in some games than other consoles. at least on ps3, that meant all of nothing, as it had a powerful cpu but the gpu was bottlenecking it, opposite of the 360 where the cpu was bottlenecking that one, at least if i remember the systems right.
Reply
kamhagh

US :S doesn't let china to build a super computer? :s sounds childish
Reply
