Use Custom Processor Architecture and Interconnects
In the earlier days of supercomputers, great effort was expended on custom, high-performance architectures. The early Cray computers were brilliant examples of that philosophy. It wasn't that fast processors were difficult to design; it was the packaging that required innovation. It takes about a nanosecond for an electrical signal to traverse one foot of wire. Components had to be packed closely together for best performance. This is still true today, but we have the enormous advantage of the density allowed by modern microprocessors.
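To put the nanosecond-per-foot rule of thumb in perspective, here is a rough back-of-the-envelope sketch in Python. The 1 ns per foot figure comes from the paragraph above; the clock rates and wire lengths are illustrative assumptions, not measurements of any particular machine.

```python
# Rule of thumb from the text: about 1 ns of delay per foot of wire.
NS_PER_FOOT = 1.0  # approximate, assumed constant for this sketch


def cycles_lost(wire_feet: float, clock_ghz: float) -> float:
    """Clock cycles that elapse while a signal crosses the wire one way."""
    delay_ns = wire_feet * NS_PER_FOOT
    cycle_ns = 1.0 / clock_ghz  # one cycle at 1 GHz lasts 1 ns
    return delay_ns / cycle_ns


if __name__ == "__main__":
    # Hypothetical wire lengths (feet) and clock rates (GHz).
    for feet in (0.1, 1.0, 10.0):
        for ghz in (0.08, 3.0):
            print(f"{feet:5.1f} ft at {ghz:4.2f} GHz: "
                  f"{cycles_lost(feet, ghz):6.2f} cycles")
```

Under these assumptions, a ten-foot cable costs a 3 GHz processor about thirty cycles each way, which is why tight packaging still matters.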
Every once in a while, good ideas pop up to increase the performance of commodity microprocessors. These ideas usually take the form of auxiliary processors, or of an existing specialized processor used to off-load the computational burden. While this technique has good short-term benefits, its data formats often differ from standardized library formats. This can lead to programs and data files that are difficult to convert when the auxiliary processor is upgraded.
IBM's Blue Gene is currently the fastest supercomputer.
Modern microprocessor manufacturers do a pretty good job of incrementally increasing performance. They do so by upgrading the processor's architecture or by raising its clock speed. Four-core processors are a reality now, with eight-core processors due soon. Recent architectural improvements include multiple built-in uberfast array processors, or large arrays of simple processor elements. Everyone is working on architectural improvements that allow faster processing at reduced power. Multicore processors facilitate multithreading, and array processors facilitate vector codes. Both are important for supercomputer performance.
It is almost impossible to design, tape out and test your own great idea and keep ahead of the commodity processor manufacturers. The best that can be done without huge capital investments is to convince the commodity microprocessor companies to continue performance improvements through architectural updates as well as silicon technology turns.
Interconnect fabrics can be large or small, fast or pokey. The best-performing fabric allows all processors to communicate with each other simultaneously. These are called fat fabrics. Other fabrics have single point-to-point connections that don't allow much concurrent communication. Creating a fabric topology is an exercise in 3D visualization: the goal is the best performance from a realistic number of switches.
Processors need to interconnect in a way that minimizes the number of hops through switches while keeping access time low. The number of connections also needs to be kept as small as the performance range you need will allow. Too few interconnections will clog up the system; too many just sit around waiting for something to transfer, wasting resources and ultimately money. It's like predicting the number of cell phone minutes you will use. Too few minutes cost you money after the fact; too many cost you money before the fact. A balance is hard to find, and it changes from one application load to the next.
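The trade-off between hop count and number of connections can be made concrete with a small sketch. The code below is a simplified model, not a design tool: it treats every processor as a node with direct links and no separate switches, and the three topologies (ring, 3D torus, fully connected) and all function names are illustrative choices rather than anything prescribed by the text.

```python
from collections import deque
from itertools import product


def ring(n):
    """Each node links to its two neighbours on a ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def full_mesh(n):
    """Every node links directly to every other node."""
    return {i: set(range(n)) - {i} for i in range(n)}


def torus3d(k):
    """k x k x k grid with wrap-around links in each dimension."""
    nodes = list(product(range(k), repeat=3))
    adj = {v: set() for v in nodes}
    for x, y, z in nodes:
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            w = ((x + dx) % k, (y + dy) % k, (z + dz) % k)
            adj[(x, y, z)].add(w)
            adj[w].add((x, y, z))
    return adj


def link_count(adj):
    """Number of undirected links (each appears twice in adj)."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


def mean_hops(adj):
    """Average shortest-path length over all node pairs, via BFS."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


if __name__ == "__main__":
    for name, adj in (("ring", ring(64)),
                      ("3D torus", torus3d(4)),
                      ("full mesh", full_mesh(64))):
        print(f"{name:10s} links={link_count(adj):5d} "
              f"mean hops={mean_hops(adj):.2f}")
```

For 64 nodes in this model, the full mesh buys a one-hop path at the price of 2016 links, the ring needs only 64 links but averages about 16 hops, and the 4 x 4 x 4 torus sits in between with 192 links and roughly three hops. Which point on that curve is right depends on the application load.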
Good fabric design can route data using many different paths, analogous to the Internet. This is good news when processor allocation algorithms tend to bunch up threads in small groups of processors. This is bad news if you don't have very good diagnostic tools for the fabric. Finding faulty fabric paths can be an exercise in the blind leading the blind unless some trace-back facility is available. Many fabric designs, high-performance switches, and adapters can be classified as commodity parts. For the same reason that it isn't a good idea to design your own microprocessor, don't design your own custom fabric. It isn't worth it.
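As a minimal sketch of what routing over many different paths means, the following Python finds a shortest route through a tiny fabric and reroutes when a link is marked as failed. The four-node square fabric, the node names, and the `failed` parameter are all hypothetical, and breadth-first search stands in here for whatever routing algorithm a real fabric uses.

```python
from collections import deque

# Hypothetical four-node fabric wired as a square: A-B-C-D-A.
FABRIC = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "A"],
}


def shortest_path(adj, src, dst, failed=frozenset()):
    """BFS route from src to dst that avoids links listed in `failed`.

    `failed` is a set of frozenset({u, v}) pairs, one per broken link.
    Returns the list of nodes on the route, or None if no route is left.
    """
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:  # walk back to the source
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev and frozenset((u, v)) not in failed:
                prev[v] = u
                queue.append(v)
    return None


if __name__ == "__main__":
    print(shortest_path(FABRIC, "A", "C"))
    broken = {frozenset(("A", "B"))}
    print(shortest_path(FABRIC, "A", "C", failed=broken))
```

The returned node list doubles as a crude trace-back: when a transfer misbehaves, it names exactly which links the data crossed. Note that the reroute also hides the broken link from the application, which is why a fabric without diagnostics can degrade silently.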
If you want to keep supercomputers out of the office, use custom processors and fabrics, especially those of your own design. This provides the crippling benefit of high development and maintenance costs. However, the best reason comes after three years or so, when you need even greater performance. Then your system needs to be architecturally and technically updated, if for no other reason than to keep up with the commodity people. The original billion-dollar investment will soon pale next to the cost of producing ongoing technology turns.