Taming the Wild Supercomputer

Customize The Operating System

In the good ol' days of supercomputers (the '70s, for you youngsters), feature-rich, stable operating systems were hard to find, probably because they didn't exist. I know this for a fact, because I was in charge of one of the early operating systems. We did the best we could, but we were just feeling our way along. Anyone who has implemented large computer systems knows that you have to suppress your feelings, or else the 100-hour work weeks become intolerable. That naturally resulted in suppressed operating systems whose designers cared less about how the user felt. Our goal was to get something that worked and was useful.

Today, there are many capable, more or less stable systems that provide good feature-rich functionality. They do, however, need a rich company to upgrade those rich features so that the system becomes ever richer. It is not uncommon to update systems at least once a month, if for no other reason than to add the latest security patch.

Supercomputers provide an easy way out. Their operating systems don't need lots of sophistication because a graphical user interface (GUI) is unnecessary. The GUI is important, but it can live entirely on a PC that is connected to the supercomputer, so the user works through the PC's interface. The supercomputer's operating system can be cut down to the minimum features needed to support the applications, not the users. PC operating systems that are ported to supercomputers don't make a lot of sense. Something straightforward, like Linux, does. PCs are tolerant of users; supercomputers are good at running applications fast. The two don't need to intertwine.

More supercomputers run Linux than ever before. This system is free for the taking and is maintained by a whole raft of passionate enthusiasts, all hoping to prove that Linux is God's gift to computers. The gift part is OK; the caveat is that upgrades need to be carefully chosen. That's the downside of free software. The good news is that a basic Linux system is all we need, along with some carefully chosen application-support codes for multithreading, etc. These support codes will probably cost some real money, but they are typically well supported and improve dramatically over time due to competition.

Inter-processor communication is vital. Applications can be spread over tens of thousands of processors, and all the processors need to stay more or less synchronized. This synchronization requires calls to the operating system. Therefore, the time the OS takes to process those calls is vital for overall application performance.
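To see why the per-call cost matters, here is a back-of-the-envelope sketch in Python. The function name and the numbers are invented for illustration, and the model assumes the simplest possible case: every processor computes for a while, then makes one synchronizing OS call, and nobody moves on until that call returns.

```python
def parallel_efficiency(compute_us: float, os_call_us: float) -> float:
    """Fraction of each step spent on useful work.

    Assumes one synchronizing OS call per compute step. This is a
    hypothetical, deliberately simple model; real applications overlap
    computation and communication.
    """
    if compute_us <= 0 or os_call_us < 0:
        raise ValueError("compute_us must be positive and os_call_us non-negative")
    return compute_us / (compute_us + os_call_us)


# Illustrative numbers only: 100 microseconds of work per step.
for os_call_us in (1.0, 10.0, 100.0):
    eff = parallel_efficiency(100.0, os_call_us)
    print(f"OS call {os_call_us:6.1f} us -> efficiency {eff:.1%}")
```

In this toy model, once the OS call takes as long as the work between calls, half of every processor's time is spent waiting.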

The applications' approach to OS performance issues is to simply bypass the OS and communicate directly with other threads through the Message Passing Interface (MPI). Some MPI implementations cleverly allow direct communication to proceed without confusion, even though the OS is bypassed. The MPI stack is handled as subroutine calls at the application level. MPI can further reduce OS complexity. Wouldn't it be nice if the OS would support direct MPI calls?

An area that needs creative attention is support for troubleshooting and the display of system status. Supercomputers typically have thousands of processors with millions of possible communication paths. There isn't any predictable processor-to-processor path. This can lead to anxiety when you are trying to track down a communication bug or a lazy processor. A good system status display that doubles as a debugging tool can be a lifesaver, if only one existed...
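As a rough illustration of both problems, the Python sketch below counts the possible processor-to-processor paths and flags a lazy processor from per-step timings. Everything in it is hypothetical: the function names, the timing data, and the rule of thumb that "lazy" means taking more than twice the median step time. It is a thought experiment, not a real monitoring tool.

```python
from statistics import median


def path_count(processors: int) -> int:
    """Number of distinct processor pairs, i.e. possible point-to-point paths."""
    return processors * (processors - 1) // 2


def lazy_processors(step_seconds: dict[int, float], slack: float = 2.0) -> list[int]:
    """Return IDs of processors whose last step took more than
    `slack` times the median step time (an assumed rule of thumb)."""
    if not step_seconds:
        return []
    typical = median(step_seconds.values())
    return sorted(p for p, t in step_seconds.items() if t > slack * typical)


print(path_count(2000))          # 1999000

# Made-up timings: processor ID -> seconds for the last step.
timings = {0: 1.0, 1: 1.1, 2: 0.9, 3: 4.2}
print(lazy_processors(timings))  # [3]
```

A mere 2,000 processors already give nearly two million possible pairings, which is why a status display that narrows the search would be so welcome.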

So, a good way to keep supercomputers out of the office is to use a custom operating system that inhibits direct application-to-application communication. This provides two crippling negative benefits. One of these is its cost. A custom operating system will suck up lots of R&D money to keep it going, so you won't get one of your own unless you are important. Management balks at important people; they are expensive. The other benefit will be your cube size. You will have to clone yourself at least once to keep your application rolling with the blows dealt by your punchy operating system. Yep, you will need a bigger cube. Management balks at big cubes, but not if there are lots of people in each of them.