Applications that run efficiently (i.e., fast) on supercomputers can be difficult to write. This is largely because a single application must be segmented in a way that disperses the work across many processors. Two or four processors don't pose much of a problem, but tens of thousands do. Luckily, processor architecture provides two avenues for segmenting code for processor arrays: multithreading and vectorization. The ease of application segmentation ranges from the embarrassingly simple to the downright impossible.
With apologies to the graphics industry, an embarrassingly simple example is video frame generation, where each frame can be created on a separate processor. A difficult yet solvable example would be to divide each video frame into lots and lots of segments. A really, really difficult example would be to segment code that has neither loops nor recursion. Luckily, the latter applications are hard to find. Even so, I believe that it is possible to segment even really, really difficult applications. Such applications need to be rethought, or morphed into a different algorithm that either includes some recursion or processes data in batches. The caveat is that the new algorithm is often much more complex. The law of diminishing returns is alive and well: complexity can increase to the point where it's not really worth it.
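The embarrassingly simple case can be sketched in a few lines. Here `render_frame` is a hypothetical stand-in for real rendering code, and a thread pool stands in for an array of processors; a real renderer would hand each frame to a separate processor or machine.

```python
from concurrent.futures import ThreadPoolExecutor

def render_frame(frame_number):
    # Hypothetical stand-in for real frame generation: the "pixels"
    # depend only on the frame number, so no frame needs another's output.
    return [frame_number * 3 + p for p in range(4)]

def render_clip(num_frames, workers=4):
    # Because every frame is independent, the work splits cleanly
    # across the pool -- the embarrassingly parallel case.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_frame, range(num_frames)))
```

The same structure fails the moment one frame needs another's output, which is exactly what makes the harder cases hard.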
Compilers provide a means for transforming a problem solution into machine code that can run on various processor architectures. Modern compilers tend to be multipurpose tools, in that they are used for both system and application development. There is a healthy trend toward splitting compiler languages into either systems syntax or application syntax, but there are miles to go. Modern compiler languages do a good job of providing many different ways to write application code.
To fully understand the complexity and nuances of a compiler language, you will need a brain the size of a bathtub. In other words, many compiler languages are unfortunately focused on the elegance of the solution, not the problem itself. There's lots of entrepreneurial work to do here. What is really needed is a high-level syntax that describes the algorithm, not the details of the algorithm's solution. This gives compiler writers freedom to better match applications to processor hardware.
Compiler writers have devoted enormous effort to detecting vector and array-oriented operations in application code. If a loop is found that matches an array instruction, voilà! Today, compilers basically take a brute-force approach to the problem: they look for specific kinds of loops that lend themselves to array instructions. Brute-force approaches are not very smart, and they take a bruising whenever a loop doesn't fit one of the patterns they can transform into simpler array operations. Believe me when I say there are plenty of those. If you can figure out a general way to vectorize loops, or to construct simpler arrays from them to match microprocessors' native instructions, you will become a visible entrepreneur of the rich-and-famous category.
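The distinction compilers look for can be illustrated with two toy loops (a sketch, not how any particular compiler works). The first has independent iterations, the classic SAXPY shape that maps onto a single vector instruction; the second has a loop-carried dependence that defeats naive pattern matching, though a rethought algorithm (a parallel prefix scan) could still recover parallelism.

```python
def scale_add(a, b, c):
    # y[i] = a*b[i] + c[i]: every iteration is independent, so a
    # vectorizing compiler can map the whole loop onto array hardware.
    return [a * bi + ci for bi, ci in zip(b, c)]

def running_sum(xs):
    # s[i] = s[i-1] + x[i]: each iteration reads the previous result.
    # This loop-carried dependence is the kind of loop brute-force
    # pattern matching gives up on.
    out, s = [], 0
    for x in xs:
        s += x
        out.append(s)
    return out
```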
Application algorithms form some sort of sequential logic, which usually contains many segments, many of which can be broken down into individual threads. The threads can be launched and synchronized through normal operating system extensions, which lends itself to speed-ups by running many threads in parallel.
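A minimal sketch of that launch-and-synchronize pattern, using Python's standard threading module (the lock and `join()` play the role of the operating system extensions mentioned above):

```python
import threading

def parallel_count(num_threads=4, per_thread=1000):
    # Each thread is one segment of the sequential logic.
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(per_thread):
            with lock:        # serialize updates to the shared total
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()             # launch every segment
    for t in threads:
        t.join()              # wait for every segment to finish
    return total
```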
For iterative code, the application can be pipelined, much like a microprocessor pipeline. This gets many iterations going in parallel and avoids many synchronization pitfalls. The problem comes when an application does not lend itself to multithreading. Such an application can still be approached by viewing its algorithm as a set of horizontal segments instead of a vertical sequence of instructions. This requires a complete rethinking of the algorithm, but it can almost always yield some sort of efficient multithreading.
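The pipelining idea can be sketched with two stages connected by queues: stage 1 can start iteration i+1 while stage 2 is still finishing iteration i, just as a processor pipeline overlaps instructions. The stage bodies here are placeholder arithmetic, assumed only for illustration.

```python
import queue
import threading

def pipeline(items):
    # Two-stage software pipeline: stages overlap across iterations.
    q1, q2 = queue.Queue(), queue.Queue()
    DONE = object()  # sentinel that flushes the pipeline

    def stage1():
        for item in items:
            q1.put(item * 2)      # first half of each iteration
        q1.put(DONE)

    def stage2():
        while True:
            item = q1.get()
            if item is DONE:
                q2.put(DONE)
                return
            q2.put(item + 1)      # second half, overlapped with stage 1

    for stage in (stage1, stage2):
        threading.Thread(target=stage).start()

    results = []
    while True:
        item = q2.get()
        if item is DONE:
            return results
        results.append(item)
```

The queues do the synchronization implicitly, which is why pipelining sidesteps many of the usual locking pitfalls.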
An example familiar to me goes back to the early '70s, when I was working at CDC's Advanced Design Laboratory, writing a COBOL compiler for the STAR-100 vector computer. The instruction set of that computer was modeled after Ken Iverson's book, "A Programming Language." One of the architects, Neil Lincoln, wrote a FORTRAN compiler that had no test or loop statements; it consisted entirely of 147 sparse-vector operations, each of which was a single instruction on that machine. Needless to say, the compiler ran like the clappers, and Neil deserves a big round of applause for the achievement. I mention this because creative juices run much stronger than preconceptions. Preconceived ideas ultimately turn sour because they stop creativity; they make it difficult for time to march on.
So, if you want to keep a supercomputer out of the office, sell management on the perception that applications need to run untouched. Management will not be able to find a supercomputer that runs all your applications efficiently enough to warrant its cost. You can also convince management that redoing the applications takes lots of effort and lots of money, and that all of them will need to be re-ported as processor architectures improve. This is particularly fun when the original developers have moved on and the source code cannot be found. Management will balk at the extra cost of the additional people and cubes needed. This guarantees that the supercomputer, at least in the near term, will not be cost-effective, and it saves lots of money, because it would be expensive to buy one for general use.
Still, if you want to become a rich entrepreneur, figure out easy ways to rewrite algorithms for applications, either by developing smarter compilers or by becoming proficient at rethinking algorithms. Or, better yet, develop a system where the problem can be stated in high-level terms that are converted directly into machine language. This would bypass the need to script difficult-to-understand solutions that are even more difficult to port to supercomputer architectures.