Sorry, but this answer is so wrong and misinformed that I just had to reply.
Karsten75 :
Well, Windows 7 has about 40 million lines of code, Facebook has about 60 million.
Actually the kernel has 40 million lines; Facebook is nowhere near 60 million (likely nearer 100,000).
Karsten75 :
Games executes tight loops over small pieces of code to render terrain.
"Tight loop" has a very narrow definition when it comes to code, but I'll overlook that. Games don't spin in tight loops over their logic; well-designed engines are event-driven, with a frame loop that dispatches work as events arrive.
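To make the distinction concrete, here's a minimal sketch of an event-driven frame loop (hypothetical names, not any real engine's API): logic runs only in response to queued events, and rendering happens once per frame rather than in a tight loop over game code.

```python
from collections import deque

class Engine:
    def __init__(self):
        self.events = deque()
        self.handlers = {}        # event name -> callback
        self.frames_rendered = 0

    def on(self, name, callback):
        self.handlers[name] = callback

    def post(self, name, payload=None):
        self.events.append((name, payload))

    def frame(self):
        # Drain pending events (input, timers, network) ...
        while self.events:
            name, payload = self.events.popleft()
            handler = self.handlers.get(name)
            if handler:
                handler(payload)
        # ... then issue exactly one render pass for this frame.
        self.frames_rendered += 1

engine = Engine()
moves = []
engine.on("key", lambda key: moves.append(key))
engine.post("key", "W")
engine.post("key", "A")
engine.frame()
print(moves)                    # ['W', 'A']
print(engine.frames_rendered)   # 1
```

The point is that the CPU is idle (or doing other work) between events; it is not burning cycles iterating over the game's code.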
Karsten75 :
for video rendering, etc, every pixel gets processes. Do a rough calculation of how many pixels you are rendering. Astrophysics and nuclear energy modeling takes days on
petaflop super computers. Your GPU has to process every pixel, as well as render all textures for every frame. Calculate how many operations would that be if each has a single instruction, which it has not.
There are two problems here. First, the CPU rarely ever touches a pixel; it touches an object, which gets sent to the GPU, and the GPU handles pixels. A GPU is a highly parallel SIMD machine that can apply a single operation to thousands of data points at once, whereas a CPU is a general-purpose machine. GPU memory architecture is highly specialized, and the GPU generally doesn't have to wait for a write to its memory to complete before moving on. You're also mixing up pixels and textures: same thing at a different scale. Second, physics modeling doesn't render or compute pixels. Calculations are done on individual particles and the effects are propagated throughout the system; the usual bottleneck in such a system is messaging.
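The SIMD point is easy to show with a toy model (conceptual only, not real GPU code): one SIMD "instruction" operates on a whole lane of elements, so the same work retires far fewer instructions than a scalar loop. The lane width of 4 is an illustrative assumption.

```python
def scalar_add(a, b):
    out, ops = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        ops += 1              # one element per instruction
    return out, ops

def simd_add(a, b, lane=4):
    out, ops = [], 0
    for i in range(0, len(a), lane):
        # one "instruction" adds a whole lane of elements at once
        out.extend(x + y for x, y in zip(a[i:i + lane], b[i:i + lane]))
        ops += 1
    return out, ops

a, b = list(range(16)), list(range(16))
scalar_out, scalar_ops = scalar_add(a, b)
simd_out, simd_ops = simd_add(a, b)
assert scalar_out == simd_out
print(scalar_ops, simd_ops)   # 16 4
```

Real GPUs run thousands of such lanes in parallel, which is why per-pixel work that would crush a CPU is routine for them.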
Karsten75 :
CPUs have extremely high-speed caches very near them (no transmission delay). In fact, modern CPU architecture has 3 levels of high-speed cache and a huge part of CPU design is to figure out what data has to be in the cache to be ready for the CPU to process.
Depends on the CPU. ARM only requires a single level of cache; Intel and AMD, with their three levels, are the exception rather than the rule. Also, transmission delay and latency are two different things.
Karsten75 :
CPU design does things like looking ahead in the stream of instructions to pre-fetch future instructions and data, it decodes instructions, sometimes it even executes instructions out of sequence if that optimizes execution without altering the outcome.
Actually this isn't entirely true. Instructions and data are prefetched only if memory access occurs in a predictable pattern. That's generally the case, but certain kinds of computation have inherently unpredictable access patterns, and prefetching doesn't help them. Out-of-order execution can also surface differences between optimized and unoptimized code, most notably in multi-threaded code where memory ordering matters; it's rare, but it's something one has to be aware of.
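A toy model of a stride prefetcher shows why the access pattern matters (illustrative sketch, not how any real prefetcher is implemented): it predicts the next address as last address plus last stride, which works perfectly on a sequential stream and essentially never on a scattered, pointer-chasing-like stream.

```python
import random

def prefetch_hit_rate(addresses):
    """Fraction of accesses a last-stride predictor would have prefetched."""
    hits = 0
    for i in range(2, len(addresses)):
        stride = addresses[i - 1] - addresses[i - 2]
        predicted = addresses[i - 1] + stride
        if predicted == addresses[i]:
            hits += 1
    return hits / (len(addresses) - 2)

sequential = list(range(0, 4096, 64))          # fixed 64-byte stride
random.seed(0)
scattered = random.sample(range(1 << 20), 64)  # unpredictable addresses

print(prefetch_hit_rate(sequential))  # 1.0
print(prefetch_hit_rate(scattered))   # essentially 0
```

On the scattered stream, every load stalls for full memory latency because the hardware cannot guess the next address before the current load completes.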
Karsten75 :
It may employ branch prediction to decide which branch of an "if" will be executed, and perhaps store multiple branches of code as pre-fetched, just in case.
Virtually all architectures have some sort of branch predictor, but only a single branch is prefetched. If a CPU could fetch both sides of a branch and execute them simultaneously, it could simply discard the path it didn't need and wouldn't need a predictor at all; the predictor exists precisely because it can't.
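For the curious, here is a sketch of the classic two-bit saturating-counter predictor (a simplified model; real predictors are far more elaborate). The two bits give the predictor hysteresis, so a loop branch that is taken many times and falls through once only mispredicts on the fall-through.

```python
def predict_stream(outcomes):
    """Count correct predictions from a 2-bit saturating counter."""
    counter = 2                      # states 0-1 predict "not taken", 2-3 "taken"
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # saturate toward the observed outcome
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct

# A loop branch: taken 7 times, then falls through once, repeated 4 times.
loop = ([True] * 7 + [False]) * 4
print(predict_stream(loop), "/", len(loop))   # 28 / 32
```

Each misprediction on real hardware costs a pipeline flush, which is exactly why the predicted-but-wrong path being the only one fetched is so expensive.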
Karsten75 :
small programs, such as Prime95, can load their entire instruction set, as well as data in L1 cache
An instruction set is part of a CPU's architecture, not something a program loads. You're referring to the working set. It's doubtful that all of Prime95 fits within the L1, but the part that does the heavy lifting may. You don't have to fit the entire program into cache, just the part that's running at the moment. This property is called locality, and caches are designed with it in mind.
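The working-set idea can be sketched in a few lines (hypothetical trace, illustrative numbers): what must fit in cache is the set of distinct locations touched over a window of execution, not the program's whole footprint.

```python
def working_set(trace, window):
    """Distinct locations touched in the most recent `window` accesses."""
    return len(set(trace[-window:]))

# A hypothetical program spans many locations ...
program_footprint = list(range(10_000))
# ... but its hot loop hammers only a handful of them while running.
hot_loop_trace = [0, 1, 2, 3] * 250

print(len(program_footprint))            # 10000
print(working_set(hot_loop_trace, 100))  # 4
```

A tiny working set like this is why a hot inner loop can run entirely out of L1 even when the binary is orders of magnitude larger than the cache.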
Karsten75 :
that is accessible at the speed of the processor, thus removing all waiting
In general L1 latency is around 4 to 5 cycles, L2 around 10 to 15, and L3 around 30 to 50, with main memory in the hundreds. Even though the caches run at the speed of the CPU, you still have to wait.
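Those per-level latencies fold into a single average memory access time (AMAT). A back-of-the-envelope sketch, with illustrative latencies and hit rates rather than measurements of any particular CPU:

```python
def amat(levels, memory_latency):
    """Average memory access time.

    levels: list of (latency_cycles, hit_rate), nearest cache first.
    A miss at each level pays that level's latency plus the cost below it.
    """
    cost = memory_latency
    for latency, hit_rate in reversed(levels):
        cost = latency + (1 - hit_rate) * cost
    return cost

# Assumed: L1 4 cycles / 95% hits, L2 12 / 80%, L3 40 / 60%, DRAM 200 cycles.
cycles = amat([(4, 0.95), (12, 0.80), (40, 0.60)], memory_latency=200)
print(round(cycles, 2))   # 5.8
```

Even with good hit rates, the average access is slower than the raw L1 latency, so there is always some waiting baked in.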
Karsten75 :
and managing to drive the CPU to 100$ utilization.
You can't drive a CPU to 100%. You can make it look like the CPU is loaded 100% of the time, but all that means is that at least one instruction is in flight waiting to be retired at any given moment. In reality a CPU at "100% load" is using only around 30% of its transistors (not counting the caches). Most CPUs have more functional units than dispatch units, since code tends to use only a few functional units at a time.
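A toy model makes the gap between "100% busy" and actual hardware utilization visible (hypothetical unit names and a simplified one-op-per-cycle core, purely for illustration):

```python
def unit_utilization(instruction_stream, units):
    """Fraction of unit-cycles actually used when one op issues per cycle."""
    busy = {u: 0 for u in units}
    for op in instruction_stream:
        busy[op] += 1             # one unit busy this cycle, the rest idle
    total_slots = len(instruction_stream) * len(units)
    return sum(busy.values()) / total_slots

units = ["alu0", "alu1", "load", "store", "fpu", "branch"]
# Integer-heavy code keeps an instruction retiring every single cycle ...
stream = ["alu0", "load", "alu0", "alu0", "load", "branch"] * 10
# ... yet most functional units sit idle each cycle.
print(unit_utilization(stream, units))   # 1/6, about 0.17
```

The core here never stalls, so a task manager would report 100% load, yet five of the six units are idle on every cycle.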