If the CPU is capable of executing millions of instructions per second...

revoo

Reputable
Jul 20, 2014
15
0
4,510
Then why aren't CPU-oriented tasks like video encoding, physics, and other work nearly instantaneous? If the CPU is constantly waiting for data from DRAM, why do we even need a processor with higher performance if it can never reach peak performance? There must be a limit to the processor speed that is effective.
 
Solution
Because there are millions of pixels in a single frame of video, and tens/hundreds of billions in a movie.

CPUs aren't always waiting on DRAM. That's why we have caching, pre-fetch, and similar tech.

There's lots of processing power, but lots of processing to do.
Well, Windows 7 has about 40 million lines of code, Facebook has about 60 million. Games execute tight loops over small pieces of code to render terrain. For video rendering, etc., every pixel gets processed. Do a rough calculation of how many pixels you are rendering. Astrophysics and nuclear energy modeling takes days on petaflop supercomputers. Your GPU has to process every pixel, as well as render all textures for every frame. Calculate how many operations that would be if each took a single instruction, which it does not.
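
To put rough numbers on it, assuming a 1080p frame: 1920 x 1080 ≈ 2.07 million pixels, so at 60 frames per second that's roughly 124 million pixel updates per second. A two-hour film at 24 fps comes to about 2.07M x 24 x 7,200 ≈ 360 billion pixels, which is where the "hundreds of billions" figure above comes from.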

CPUs have extremely high-speed caches very near them (no transmission delay). In fact, modern CPU architectures have three levels of high-speed cache, and a huge part of CPU design is figuring out what data has to be in the cache, ready for the CPU to process. CPUs do things like looking ahead in the stream of instructions to pre-fetch future instructions and data; they decode instructions, and sometimes even execute instructions out of sequence if that optimizes execution without altering the outcome. A CPU may employ branch prediction to decide which branch of an "if" will be executed, and perhaps pre-fetch multiple branches of code, just in case. :)
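
The hardware does that look-ahead on its own, but compilers also expose an explicit hint that makes the idea visible in code. A minimal sketch in C using GCC/Clang's __builtin_prefetch (for a sequential walk like this the hardware prefetcher would catch it anyway):

#include <stddef.h>

/* Sum an array while asking for a cache line ~16 elements ahead.
   This mirrors what the hardware prefetcher does automatically
   for predictable, sequential access patterns. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16]); /* hint: fetch early */
        total += data[i];
    }
    return total;
}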

Small programs, such as Prime95, can load their entire instruction set, as well as their data, into L1 cache, which is accessible at the speed of the processor, thus removing all waiting and managing to drive the CPU to 100% utilization.
 

ddpruitt

Honorable
Jun 4, 2012
1,109
0
11,360
Believe it or not, your question is one of those whose answer is incredibly complex and nuanced; billions are spent each year working on the problem.

Essentially, the combination of clock frequency and instructions per cycle determines the latency of executing an instruction. In many cases out-of-order execution, pipelining, the prefetcher, and the memory architecture help to limit the effects of slow memory, although general-purpose programs rarely ever push a modern CPU. Highly optimized applications can keep a particular functional unit busy around 80% of the time, but the vast majority of the programs you run compute in short discrete bursts and sit idle around 80% of the time. Take something like BF4: the game figures out what has to be done for a single frame, dispatches the appropriate pieces where they go (frame to GPU, audio to sound card, etc.), and then goes to sleep until it needs to ready the next frame, as sketched below.
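
A rough sketch of that burst-and-sleep pattern in C. The engine hooks here are made-up stand-ins, not anything from BF4:

#include <stdbool.h>
#include <time.h>

/* Stand-in engine hooks -- placeholders, not real engine calls. */
static bool game_running(void)   { static int frames; return ++frames <= 600; }
static void simulate_frame(void) { /* game logic, physics, AI        */ }
static void submit_to_gpu(void)  { /* hand the frame to the GPU      */ }
static void submit_audio(void)   { /* hand buffers to the sound card */ }

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Burst-and-sleep: do one frame's work, dispatch it, then sleep off
   the rest of the ~16.7 ms budget. The CPU is "busy" only in bursts. */
int main(void)
{
    const double frame_budget_ms = 1000.0 / 60.0;
    while (game_running()) {
        double start = now_ms();
        simulate_frame();
        submit_to_gpu();
        submit_audio();
        double spare = frame_budget_ms - (now_ms() - start);
        if (spare > 0) {
            struct timespec nap = { 0, (long)(spare * 1e6) };
            nanosleep(&nap, NULL); /* idle until the next frame is due */
        }
    }
    return 0;
}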

Currently no CPU can, or is designed to, go flat out; it would melt in short order. Current CPU design is focused on reduced power usage, tighter integration, and increasing concurrency. If you look at the numbers, the raw power of a single core really hasn't increased that much, but FLOPS per watt has. If you really want to see where we're headed, look at the HPC space: they're currently looking to break the exaflop barrier, and while there are a few kinks to work out, those machines are basically massively parallel.

Someday we won't rate CPUs by gigahertz but by gigacores.
 

ddpruitt

Honorable
Jun 4, 2012
1,109
0
11,360
Sorry, but this answer is so wrong and misinformed that I just had to reply.

> Well, Windows 7 has about 40 million lines of code, Facebook has about 60 million.

Actually the kernel has 40 million lines; Facebook is nowhere near 60 million (likely nearer 100,000).

> Games execute tight loops over small pieces of code to render terrain.

"Tight loop" has a very narrow definition when it comes to code, but I'll overlook that. Games don't just iterate over loops; at least, well-designed ones don't. They're event-based.

> For video rendering, etc., every pixel gets processed. Do a rough calculation of how many pixels you are rendering. Astrophysics and nuclear energy modeling takes days on petaflop supercomputers. Your GPU has to process every pixel, as well as render all textures for every frame.

There are two problems here. First, the CPU rarely ever touches a pixel; it touches an object, and that gets sent to a GPU that handles pixels. A GPU is a highly parallel SIMD machine that can apply a single operation to thousands of data points at once, whereas a CPU is a general-purpose machine. The memory architecture for GPUs is highly specialized, and the GPU generally doesn't have to wait for a write to its memory to complete before moving on. You're also mixing particles and textures; same thing, different scale. Second, physics modeling doesn't render or compute pixels. Calculations are done on individual particles and the effects propagated throughout the system. The general bottleneck in one of these systems is messaging.
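
To make "one operation, many data points" concrete, here's a minimal SIMD sketch in C using x86 SSE intrinsics. A GPU does the same thing across thousands of lanes instead of four:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics (x86) */

int main(void)
{
    /* One instruction adds four floats at once -- the same idea a
       GPU applies across thousands of lanes per cycle. */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);   /* {11, 22, 33, 44} in one op */

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}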

> CPUs have extremely high-speed caches very near them (no transmission delay). In fact, modern CPU architectures have three levels of high-speed cache.

Depends on the CPU. ARM only requires a single level of cache; Intel and AMD, with three levels, are the exception rather than the rule. And transmission delay and latency are two different things.

> CPUs do things like looking ahead in the stream of instructions to pre-fetch future instructions and data; they decode instructions, and sometimes even execute instructions out of sequence if that optimizes execution without altering the outcome.

Actually this isn't entirely true. Instructions and data are prefetched only if memory access occurs in a predictable pattern. That is generally the case, but certain types of computations are inherently non-deterministic, and prefetching doesn't work for them. Out-of-order execution can change the result of optimized versus unoptimized code. It's rare, but it's something one has to be aware of.

> A CPU may employ branch prediction to decide which branch of an "if" will be executed, and perhaps pre-fetch multiple branches of code, just in case.

Virtually all architectures have some sort of branch predictor, but only a single branch is prefetched. If you could load both branches and operate on them simultaneously, you could simply discard the one you didn't need; since you can't, you need the predictor.
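
You can actually measure the predictor: the loop below runs once over random data (the branch is a coin flip) and once over sorted data (the branch becomes predictable). Timings are illustrative and assume the compiler doesn't optimize the branch away (build with something like -O1); the sorted pass is typically several times faster:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

/* Same work, different branch behavior: on random input the "if"
   mispredicts about half the time; on sorted input the predictor
   locks on almost immediately. */
static long sum_over_threshold(const int *v, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        if (v[i] >= 128)          /* hard to predict on random input */
            total += v[i];
    return total;
}

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    static int v[N];
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;

    clock_t t0 = clock();
    long r1 = sum_over_threshold(v, N);    /* random: mispredicts */
    clock_t t1 = clock();

    qsort(v, N, sizeof v[0], cmp_int);     /* sorted: predictable */
    long r2 = sum_over_threshold(v, N);
    clock_t t2 = clock();

    printf("random: %.3fs  sorted: %.3fs  (sums %ld %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, r1, r2);
    return 0;
}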

> Small programs, such as Prime95, can load their entire instruction set, as well as their data, into L1 cache...

An instruction set is part of a CPU micro-architecture; you're referring to the working set. It's doubtful that Prime95 fits entirely within the L1, though the part that does the heavy lifting may. You don't have to fit the entire program into cache, just the part that's running at the moment. This property is referred to as locality, and caches are designed with it in mind.
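
Locality is easy to demonstrate. The two loops below do identical arithmetic, but the first walks memory in the order it's laid out while the second jumps a full row's width every step, missing cache on almost every access:

#include <stdio.h>
#include <time.h>

#define R 2048
#define C 2048
static double m[R][C];   /* ~32 MB, far larger than any cache */

int main(void)
{
    double sum = 0;
    clock_t t0 = clock();
    for (int r = 0; r < R; r++)      /* row-major: cache-friendly */
        for (int c = 0; c < C; c++)
            sum += m[r][c];
    clock_t t1 = clock();
    for (int c = 0; c < C; c++)      /* column-major: cache-hostile */
        for (int r = 0; r < R; r++)
            sum += m[r][c];
    clock_t t2 = clock();
    printf("rows: %.3fs  cols: %.3fs  (sum %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    return 0;
}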

> ...which is accessible at the speed of the processor...

In general, L1 latency is around 4 to 5 cycles, L2 around 10 to 15, and L3 around 30 to 60, with main memory in the hundreds. Even though the caches run at the speed of the CPU, you still have to wait.

> ...thus removing all waiting and managing to drive the CPU to 100% utilization.

You can't drive a CPU to 100%. You can make it look like the CPU is loaded 100% of the time, but all that means is that there's an instruction waiting to be retired 100% of the time. In reality, a CPU at 100% load is really utilizing only around 30% of its transistors (not counting the caches). Most CPUs have more functional units than dispatch units, since code tends to use only a few functional units at a time.
 

ddpruitt

Honorable
Jun 4, 2012
1,109
0
11,360


Actually the load is 100%. All that means is that for 100% of the time (or 50%, or whatever) there's an instruction in the pipeline being executed by the CPU. 100% utilization would mean that all of the functional units are running and not stalled, and that the pipeline is completely full. In general most CPUs have three or more functional units and only about half that number of dispatch units, and usually only one or two of the functional units are in use at any one time. In general, a CPU at 100% load is only using around 30% of the transistors on the chip.
 


Google this:
how many line of code in windows 7

I found it surprising as well.

[attached screenshot: Google search result showing the Windows 7 lines-of-code figure]
 

RobCrezz

Expert
Ambassador

> Actually the load is 100%. All that means is that for 100% of the time (or 50%, or whatever) there's an instruction in the pipeline being executed by the CPU.

Yeah, fair enough, I misunderstood what you said. But you are correct, as I'm sure the different levels of cache and FPUs etc. are not all fully loaded.
 

ddpruitt

Honorable
Jun 4, 2012
1,109
0
11,360

> Google this: how many line of code in windows 7. I found it surprising as well.

Actually they were incapable of reading when they quoted the numbers. Facebook's lines-of-code figure adds in all the backend services, binaries, and a bunch of other stuff; basically they're counting everything three times, adding in the OSes, databases, and everything else, so the number is bogus. The Windows reference is basically a circular reference citing itself; as I recall, XP was around 15 million lines for the kernel. The problem with these comparisons is how you define an operating system: some people count just the kernel (Linux), some include everything down to the GUI (Windows). You have to be careful comparing lines of code; people tend to inflate the numbers to make it look like they have more than they do. Although it's pretty impressive when you consider that the software running the LHC has more parts than the LHC itself.
 
CPU Instruction != SLOC.

Take something like this:

if (x + y > 1) then

That's one SLOC, but to the CPU, it looks like this:

LOAD X into R1
LOAD Y into R2
ADD R1 and R2; STORE in R1
LOAD 1 in R2
COMPARE R1 to R2 (Branch occurs based on result)

1 SLOC turns into 5 instructions. And remember, in games, the idea is to try and run everything 60 times every second. Adds up very quickly that way.
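
You can watch that expansion happen yourself: gcc's -S flag writes the generated assembly to a .s file instead of compiling to an object file. A tiny example:

/* sloc.c -- compile with: gcc -O0 -S sloc.c, then read sloc.s to see
   the load/add/compare/branch sequence generated for one line of C. */
int over(int x, int y)
{
    if (x + y > 1)
        return 1;
    return 0;
}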