CPU pipeline count

Vitric9

Distinguished
I can't explain it without putting my foot in my mouth. Here is a link that is easy enough to follow: http://mail.humber.ca/~paul.michaud/Pipeline.htm
But to answer your question, it depends on the technology of the CPU.
 


I went to Humber College for a couple of years before transferring to McMaster University. I had Paul Michaud as an instructor :)

@OP:

The number of pipeline stages is not, by itself, an indicator of performance. The introduction of pipelining represented a massive departure from the classic CISC* architecture, which was not pipelined, but pipeline depth has varied quite a bit since then. In general, a larger number of pipeline stages is necessary to increase the peak operating frequency of the microprocessor, but deeper pipelines come with larger penalties for incorrect branch predictions and are much harder to design. The Northwood and Prescott iterations of the NetBurst architecture (Pentium 4) used 20 and 31 pipeline stages respectively, which allowed operating frequencies of nearly 4 GHz on Prescott's 90 nm lithographic process. Modern Intel and AMD architectures use variable-depth pipelines ranging between 14 and 19 stages on 22-32 nm lithographic processes.

If you're interested in learning a bit about microarchitectures, I would suggest studying the classic 5-stage RISC pipeline as well as the 8-stage MIPS pipeline.

EDIT: *The classic CISC architecture is dead. All modern microprocessors operate using RISC-style operations. All modern x86 microprocessors include a decoding frontend which decodes a single x86 CISC instruction into one or more architecture-specific RISC micro-ops. The reason for this is that classic CISC operations are impossible to pipeline by design.
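To make the stage idea concrete, here is a back-of-the-envelope sketch (a textbook-style toy model, not taken from the posts above) comparing how many cycles N instructions take with and without pipelining. The 5-stage case is the classic RISC pipeline (IF, ID, EX, MEM, WB); the deeper depths are only there to show the trend.

# Toy cycle-count model of an ideal in-order pipeline.
# Ignores hazards, stalls and memory latency entirely.

def unpipelined_cycles(n_instructions: int, stages: int) -> int:
    # Each instruction must pass through every stage before the next one starts.
    return n_instructions * stages

def pipelined_cycles(n_instructions: int, stages: int) -> int:
    # The first instruction takes 'stages' cycles to fill the pipe;
    # after that, one instruction completes per cycle.
    return stages + (n_instructions - 1)

if __name__ == "__main__":
    N = 1000
    for depth in (5, 8, 31):  # classic RISC, MIPS R4000, Prescott-like depths
        plain = unpipelined_cycles(N, depth)
        piped = pipelined_cycles(N, depth)
        print(f"{depth:2d} stages: {plain} vs {piped} cycles "
              f"(~{plain / piped:.1f}x throughput from overlapping stages)")

Note that the deeper pipe doesn't finish any single instruction faster; it just lets the clock run faster and keeps more instructions in flight at once.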
 
Solution

psychoman

Honorable


But my question still remains: will more pipelines give better core performance in general?
 
No, it is heavily dependent upon the load.

Long pipelines are great for very linear tasks where you can predict what instruction you will need. If you have something with lots of data dependencies and if statements that you mispredict, you spend a lot of your time with a rather empty pipeline. The P4 NetBurst architecture is a great example of a long pipeline that didn't give the performance that was hoped for; the Core architecture moved to a much shorter pipeline setup.
https://en.wikipedia.org/wiki/NetBurst_(microarchitecture)#Hyper_Pipelined_Technology
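Here's a rough toy model of why a deep pipe like NetBurst's hurts on branchy code (my own illustrative numbers, not from the Wikipedia article; real designs resolve branches before the last stage, so the flush penalty is somewhat shorter than the full depth):

# Toy model: effective CPI when mispredicted branches flush the pipeline.
# Assumes the whole pipe refills after a miss, which is a simplification.

def effective_cpi(depth: int, branch_fraction: float, mispredict_rate: float) -> float:
    base_cpi = 1.0         # one instruction per cycle when the pipe stays full
    flush_penalty = depth  # cycles lost refilling the pipe after a miss
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

if __name__ == "__main__":
    # Hypothetical branchy workload: 20% branches, 10% of them mispredicted.
    for depth in (14, 20, 31):
        cpi = effective_cpi(depth, branch_fraction=0.20, mispredict_rate=0.10)
        print(f"{depth:2d}-stage pipe: effective CPI ~ {cpi:.2f} "
              f"({1.0 / cpi:.2f} instructions per cycle)")

The higher clock of the deeper pipe has to outrun that growing misprediction tax, which is exactly the bet NetBurst lost on a lot of real-world code.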
 

psychoman

Honorable



I'm not asking about the count of the stages of the pipeline, I'm asking about the pipelines themselves, because, for example, Haswell and Piledriver have 4 pipelines per core as far as I know. But that's the problem, I don't know much about it and that's why I want to learn :)
 




I think you might be confusing threads and cores with the pipeline.
 

psychoman

Honorable


I don't think so.
Here: http://images.anandtech.com/doci/6201/Screen%20Shot%202012-08-28%20at%204.38.05%20PM.png
See, each core has 4 pipelines; there are 2 cores with 4 pipelines each, and I'm asking: if that number of pipelines increases, will it give better performance?

 
Ohhhh, then you are instead asking more about multithreading. Having more cores doesn't make a difference if your code doesn't permit you to use them, so again, it is heavily dependent upon the load.

If you need to add several independent numbers together, then you benefit from more cores and more pipelines because you can spread that load out. But if you have something like this:
Z = X + Y
W = U - V
T = (Z/B) - C*(D - W/C)
Print T
you can do Z and W in parallel, but when you get to that big expression you cannot proceed until you calculate Z/B and W/C. Everything is put on hold until those divisions (which are slow) are done and you can solve for T. Once you hit that big expression, it doesn't matter if you have 2 pipelines or 200; you are being restricted by a serial operation. See Amdahl's law: the best speedup you can get is determined by the ratio of parallelizable to serial code in your program. If your program is entirely serial, it doesn't matter how many cores or pipelines you have, it will always go the same speed.
https://en.wikipedia.org/wiki/Amdahl's_law
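Amdahl's law itself is a one-liner if you want to play with it (the 50/90/99% parallel fractions below are just illustrative numbers):

# Amdahl's law: overall speedup is capped by the serial fraction,
# no matter how many cores or pipes work on the parallel part.

def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)

if __name__ == "__main__":
    for p in (0.50, 0.90, 0.99):
        for n in (2, 8, 200):
            print(f"parallel fraction {p:.2f}, {n:3d} units -> "
                  f"speedup {amdahl_speedup(p, n):.2f}x")

Even with 200 units, a program that is only 50% parallelizable never gets past a 2x speedup.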
 

psychoman

Honorable


Thanks, I think I understood everything, thanks very much.
I just want to sum up what you said so I can make sure I got it.
So basically more pipelines are like more cores: if the app, program, or in this case the instructions can't use more than, let's say, 4, then it will not give you any performance boost, but if it can use the additional pipelines then, like with having more cores, it will give you a performance boost. Thanks :)
 
Correct. The image you linked earlier was actually just showing the integer units having 4 pipelines each, which means each integer pipeline can perform a single operation, letting you do 8 additions, subtractions, multiplications, or divisions in parallel, but only on integers. If you are dealing with floating point numbers (aka decimals, e.g. 1.23456), you only have those 2 FMAC pipelines to work with, since an integer unit can't deal with decimal points. If you are adding a whole bunch of integers you can do it really fast since you have 8 integer pipes, but if you need to add 1.1 to a whole bunch of numbers you are constricted to those two FMAC pipes.

These smaller pipelines are similar to cores but not the same: a core would have its own registers, schedulers, cache, memory unit, FPU, and ALU, whereas each of those smaller pipelines shown in the picture only duplicates stages of the ALU or FPU respectively.


In the end, everything about computer performance depends upon the load. You can optimize something for a massive number of very tiny operations (GPU) or for fewer, larger, more complex operations (CPU), but you always trade off performance with the other type. This is why people test CPUs and GPUs using a wide variety of tests.
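To put that best-case arithmetic in code (upper bound only; it uses the pipe counts from the diagram linked above and assumes every operation is independent, which real code rarely manages):

import math

# Best-case throughput: with k pipes of the right type you can start at most
# k independent operations per cycle. Ignores latency, dependencies and scheduling.

def min_cycles(n_ops: int, n_pipes: int) -> int:
    return math.ceil(n_ops / n_pipes)

if __name__ == "__main__":
    N = 10000            # independent additions to perform
    int_pipes = 8        # the 2 x 4 integer pipes discussed above
    fp_pipes = 2         # the shared FMAC pipes
    print(f"{N} integer adds: at least {min_cycles(N, int_pipes)} cycles")
    print(f"{N} float adds:   at least {min_cycles(N, fp_pipes)} cycles")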
 


Those are execution pipes, not pipeline stages. Pipeline stages are temporal (divided in time) whereas execution pipes are spatial (divided in space).

In general, each execution pipe can transport one macro-op (an architecture-specific RISC operation decoded from an x86 CISC instruction) from the reservation station to one execution unit located on that pipe per clock cycle.

Increasing the number of execution pipes most certainly does increase performance, and it is the driving factor behind the massive disparity in per-core performance between AMD's FX series microarchitecture and Intel's Core series microarchitecture.

However, increasing the number of execution pipes is very, very difficult. Intel has the R&D budget to do so; AMD does not.

Disclaimer: I'm rather drunk so this may or may not make sense.
 

psychoman

Honorable


It makes sense, and thank you for answering my question :)