Performance Per Transistor

dr_hoads

Distinguished
Nov 12, 2009
7
0
18,510
Hi All,

I had a few questions for the experts here at Tom's.

For no apparent reason, I became interested in the single-thread computational performance of various CPUs (i.e., is a 3GHz P4 faster than one core of a 2GHz C2D?)... Well, it quickly became apparent that the answer was no... lol.

It did, however, lead me to some interesting thoughts (at least interesting to me), in that I noticed that a P4 was, transistor for transistor, more efficient than a Wolfdale C2D.

This led me to some interesting questions that I do not know the answers to. I was hoping that someone here could enlighten me.

Why is performance per transistor going down? Is it due to physical limitations of the transistor shrinks causing a loss of efficiency, or is it entirely a design/implementation issue? As an example of what I mean: if a 55-million-transistor 130nm Northwood P4 were built using today's tech (45nm as I write this), so that it was a 55-million-transistor 45nm P4 with all the logic kept the same, would it perform less efficiently due to changes in transistor manufacturing?

If the answer to the above question is no, would it not then be better to have an 8-core 3GHz Northwood P4 at 55 million 45nm transistors per core than to have two 205-million-transistor 45nm Wolfdale cores? Wouldn't the "new" P4 then have roughly 2x the overall computational performance?

Anyway, I would love to understand this, and I wrote a little benchmark program with some charts of the CPUs I have at home and at work. I would also love to get more data from my program from OLD CPUs (PIIs, P-MMX, etc.) if anyone has that kind of stuff lying around.

You can download and run my small program from http://astro.temple.edu/~drhoads/cpu if you want to help me collect data.
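(For anyone curious, the heart of the program is just timing a fixed chunk of single-threaded work; something roughly along these lines, greatly simplified and not the actual code:)

import time

def benchmark(iterations=10_000_000):
    # Time a fixed amount of dependent integer work on a single thread.
    start = time.perf_counter()
    x = 0
    for i in range(iterations):
        x = (x * 3 + i) % 65_521
    elapsed = time.perf_counter() - start
    return iterations / elapsed   # iterations per second, higher is better

if __name__ == "__main__":
    score = benchmark()
    print(f"single-thread score: {score:,.0f} iterations/sec")
    # Dividing this score by the CPU's transistor count gives the
    # performance-per-transistor figure I've been charting.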

Thanks to anyone for their answers or additional data.
 
HibyPrime

Distinguished
Nov 23, 2006
263
0
18,790
It's fairly easy to test: since 55 million x 8 (440 million) is so close to 2 x 205 million (410 million), it stands to reason that the dual-core Wolfdale CPU would have to be just under 8x faster than a single Northwood. That said, I remember when the Conroe architecture first came out, and I don't recall it being anywhere near that much faster, even in the best-case scenarios.
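Spelling out the back-of-the-envelope math with the transistor counts quoted in this thread (Python, just for illustration):

northwood = 55_000_000            # transistors in one Northwood P4 core
wolfdale_core = 205_000_000       # transistors per Wolfdale core, as quoted above

eight_core_p4 = 8 * northwood     # 440 million, the hypothetical chip
dual_wolfdale = 2 * wolfdale_core # 410 million, the real chip

# For equal performance per transistor, the dual-core Wolfdale would need
# about 410M / 55M = ~7.5x the performance of a single Northwood core.
print(f"hypothetical 8-core P4: {eight_core_p4 / 1e6:.0f}M transistors")
print(f"dual-core Wolfdale:     {dual_wolfdale / 1e6:.0f}M transistors")
print(f"speedup needed vs one Northwood: {dual_wolfdale / northwood:.1f}x")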

The addition of processor features is most likely the reason for this (I'm guessing); I can't see any other reason why so many engineers would overlook something that stupidly obvious.
 

dr_hoads

Distinguished
Nov 12, 2009
7
0
18,510


I suppose that makes sense if it is the addition of more cache, etc. Why, then, is it better to add more cache and features than to include another core? Is raw data processing somehow less important in modern computing, making more cache and other features more valuable?
 

HibyPrime

Distinguished
Nov 23, 2006
263
0
18,790
I'm not sure.

A quick glance at this: http://techreport.com/r.x/core-i7/die-callout.jpg shows that there is quite probably enough room on the Core i7 to remove half its cache and add a fifth core. I can't see the i7 performing that much worse with 4MB of cache.

It might have something to do with not wanting to use hugely different dies for the server variants, because I'd imagine there are a lot of commercial apps that would use the extra cache.
 
Well, let's see:
Going from 32-bit to 64-bit words just about doubles the transistor count with little gain in "efficiency" as measured in MIPS.

An improved instruction set will increase the transistor count without necessarily increasing efficiency.

And, besides you, who cares about transistor count divided by <whatever> as a measure of efficiency? :)
 

HibyPrime

Distinguished
Nov 23, 2006
263
0
18,790
We're going to need to start caring soon, right about when we hit a wall at 300W TDP @ 18nm and there's no way to improve on the power or the size. I suppose power consumption could continue to climb as they add more transistors, but I suspect consumers will start getting ticked off if they go much past 250-300W per silicon chip.

It's actually kind of scary how close we are to not being able to physically (read: not architecturally) improve processors...
 

dr_hoads

Distinguished
Nov 12, 2009
7
0
18,510
It may just be me, but I can't say that I am very excited about an 18-billion-transistor CPU if it is only 2x faster than a 1-billion-transistor one. There must be something else interesting going on here that I don't know about... Or, as HibyPrime says... something scary!! :)
 



Keep in mind that for a given die area, cache uses much less power and produces much less heat than cores. A 5-core i7 with some of the cache removed (for a total die size identical to the current i7) would use more power and run hotter.
 

HibyPrime

Distinguished
Nov 23, 2006
263
0
18,790


Ahhh, and there's our answer.
 

dr_hoads

Distinguished
Nov 12, 2009
7
0
18,510


I did not know that cache does not get as hot; that makes a lot of sense. Thank you! It does, however, make me want to ask questions about Larrabee (I think the premise of that one is to have a ton of small cores), but I am too tired at the moment to read up on it on Wikipedia. :p
 

rodney_ws

Splendid
Dec 29, 2005
3,819
0
22,810
I agree... I think the internal caches on modern CPUs are causing the transistor count to rise faster than our performance gains. The original Pentium had what... 8KB of L1 on the CPU? Back then the L2 was still on the motherboard and L3 was just the stuff of dreams. It is interesting to think about, though. Although I go back to the TRS-80/VIC-20 days, I recall that my 486 had around 2 million transistors... now my GPU has 1,000 times that amount. My, how things change!
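For a rough sense of how much of the transistor budget cache eats, here is a quick estimate assuming the classic 6-transistor SRAM cell and ignoring tags, ECC, and control logic (cache sizes are the commonly quoted ones for these chips):

def sram_transistors(cache_bytes, per_bit=6):
    # A standard 6T SRAM cell stores one bit with six transistors;
    # real caches add tag arrays and control logic, so this is a floor.
    return cache_bytes * 8 * per_bit

for label, size in [("original Pentium 8KB L1 (data)", 8 * 1024),
                    ("Wolfdale 6MB L2", 6 * 1024 * 1024),
                    ("Core i7 8MB L3", 8 * 1024 * 1024)]:
    print(f"{label}: ~{sram_transistors(size) / 1e6:.1f} million transistors")

By that estimate, something like 300 million of a Wolfdale die's ~410 million transistors could plausibly be the shared L2 alone, which fits the point above.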
 
For decades designers used the extra transistors that became available every year (courtesy of Moore's law) to improve performance through the use of instruction level parallelism. This is the reason every modern CPU uses a pipeline - the purpose of a pipeline is to be able to do work on many instructions at the same time in parallel, much the same way that an automobile assembly line can be building lots of cars at the same time, all in different stages of assembly.
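To make the assembly-line point concrete, here is a toy cycle-count model (idealized five-stage pipeline, no stalls or hazards, purely illustrative):

def cycles(instructions, stages=5, pipelined=True):
    if pipelined:
        # Once the pipeline is full, one instruction completes every cycle.
        return stages + (instructions - 1)
    # Without a pipeline, each instruction occupies the whole machine
    # for `stages` cycles before the next one can even start.
    return stages * instructions

n = 1_000
print("no pipeline:", cycles(n, pipelined=False))  # 5000 cycles
print("pipelined:  ", cycles(n, pipelined=True))   # 1004 cycles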

Within the last 10 years or so what's changed is that the engineers have basically run out of slick tricks to get more parallelism out of a single stream of instructions. After pipelines and caches came more and more esoteric techniques such as branch prediction, superscalar execution, macro-operation fusion, register renaming, and speculative execution.

...and that's about it. Since around the time of the Pentium 4, there really haven't been any new ways to get more parallelism from a single instruction stream. All that's happened is larger caches, wider data paths, and lots of tweaks to make sure the pipelines are highly utilized. The biggest new architectural ideas came with Itanium (explicit parallelism, predicated instructions), but those can't be applied to the x86 architecture because they break backward compatibility.

So designers basically ran out of ideas for using the extra transistors and have now turned to replicating entire cores and increasing cache sizes. Unless someone gets a really bright idea, that's probably the way it's going to be for a while.

It may be that any future performance breakthroughs will have to come from new compilers that are able to take procedural code and produce multi-threaded workloads that can take advantage of the burgeoning number of cores the hardware guys are throwing at us...
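As a toy example of the kind of transformation that last paragraph hopes compilers will eventually do automatically, here is one serial loop split by hand across four cores (Python's multiprocessing; the workload is invented purely for illustration):

from multiprocessing import Pool

def work(chunk):
    # Stand-in for a CPU-bound slice of what used to be one serial loop.
    return sum(i * i for i in chunk)

if __name__ == "__main__":
    n, cores = 10_000_000, 4
    chunks = [range(start, n, cores) for start in range(cores)]  # interleaved split
    with Pool(processes=cores) as pool:
        total = sum(pool.map(work, chunks))           # runs across 4 cores
    print(total == sum(i * i for i in range(n)))      # same answer as the serial loop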
 
Solution

dr_hoads

Distinguished
Nov 12, 2009
7
0
18,510
Thanks everyone for the interesting conversation... So to summarize:

1.) There is no inherent degradation in transistor manufacturing. If you built an old CPU on today's process, performance would be relatively the same.

2.) There are no new ways to extract more parallelism from a single instruction stream, so designers add cache, features, and cores instead.

3.) I am the only person interested in that statistic... lol... (I still find tracking it interesting, even given these answers.)

Thank you all for helping me understand this.