Tom's Hardware > Forum > CPU & Components > CPUs > [Solved] Performance Per Transistor

[Solved] Performance Per Transistor

Best answer from sminlal.

Hi All,

I had a few questions for the experts here at Tom's.

For no apparent reason, I became interested in the single-thread computational performance of various CPUs (i.e., is a 3GHz P4 faster than one core of a 2GHz C2D?). Well, it quickly became apparent that the answer was no... lol.

It did, however, lead me to some interesting thoughts (at least interesting to me): I noticed that a P4, transistor for transistor, was more efficient than a Wolfdale C2D.

This led me to some questions that I do not know the answer to. I was hoping that someone here could enlighten me.

Why is performance per transistor going down? Is it due to physical limitations of the transistor shrinks causing a loss of efficiency, or is it purely a design and implementation issue? As an example of what I mean: if a 55 million transistor 130nm Northwood P4 were built using today's tech (45nm as I write this), so that it was a 55 million transistor 45nm P4 with all logic unchanged, would it perform less efficiently due to changes in transistor manufacture?

If the answer to the above question is no, would it not then be better to have an 8-core 3GHz Northwood P4 at 55 million 45nm transistors per core than to have two 205 million transistor 45nm Wolfdale cores? Would the "new" P4 then have 2x the overall computational performance?
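To put rough numbers on that question, here is a minimal Python sketch. The transistor counts are the ones quoted above. The `per_core_ratio` argument (how fast one Northwood core is relative to one Wolfdale core) is a hypothetical input, not a measurement, and the model assumes a workload that scales perfectly across every core.

```python
# Per-core transistor counts as quoted in the post above.
NORTHWOOD_TRANSISTORS = 55_000_000
WOLFDALE_TRANSISTORS_PER_CORE = 205_000_000


def relative_throughput(per_core_ratio: float,
                        small_cores: int = 8,
                        big_cores: int = 2) -> float:
    """Throughput of the many-small-core chip relative to the few-big-core chip.

    per_core_ratio is a hypothetical value: single-thread speed of one
    Northwood core divided by that of one Wolfdale core.
    Assumes the workload scales perfectly across all cores.
    """
    return (small_cores * per_core_ratio) / big_cores


eight_northwoods = 8 * NORTHWOOD_TRANSISTORS
two_wolfdales = 2 * WOLFDALE_TRANSISTORS_PER_CORE

print(f"8 x Northwood: {eight_northwoods / 1e6:.0f}M transistors")
print(f"2 x Wolfdale:  {two_wolfdales / 1e6:.0f}M transistors")

# Sweep a few hypothetical per-core ratios.
for ratio in (0.25, 0.5, 1.0):
    print(f"per-core ratio {ratio:.2f} -> "
          f"{relative_throughput(ratio):.1f}x total throughput")
```

Under these assumptions the two designs use a similar transistor budget, and the 8-core chip only reaches 2x the throughput if each Northwood core is at least half as fast as a Wolfdale core. At a quarter of the speed it merely breaks even.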

Anyway, I would love to understand this, and I wrote a little benchmark program with some charts of my CPU's I have at home and work. I would also love to have more data from my program from OLD CPU's, PII's and P-MMX etc. if anyone has that kind of stuff lying around.

You can download and run my small program from http://astro.temple.edu/~drhoads/cpu if you want to help me collect data.

Thanks to anyone for their answers or additional data.


It's fairly easy to test. Since 55 million x 8 is so close to 410 million (two Wolfdale cores at 205 million each), it stands to reason that the Wolfdale CPU would have to be just under 8X faster than a Northwood to match it per transistor. That said, when the Conroe architecture first came out, I don't remember it being nearly that much faster, even in the best-case scenarios.

The addition of processor features is most likely the reason for this (I'm guessing). I can't see any other reason why so many engineers would overlook something that stupidly obvious.
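The break-even arithmetic can be sketched in a few lines of Python. It uses the per-core counts from the thread (55 million for Northwood, 205 million per Wolfdale core, so 410 million for the dual-core chip). The 2.0x speedup fed in at the end is a hypothetical placeholder, not a benchmark result.

```python
def break_even_speedup(old_transistors: int, new_transistors: int) -> float:
    """Speedup the new chip needs just to match the old chip's
    performance per transistor."""
    return new_transistors / old_transistors


def relative_perf_per_transistor(speedup: float,
                                 old_transistors: int,
                                 new_transistors: int) -> float:
    """Performance per transistor of the new chip relative to the old one.

    1.0 means parity; below 1.0 means the new chip does less work
    per transistor than the old one.
    """
    return speedup / break_even_speedup(old_transistors, new_transistors)


northwood = 55_000_000
wolfdale_dual_core = 2 * 205_000_000

needed = break_even_speedup(northwood, wolfdale_dual_core)
print(f"Break-even speedup: {needed:.2f}x")

# Hypothetical speedup, for illustration only.
assumed_speedup = 2.0
ratio = relative_perf_per_transistor(assumed_speedup, northwood, wolfdale_dual_core)
print(f"At {assumed_speedup:.1f}x faster, perf per transistor is {ratio:.2f} of the old chip")
```

Swap in your own measured speedup for `assumed_speedup` to see where a given pair of chips lands.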

------------------------------ Phenom II x3 710 @ x4 2.93GHz
3GB DDR2 600
Radeon HD 4830 512mb
Reply to HibyPrime

HibyPrime wrote :

It's fairly easy to test. Since 55 million x 8 is so close to 410 million (two Wolfdale cores at 205 million each), it stands to reason that the Wolfdale CPU would have to be just under 8X faster than a Northwood to match it per transistor. That said, when the Conroe architecture first came out, I don't remember it being nearly that much faster, even in the best-case scenarios.

The addition of processor features is most likely the reason for this (I'm guessing). I can't see any other reason why so many engineers would overlook something that stupidly obvious.



I suppose that makes sense if it is the addition of more cache, etc. Why, then, is it better to add more cache and features than to include another core? Is raw data processing less important in modern computing for some reason, so that more cache and other features are more valuable?

Reply to dr_hoads

I'm not sure.

A quick glance at this: http://techreport.com/r.x/core-i7/die-callout.jpg shows that there is quite probably enough room on the Core i7 to remove half its cache and add a 5th core. I can't see the i7 performing that much worse on 4MB of cache.

It might have something to do with not wanting to use hugely different dies for the server variants, because I'd imagine there are a lot of commercial apps that would use the extra cache.

Reply to HibyPrime

Well, let's see:
Going from 32- to 64-bit words roughly doubles the transistors in the registers and integer data paths, at little gain in "efficiency" as measured in MIPS.

An improved instruction set will increase the transistor count without necessarily increasing efficiency.

And, besides you, who cares about transistor count divided by <whatever> as a measure of efficiency? :)

Reply to jsc

We're going to need to start caring soon, right about when we hit a wall at 300W TDP @ 18nm and there's no way to improve on the power or size. I suppose power consumption could continue to climb as they add more transistors, but I suspect consumers will start getting ticked off if they go much past 250-300W per silicon chip.

It's actually kind of scary how close we are to not being able to physically (read: not architecturally) improve processors...

Reply to HibyPrime

It's not about transistor count, it's about CPU architecture. The most distinct difference is that the C2D has a MUCH shorter pipeline than the P4.

------------------------------ macgirlfriend:
"Hey I don't get you people, the people on insanely mac were so much nicer"
Reply to skittle

It may just be me, but I can't say that I am very excited about an 18 billion transistor CPU if it is only 2x faster than a 1 billion one. There must be something else interesting going on here that I don't know about... Or, as HibyPrime says... something scary!! :-)

Reply to dr_hoads

HibyPrime wrote :

I'm not sure.

A quick glance at this: http://techreport.com/r.x/core-i7/die-callout.jpg shows that there is quite probably enough room on the Core i7 to remove half its cache and add a 5th core. I can't see the i7 performing that much worse on 4MB of cache.

It might have something to do with not wanting to use hugely different dies for the server variants, because I'd imagine there are a lot of commercial apps that would use the extra cache.

Keep in mind that for a given die size, cache uses much less power and makes much less heat than cores. A 5 core i7 with some of the cache removed (for a total die size identical to the current i7) would use more power and run hotter.
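A toy model of that trade-off, as a rough Python sketch. The power densities below are made-up placeholders in arbitrary units, chosen only so that core logic is denser in power than cache. They are not real figures for the i7 or any other chip.

```python
def power_change_from_swap(area_moved: float,
                           core_power_density: float,
                           cache_power_density: float) -> float:
    """Change in total die power when area_moved of cache is replaced
    by core logic, with the total die size held constant.

    Densities are power per unit area, in arbitrary units.
    """
    return area_moved * (core_power_density - cache_power_density)


# Hypothetical placeholder values, for illustration only.
CORE_DENSITY = 1.0    # arbitrary units per unit area
CACHE_DENSITY = 0.1   # assumed much lower than core logic

delta = power_change_from_swap(area_moved=10.0,
                               core_power_density=CORE_DENSITY,
                               cache_power_density=CACHE_DENSITY)
print(f"Power change from swapping 10 units of cache for core: {delta:+.1f}")
```

As long as core logic draws more power per unit area than cache, any cache-for-core swap at a fixed die size comes out positive, which is the point being made above.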

Message edited by cjl on 11-13-2009 at 06:06:29 AM
------------------------------ Asus P6T deluxe
i7 965 @ 4.2GHz (200*21), 1.384V
12GB Corsair Dominator DDR3-1600 CAS 7
Reply to cjl

cjl wrote :

Keep in mind that for a given die size, cache uses much less power and makes much less heat than cores. A 5 core i7 with some of the cache removed (for a total die size identical to the current i7) would use more power and run hotter.



Ahhh, and there's our answer.

Reply to HibyPrime

cjl wrote :

Keep in mind that for a given die size, cache uses much less power and makes much less heat than cores. A 5 core i7 with some of the cache removed (for a total die size identical to the current i7) would use more power and run hotter.



I did not know that the cache does not get as hot; that makes a lot of sense. Thank you! It does, however, make me want to ask questions about Larrabee (I think the premise of that one is to have a ton of small cores), but I am too tired at the moment to read up on it on Wikipedia. :-p

Reply to dr_hoads

I agree... I think the internal caches on modern CPUs are causing the transistor count to rise quicker than our performance gains. The original Pentium had what... 8k of L1 on the CPU? Back then the L2 was still on the motherboard and L3 was just the stuff of dreams. It is interesting to think about, though. Although I go back to the TRS-80/VIC-20, I recall that my 486 had around 2 million transistors... now my GPU has 1000 times that amount. My, how things change!

Reply to rodney_ws
Best answer

For decades designers used the extra transistors that became available every year (courtesy of Moore's law) to improve performance through instruction-level parallelism. This is the reason every modern CPU uses a pipeline: the purpose of a pipeline is to work on many instructions at the same time, much the same way that an automobile assembly line can be building lots of cars at once, all in different stages of assembly.
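The assembly-line effect can be shown with an idealized cycle count in Python. This is a textbook-style sketch that assumes every stage takes one cycle and there are no stalls, hazards, or branch mispredictions. Real pipelines fall short of this.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """Ideal pipeline: the first instruction takes n_stages cycles to finish,
    then one more instruction completes on every following cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n, stages = 100, 5
slow = cycles_unpipelined(n, stages)
fast = cycles_pipelined(n, stages)
print(f"Unpipelined: {slow} cycles")
print(f"Pipelined:   {fast} cycles")
print(f"Speedup:     {slow / fast:.2f}x (approaches {stages}x for long instruction streams)")
```

For a long stream of instructions the ideal speedup approaches the number of stages, which is why deeper pipelines looked attractive until stalls and mispredictions started eating the gains.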

Within the last 10 years or so what's changed is that the engineers have basically run out of slick tricks to get more parallelism out of a single stream of instructions. After pipelines and caches came more and more esoteric techniques such as branch prediction, superscalar execution, macro-operation fusion, register renaming, and speculative execution.

...and that's about it. Since around the time of the Pentium 4, there really haven't been any new ways to get more parallelism from a single instruction stream. All that's happened is larger caches, wider data paths and lots of tweaks to make sure the pipelines are highly utilized. The biggest new architectural ideas came with Itanium (explicit parallelism, predicated instructions), but those can't be applied to the x86 architecture because they break backward compatibility.

So designers basically ran out of ideas for using the extra transistors and have now turned to replicating entire cores and increasing cache sizes. Unless someone gets a really bright idea, that's probably the way it's going to be for a while.

It may be that any future performance breakthroughs will have to come from new compilers that are able to take procedural code and produce multi-threaded workloads that can take advantage of the burgeoning number of cores the hardware guys are throwing at us...
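One standard way to see why extra cores only pay off when the software is parallel is Amdahl's law. It is not something the post names, so treat this Python sketch as an aside. `parallel_fraction` is a hypothetical property of the workload, not something measured here.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: overall speedup on `cores` cores when only
    `parallel_fraction` of the work can be spread across them."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Sweep a few hypothetical workloads on an 8-core chip.
for fraction in (0.0, 0.5, 0.9, 1.0):
    print(f"{fraction:.0%} parallel on 8 cores -> "
          f"{amdahl_speedup(fraction, 8):.2f}x")
```

A workload that is only half parallel gets less than 2x out of 8 cores, which is why the burden shifts to compilers and programmers to expose more parallel work.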


Message edited by sminlal on 11-13-2009 at 09:14:59 AM
Reply to sminlal

Thanks everyone for the interesting conversation... So to summarize:

1.) There is no inherent degradation in transistor manufacture. If you built an old CPU on today's process, performance would be roughly the same.

2.) No new ways to add more parallelism, so they add features instead.

3.) I am the only person interested in that statistic.. lol... (I still find tracking it interesting even given these answers)

Thank you all for helping me understand this.

Reply to dr_hoads