
Bulldozer's IPC Death- "1000 little cuts"


Now that Bulldozer is out in the wild, and we've had a few days to digest what it is, this thread will be about Bulldozer's foundation and what it might lead to.

I have two articles for everyone to read before they start the flamewars ;) :
http://semiaccurate.com/2011/10/17 [...] nderwhelm/
http://semiaccurate.com/2011/10/17 [...] e-problem/
---------------
It seems that Bulldozer is a very unfinished product as many forum members have been saying.

There are many quirks with Bulldozer that I do not like. Charlie seems to have the same idea on a few things as I do.

The decode stage is wider, with 4 decoders per module to feed the cores, but decoded instructions per core are down: 3 in Phenom II versus 2 in Bulldozer, since two cores share those 4 decoders. I thought this would be no big deal thanks to improvements elsewhere, but that didn't happen. Only two integer math instructions can execute at once in a core, compared to a maximum of 3 in Phenom II. Usually this isn't a big deal, since a program might not use that third integer pipe, but it can hurt performance in situations where it does.

L1 and L2 cache latency is up, and not just a little. L1 is up 1 cycle, a 33% increase in latency for a smaller L1 cache. L2 is around 25-27 cycles, a nearly 70% increase over Phenom II! What were they thinking, allowing these terrible latencies in Bulldozer's design? I agree with Charlie: they should have made the L2 1MB with lower latency, and used the L3 as spillover for anything that doesn't fit, much like Sandy Bridge.
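
To put rough numbers on those percentages, here is a quick sketch. The baselines are assumptions: a 3-cycle L1 is implied by "1 cycle = 33%", and a 15-cycle Phenom II L2 is my guess to match the "near 70%" figure; neither baseline is stated outright in this post.

```python
def pct_increase(old_cycles, new_cycles):
    """Percentage increase in latency going from old_cycles to new_cycles."""
    return (new_cycles - old_cycles) / old_cycles * 100.0

# Assumed baselines (not stated in the post): 3-cycle L1, 15-cycle L2.
PHENOM_L1 = 3
PHENOM_L2 = 15

# Bulldozer figures taken from the post: L1 up by 1 cycle, L2 at 25-27 cycles.
print(f"L1: {pct_increase(PHENOM_L1, PHENOM_L1 + 1):.0f}% higher latency")
for bd_l2 in (25, 27):
    print(f"L2 at {bd_l2} cycles: {pct_increase(PHENOM_L2, bd_l2):.0f}% higher latency")
```

With those assumed baselines the L2 lands somewhere between roughly two thirds and four fifths slower, so "near 70%" is the friendly end of the range.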

It also appears that Glofo isn't totally to blame for Bulldozer's woes on clockspeed and power. There needs to be some fine tuning to the architecture before we see what Bulldozer was really meant to be speed wise; Piledriver might just be this finely tuned Bulldozer.
------
What does everyone think about Bulldozer?

I would like to talk about the architecture if anyone wants to get down and dirty with me. :sol:

Reply to Haserath

If parts of an arch were designed to do their job and shut down, but instead have to work harder, temps and clocks can be affected.
The cache cycles don't look good, and it's slow anyway. Both Llano and BD have the same number of cycles on the L2, and both have reached the knee of their power/perf curve; it's a lot of uphill from there for power and heat as the perf curve drops off.

------------------------------ If we lose this freedom of ours, history will record with the greatest astonishment, those who had the most to lose, did the least to prevent its happening
Reply to JAYDEEJOHN

I will say this once: keep it clean, people. No name calling, attacks, or going majorly off topic. I am not too worried about off topic; it's the other stuff that annoys me. We are all adults (well, maybe not all of us, but we should act it), so act like it. I'm tired of threads getting closed. I will try to stay on top of it, but so help me Jebus......

With that out of the way, I agree. For some reason AMD wants a large L2 cache that's slower. Intel dropped the large L2 cache with Nehalem, as they have a large L3 cache instead. And I don't think they would need a fillover L3; rather, they could easily do what Intel does and use the larger L3 as a buffer for the instructions, so the core doesn't have to fetch them again from the much slower system RAM if it needs them again.

Overall, BD is not impressive, and that's because of the hype. But I can see the potential in the idea if AMD can pull a "Deneb" with it. Still, I fear it won't push Intel enough.

No, GF is not the only one to blame, as they sort of "inherited" AMD's 32nm process. I also think that Piledriver will be the BD that was meant to be. Too bad it will have to face SB-E and IB instead of SB.

Reply to jimmysmitty

Nice assessment
There are differences between the 2 chips, which is probably why the cache size/hierarchy is different.
A lot can be done, I agree, and remember, this is first-gen 32nm, a whole new chip, and a whole new approach.
I think they'll find that what looks good in sims is different from real usage, and those adjustments will come.

------------------------------ If we lose this freedom of ours, history will record with the greatest astonishment, those who had the most to lose, did the least to prevent its happening
Reply to JAYDEEJOHN

JAYDEEJOHN wrote :

If things designed in an arch were made to do their job and shut down, but instead, it has to work harder, temps and clocks can be effected.
The cache cycles dont look good, and its slow anyways, and both Llano and BD have the same amount of cycles on the L2, and both have reached their curve power/perf, and its alot of uphill from there for power and heat, as the perf curve drops off


I didn't know Llano had that much of an increase in L2 latency over Phenom II. Llano barely outperforms Athlon II for the huge increase in transistors though. What is the point of doubling the cache if you also nearly double the latency? :pfff:

Reply to Haserath

jimmysmitty wrote :

I will say this once, keep it clean people. No name calling, attacks or majorly off topic. I am not too worried about off topic, its the other stuff that annoys me. We are all adults (well maybe not all of us but we should act it), act like it. Tired of threads getting closed. I will try to stay on top of it but so help me Jebus......

With that out of the way, I agree. For some reason AMD wants a large L2 cache thats slower. Intel dropped large L2 cache with Nehalem as they have a large L3 cache instead. And I don't think they would need a fillover L3 but more that they could easily do what Intel does, use the larger L3 as a buffer for the instructions so it doesn't have to recall it from the much slower system RAM if it needs it again.

Overall, BD is not impressive. Thats because of the hype. But I can see the potential in the idea if AMD can pull a "Deneb" with it. Still I fear it wont push Intel enough.

No GF is not the only one to blame as they sort of "inherited" AMDs 32nm process. I also think that Piledriver will be the BD that was meant to be, too bad it will have to face SB-E and IB instead of SB.


Sandy Bridge's L3 cache latency is 26-31 cycles v. Bulldozer's L2 cache latency of 25-27 cycles. Sure, it's nearly the same latency, but Sandy Bridge's cache level is one higher!

So a single core has to go through the faster L1, the faster L2, then the L3 cache in Sandy Bridge. Bulldozer only hits its fast L1 before hitting a slow L2; they might as well shoot themselves in the foot right there.
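
A back-of-the-envelope sketch of that comparison, looking only at what an L1 miss costs on average. The 12-cycle Sandy Bridge L2 latency is an assumption (it isn't quoted in this thread), the hit rates are made up for illustration, and the model assumes the data is always found in Bulldozer's L2 or, at worst, in Sandy Bridge's L3.

```python
SB_L2_CYCLES = 12  # assumed Sandy Bridge L2 latency, not from this thread
SB_L3_CYCLES = 28  # roughly the middle of the 26-31 cycle range
BD_L2_CYCLES = 26  # roughly the middle of the 25-27 cycle range

def sb_l1_miss_cost(l2_hit_rate):
    """Average cycles for a Sandy Bridge L1 miss served by L2 or, failing that, L3."""
    return l2_hit_rate * SB_L2_CYCLES + (1.0 - l2_hit_rate) * SB_L3_CYCLES

# Hypothetical L2 hit rates, for illustration only.
for hit_rate in (0.0, 0.5, 0.8):
    cost = sb_l1_miss_cost(hit_rate)
    print(f"SB L2 hit rate {hit_rate:.0%}: {cost:.1f} cycles vs. Bulldozer's flat {BD_L2_CYCLES}")
```

Under these assumptions Sandy Bridge only needs its L2 to catch about one in eight L1 misses to break even with Bulldozer's L2; anything better and it pulls ahead.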

I'm wondering why they made the L2 so big. The core would need good prediction to have small caches with high hit rates. I have seen what branching can do to Bulldozer's performance, and it's not pretty. I don't even think better prediction would increase performance all that much unless it was excellent. The only real increase on per core performance could be predicting instructions to fit into the L1 before they're needed, but the L1 is so small that it would be tough to keep getting correct predictions. Once you're at the L2, you're already done; no performance can be gained through prediction.

I don't think Bulldozer's foundation is strong enough to push Intel. They went for faster frequencies, but with a power/heat wall in their way.

Edit: grammar nazi :)

Message edited by Haserath on 10-18-2011 at 04:55:52 AM
Reply to Haserath

Good analysis. My thoughts are that they should scrap Bulldozer and go back to updating their Phenom II core. Add faster cache and new SSE instructions, maybe add a couple more cores if they wanted to, make it on the smaller 32nm process, and you would have a better CPU than Bulldozer. They could then keep working on the Bulldozer core in the background instead of selling the public engineering samples.

------------------------------ INTEL CORE 2 Q6600 @ 3.49GHz, CM Hyper TX3, ASUS P5N-D, 8GB DDR800 RAM, Powercolour HD6850, 650w Antec Trupower New PSU
Reply to iam2thecrowe

iam2thecrowe wrote :

good analysis. My thoughts are that they should scrap bulldozer and go back to updating their Phenom II core. Add faster cache, new sse instructions, could add a couple more cores if they wanted to, and make it in the smaller 32nm porcess, and you would have a better cpu than bulldozer. They could then keep working on the bulldozer core in the background instead of selling the public engineering samples.


That would in fact improve performance over Bulldozer.

Increase L3 size while decreasing the cache latencies. Improve prediction for the core. Make the cache inclusive rather than exclusive. Increase core count to 8. Bam, just made a Phenom III that outperforms Bulldozer.

Reply to Haserath

Haserath wrote :

That would in fact improve performance over Bulldozer.

Increase L3 size while decreasing the cache latencies. Improve prediction for the core. Make the cache inclusive v. exclusive. Increase core count to 8. Bam, just made a Phenom III that outperforms Bulldozer.

The question in my mind is why did the cache latencies increase so much? It must be something to do with the fabrication technology, because logically you wouldn't want to design a chip for high frequencies, then hobble it by increasing cache latencies. So they must have had no choice???

Reply to sonoran

When do people think Piledriver variants will be released?

Reply to Chad Boga

Chad Boga wrote :

When do people think Piledriver variants will be released?


Trinity is supposed to be out in Q1 2012. Hopefully it's released then and hopefully it performs better.

Quote :

The question in my mind is why did the cache latencies increase so much? It must be something to do with the fabrication technology, because logically you wouldn't want to design a chip for high frequencies, then hobble it by increasing cache latencies. So they must have had no choice???


I'm not exactly sure why. It doesn't make sense to me at all. The Pentium 4 was left behind due to the flawed concept that higher frequency was better. Sacrificing IPC gains means sacrificing efficiency.

Reply to Haserath

Do you guys think that cache latency was increased due to the increase in die size?
In SB, due to its smaller die size, everything's cramped together, and maybe that's why the latency is low. (?)

I'm no pro, but I think having only 4 floating point modules is to blame. I use resource-heavy applications like Maya and 3ds Max, and these applications use FP for almost everything. Even media encoders use FP (I'm guessing).
So technically, for these applications, it's like having only four cores.
For such applications BD is nothing but a four-core hyper-threaded (4c/8t) CPU.

Anyway,
1. The six-core BD could outperform the eight-core BD if they physically removed the disabled cores and put another 2 FP modules there. I know it can't be done overnight, but it's still food for thought. :D

2. AMD stays ahead of Intel most of the time when it comes to introducing new stuff in CPUs, like the 64-bit architecture, a good dual-core implementation, etc. But was it really necessary to have 128-bit FP modules? How many applications go far enough to use 128 bits? Do SSE2 and SSE3 have the capability to use that? (I'm asking because I have no idea.) I think they sacrificed too much of today's CPU foundation just to make it future-ready.

Reply to lockhrt999

iam2thecrowe wrote :

My thoughts are that they should scrap bulldozer and go back to updating their Phenom II core.


Too late to do that now. BD is here. And after so many delays, the pressure to release something, anything, has got to be incredible.

Reply to jsc

jsc wrote :

Too late to do that now. BD is here. And after so many delays, the pressure to release something, anything, has got to be incredible.



AMD's Bulldozer is like the original motor cars: unreliable and not clearly better than the horse, but the motor car (Bulldozer) can be tuned into something that beats the horse (K10.5).

------------------------------ Dying
Is an art, like everything else.
I do it exceptionally well.
-Sylvia Plath, Lady Lazarus
Reply to amdfangirl

Haserath wrote :

I didn't know Llano had that much of an increase in L2 latency over Phenom II. Llano barely outperforms Athlon II for the huge increase in transistors though. What is the point of doubling the cache if you also nearly double the latency? :pfff:



Because dies with complex GPUs don't scale well?

Haserath wrote :

Sandy Bridges L3 cache latency- 26-31 cycles v. Bulldozer's L2 cache latency 25-27 cycles. Sure it's nearly the same latency, but Sandy Bridge's cache level is one higher!

So a single core has to go through the faster L1, the faster L2, then the L3 cache in Sandy Bridge. Bulldozer only hits its fast L1 before hitting a slow L2; they might as well shoot themselves in the foot right there.

I'm wondering why they made the L2 so big. The core would need good prediction to have small caches with high hit rates. I have seen what branching can do to Bulldozer's performance, and it's not pretty. I don't even think better prediction would increase performance all that much unless it was excellent. The only real increase on per core performance could be predicting instructions to fit into the L1 before they're needed, but the L1 is so small that it would be tough to keep getting correct predictions. Once you're at the L2, you're already done; no performance can be gained through prediction.

I don't think Bulldozer's foundation is strong enough to push Intel. They went for faster frequencies, but with a power/heat wall in their way.

Edit: grammar nazi :)



Yes, it is a real shocker. Probably to accommodate the predictor.

AMD never had a finely tuned one in comparison to Intel. For the Pentium 4, it was essential to survival, and Intel's inheritance of such a good predictor makes the Core arch a brilliant feat of tech.

------------------------------ Dying
Is an art, like everything else.
I do it exceptionally well.
-Sylvia Plath, Lady Lazarus
Reply to amdfangirl

Haserath wrote :

Trinity is supposed to be out in Q1 2012. Hopefully it's released then and hopefully it performs better.

Quote :

The question in my mind is why did the cache latencies increase so much? It must be something to do with the fabrication technology, because logically you wouldn't want to design a chip for high frequencies, then hobble it by increasing cache latencies. So they must have had no choice???


I'm not exactly sure why. It doesn't make sense to me at all. The pentium 4 was left behind due to a flawed concept that higher frequency was better. Sacrificing IPC gain means sacrificing efficiency.



I have heard Q3 2012 somewhere. Either way, Trinity and Piledriver will meet IB head on, though I'm not sure how they will do performance-wise.

I think the higher latencies actually stem from the longer pipeline. Of course I am no expert; I have been through many tech classes that talk about this, but the math always escaped me.

As for the Pentium 4, don't shun it too much. Intel took some things from NetBurst and put them towards the Core uarch and even SB (the Pentium 4 had a trace cache, much like the micro-op cache SB has).

amdfangirl wrote :

Because dies with complex GPUs don't scale well?



Yes, it is a real shocker. Probably to accomodate the predictor.

AMD never had a fine tuned one in comparison to Intel. For the Pentium 4, it was essential to survival, Intel's inheritance of such a good predictor makes the Core arch a brillant feat of tech.



I would say K8 was fine tuned. Not sure they could have sucked more out of it if they tried.

------------------------------ http://valid.canardpc.com/cache/banner/2290513.png
Reply to jimmysmitty

As bad as it may sound, this has similarities to the TLB bug (to me anyway). In 6 months they may have it fixed (just like the TLB bug). It seems to me that they rushed it out before they could fix the pipeline/latency/scaling issues. They are fighting and rushing their own hype (overhyped BD, then a rushed release, just like Phenom I). BD is a great concept from AMD, but there was so much more they could have done before release.

I'd rather have the proc delayed and gotten right than rushed and turned into a CF.

Reply to IH8U

lockhrt999 wrote :

Do you guys think that cache latency was increased due to increased in die size?
In SB due to it's smaller die size everything's cramped together and may be that's why the latency is low. (?)

I'm no pro but I think having only 4 Floating Point calculation modules is what to blame. I use resources heavy applications like Maya, 3ds Max and these applications use FPs for almost everything. Even media encoders use FPs (I'm guessing).
So technically for these applications, it's like having only four cores.
For such application bd is nothing but a four core hyper-threaded (4c/8t) CPU.

Anyway,
1. The six-core BD could out perform eight-core BD, if they physically remove disabled cores and put another 2 FP modules there. I know it can't be done overnight but still a food for thought. :D

2. AMD, most of the time stays ahead of Intel when it comes to introducing new stuff in CPUs like 64 bit architecture, good dual core implementation etc,. But was really necessary to have 128-bit FP modules ? How many applications go that far to use 128 bit architecture? Do SSE 2, 3 have capability to use that? (I'm asking because I've no idea). I think they sacrificed too much of the today's CPU foundation just to make it future-ready.


I believe it's due to the increase in pipeline depth and the size of the caches being increased.

1. Integer perf would go below a Thuban though!

2. 128-bit is actually used quite often in SSE enabled programs. I think it mostly gangs together 32-bit or 64-bit FP operations and executes them at the same time. It's AVX's 256-bit operations that won't be used for quite some time.
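
To show what "ganging together" looks like, here is a minimal sketch that models a 128-bit SSE register as a byte image. It only illustrates lane counts and a lane-wise add in plain Python; it is not real SIMD code, and the helper names (`lanes`, `packed_add`) are made up for this example.

```python
import struct

REGISTER_BITS = 128  # width of one SSE (XMM) register

def lanes(element_bits):
    """How many elements of the given width fit in one 128-bit register."""
    return REGISTER_BITS // element_bits

def packed_add(a, b):
    """Model a packed add: every lane is added as part of one operation."""
    return [x + y for x, y in zip(a, b)]

singles = [1.0, 2.0, 3.0, 4.0]
register_image = struct.pack("<4f", *singles)  # four 32-bit floats, little-endian

print(len(register_image) * 8, "bits used by four singles")
print(lanes(32), "single-precision lanes,", lanes(64), "double-precision lanes")
print(packed_add(singles, [10.0, 20.0, 30.0, 40.0]))
```

So a "128-bit" FP operation is four 32-bit or two 64-bit operations done in one go, not a single 128-bit number.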

Quote :

Because dies with complex GPUs don't scale well?


Maybe... but even with the GPU off it performs just slightly better than the Athlon II. It would've been about the same if they had just left it like an Athlon II.

Quote :

Yes, it is a real shocker. Probably to accomodate the predictor.

AMD never had a fine tuned one in comparison to Intel. For the Pentium 4, it was essential to survival, Intel's inheritance of such a good predictor makes the Core arch a brillant feat of tech.


Oh yes, the prediction unit really catapults Intel ahead. Prediction also needs a finely tuned arch to go with it or it is just a waste- Pentium 4 shows that.

Reply to Haserath

That explains the pipeline stalls quite well. Don't think OS/Bios patches are going to help Bulldozer there. Wait for Trinity!

Reply to Cazalan

Cazalan wrote :

That explains the pipeline stalls quite well. Don't think OS/Bios patches are going to help Bulldozer there. Wait for Trinity!


WAIT! How could anyone still be hoping to WAIT for something good from them? At best it's going to be 13% faster, according to AMD. Intel's CPUs are already more than 13% faster while only using 4 cores, and they have new CPUs launching any time now. The Bulldozer architecture is a flop; I don't think any amount of tweaking can help it. They should just throw it away, start again, and sell higher-clocked Phenom IIs for cheap.

------------------------------ INTEL CORE 2 Q6600 @ 3.49GHz, CM Hyper TX3, ASUS P5N-D, 8GB DDR800 RAM, Powercolour HD6850, 650w Antec Trupower New PSU
Reply to iam2thecrowe

The only thing Trinity has going for it is an IGP that doesn't suck.

With the possibility it might actually be worse than Llano in performance I'd safely invest in a Sandy Bridge CPU.

Unless you're getting a laptop. Then wait for Ivy Bridge or Trinity.

------------------------------ Dying
Is an art, like everything else.
I do it exceptionally well.
-Sylvia Plath, Lady Lazarus
Reply to amdfangirl

iam2thecrowe wrote :

good analysis. My thoughts are that they should scrap bulldozer and go back to updating their Phenom II core. Add faster cache, new sse instructions, could add a couple more cores if they wanted to, and make it in the smaller 32nm porcess, and you would have a better cpu than bulldozer. They could then keep working on the bulldozer core in the background instead of selling the public engineering samples.



I sorta think AMD should try to combine the best features of P2 and BD - 8 full cores, each with a faster 1MB L2 cache, 4-issue decoders, etc. Yeah it would be bigger than 315mm^2, but AMD is used to making huge die CPUs anyway it seems. Or else hurry up 22nm.

Reply to fazers_on_stun

Assuming S/A is accurate, what I find surprising is the lack of adequate modeling tools to predict CPU performance under various loads and debug the design before committing to silicon. The second part of Demerjian's article mentions simulations only being able to model part of a CPU for a few clock cycles. So this explains why the engineers first told JF that IPC would be higher according to their simulations, then apparently backed off that statement later.

I suspect this was also the case for the infamous "Barcelona up to 40% faster than Core2 in a wide variety of workloads" statement back in 2007.

I also suspect Intel has better modeling tools available - they seem to hit their CPU performance predictions more often, ever since Conroe anyway..

Reply to fazers_on_stun

I really do think that BD is unfinished. AMD got pressured too much to release any CPU, and they had to release BD before it was even ready. I am not sure that BD will be a competitive CPU, but I am sure that the next lineup will be much better. Only time will tell, unless there is an AMD employee around here who won't mind sharing some details with us. ;)

Reply to Rizlla

There WAS an AMD rep here, but ever since the release of Bulldozer he's been MIA.

Reply to phatbuddha79

phatbuddha79 wrote :

There WAS an AMD rep here, but ever since the release of Bulldozer he's been MIA.



That was JF-AMD. He recently posted over in the Anandtech forums about all the hate mail and flames he has been getting, so he has given up posting for the time being anyway, as he does it on his own time and can only post info that has been approved by the engineering team. Which is why he posted many months ago that IPC would be better on BD than prior architecture (P2).

Reply to fazers_on_stun

fazers_on_stun wrote :

Assuming S/A is accurate, what I find surprising is the lack of adequate modeling tools to predict CPU performance under various loads and debug the design before committing to silicon. The second part of Demerjan's article mentions simulations only able to model part of a CPU for a few clock cycles. So this explains why the engineers first told JF that IPC would be higher according to their simulations, then apparently backed off that statement later.

I suspect this was also the case for the infamous "Barcelona up to 40% faster than Core2 in a wide variety of workloads" statement back in 2007.

I also suspect Intel has better modeling tools available - they seem to hit their CPU performance predictions more often, ever since Conroe anyway..



They probably didn't expect power draw to be as big a factor as it was. Put everything else aside, if power draw were lower, and BD was clocked about a gig faster, then we wouldn't be having this discussion.

I'll say it again: Power and heat are the two major limiting factors on speed.

Reply to gamerk316

fazers_on_stun wrote :

That was JF-AMD. He recently posted over in the Anandtech forums about all the hate mail and flames he has been getting, so he has given up posting for the time being anyway, as he does it on his own time and can only post info that has been approved by the engineering team. Which is why he posted many months ago that IPC would be better on BD than prior architecture (P2).



2 possible things happened. Either he lied to us, OR he was lied to.

Reply to phatbuddha79

IH8U wrote :

As bad as it may sound, this has similarities to the TLB bug (to me anyway). In 6 months they may have it fixed (just like the TLB bug), it seems to me that they rushed it out before they could fix the pipelines/latency/scaling issues. They are fighting and rushing their own hype (overhyped BD, then rushed release just like Phenom I). BD is a great concept from AMD, but there was so much more they could have done, before release.

I'd rather have the proc delayed, and gotten right. Than rushed, and turned into a CF.




There are striking similarities between Barcelona and Bulldozer at launch. I mentioned this very point just last night.

The 9600 Black Edition was a horrible part, and it launched at a price higher than the Q6600. It wouldn't clock, it was power hungry, and it was 20 to 25% slower per clock with the TLB patch.

A lot more like Bulldozer than I think most people remember. And it matured into a quality part.

Personally, I would have preferred a die-shrunk, 8-core Thuban while Bulldozer matured. But that's water under the bridge.

Reply to FALC0N

iam2thecrowe wrote :

WAIT! how could anyone still be hoping to WAIT for something good from them. At best its going to be 13% faster according to AMD. Intel's CPU's are already more than 13% faster while only using 4 cores, and they have new CPU's launching any time now. The bulldozer architecture is a flop, i dont think any amount of tweaking can help it. They should just throw it away and start again and just sell higher clocked phenom II's for cheap.



Once again, Barcelona and the x4 9600 black edition. Those who don't learn from history............

Reply to FALC0N

fazers_on_stun wrote :

 

I also suspect Intel has better modeling tools available - they seem to hit their CPU performance predictions more often, ever since Conroe anyway..

 

Remember that Intel has been using the same basic execution core since Conroe, with lots of tweaks and minor improvements along with the on-die memory controller. It's a lot easier to do that than to redesign the core. AMD had a much more difficult task.

 


Message edited by FALC0N on 10-18-2011 at 09:18:16 PM
Reply to FALC0N

What does it take to make a desktop CPU scalable? If they could just design a chipset that could accommodate two Bulldozers, only then would it be worthwhile to invest in them. I know it's not going to happen.

Workstation CPUs are very expensive. :(


Message edited by lockhrt999 on 10-18-2011 at 09:06:06 PM
Reply to lockhrt999

phatbuddha79 wrote :

2 possible things happened. Either he lied to us, OR he was lied to.



I disagree. I'm sure they thought they would have the IPC up beyond Phenom II at launch. They just didn't succeed.

Reply to FALC0N

Quote :

What does it take to make a desktop CPU scalable?



The CPU isn't the problem, software is.

Until you get a programming language that actually lends itself to parallel applications implicitly, and an OS written in said language, you will almost never reach the dream of decent scalability.

I think you are going to see GPUs come more to the forefront, simply because they are designed to handle massively parallel applications, unlike CPUs, which are still best at handling a few things at one time but at a very fast speed. They complement each other.
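
The usual way to put numbers on this is Amdahl's law. A quick sketch, where the parallel fractions are made-up examples and not measurements of any real program:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Best-case speedup over one core when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Hypothetical workloads: half parallel, mostly parallel, almost fully parallel.
for fraction in (0.5, 0.9, 0.99):
    print(f"{fraction:.0%} parallel on 8 cores: {amdahl_speedup(fraction, 8):.2f}x")
```

Unless nearly all of a program runs in parallel, extra cores give sharply diminishing returns, which is why per-core speed still matters so much on the desktop.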

Reply to gamerk316

I still think hyperthreading is better for the desktop. The processor can use all its resources for 1 thread or share resources between 2 threads.

Intel has four decoders for one core while AMD has four decoders for one module.

Intel has 3 ALU, AGUs capable of 3 stores and 2 loads, 3 SSE, and 3 FPUs for one of their cores. Bulldozer has 2 ALU, 2 AGU, and 2 FPU for each core when they aren't sharing.

Just from looking at that you can see Intel has an upper hand on resources for a single thread. It's when all integer cores on Bulldozer are being used that Bulldozer can actually come near the 2600k. The only problem is that Intel's arch is much more efficient than Bulldozer thus keeping Sandy Bridge within reach of Bulldozer even with less total resources.

Reply to Haserath

My take is that there are too many little problems, like Charlie suggests, which if left unattended can cause a lot of power and heat problems.
As for JF, things changed on him, as you could tell from his later posts, where he simply didn't address the IPC issue. By the time it was truly ready, he knew he had said too much about the planned ideals and surely couldn't back off and change what he'd said earlier. Not his fault.
It seems both Intel and AMD do extremely well with a known arch, and Intel takes a tad longer from start to finish than AMD and has more resources to throw at it as well.
That being the case, the new CEO is forming a tick-tock of sorts at AMD. Will it succeed? Well, the good thing is that the BD arch is a good one with much to build upon, which allows for familiarity, and that is something AMD does do well with.

The sim tools and time aren't AMD's friends, plus there's a slightly harder crunch coming from GF with its early problems: a snowball effect.
My 2c

------------------------------ If we lose this freedom of ours, history will record with the greatest astonishment, those who had the most to lose, did the least to prevent its happening
Reply to JAYDEEJOHN

Haserath wrote :

I still think hyperthreading is better for the desktop. The processor could use all it's resources for 1 thread or share resources between 2 threads.



As can bulldozer when the loads are properly scheduled.

Remember, hyperthreading has to be scheduled to work properly too.

Reply to FALC0N

gamerk316 wrote :

Quote :

What does it take to make a desktop CPU scalable?


The CPU isn't the problem, software is.
Until you get a programming language that actually lends itself to parallel applications implicity and an OS written in said language, you will almost never reach the dream of decent scalability.
I think you are going to see GPUs come more to the forefront, simply because they are designed to handle massivly parallel applications, unlike CPU's, which still are best handling a few things at one time, but at a very fast speed. They complement eachother.



Most of the software I use can scale up to 128 cores or more. That's why I was eagerly hoping for some kick-ass performance from BD, as I can't afford a multi-CPU workstation setup right now.
Software like Maya can run on a supercomputer with thousands of processors.
My needs are a little different.

Reply to lockhrt999

gamerk316 wrote :

They probably didn't expect power draw to be as big a factor as it was. Put everything else aside, if power draw were lower, and BD was clocked about a gig faster, then we wouldn't be having this discussion.

I'll say it again: Power and heat are the two major limiting factors on speed.



I agree. Hopefully AMD can improve the tools and get PD up to speed.

Reply to fazers_on_stun

phatbuddha79 wrote :

2 possible things happened. Either he lied to us, OR he was lied to.



I don't think it was either one. Last year, before all the problems were known, the engineers probably assured JF that IPC would improve according to their simulations. Then, as various problems surfaced, they stopped telling him that. Unfortunately, JF is limited as to what he can tell us, so I think he sorta hinted that maybe IPC wouldn't improve by discounting its importance and instead emphasizing multi-core performance. After all, he is director of server marketing, so he is going to emphasize advantages, not problems, as much as he can. You just have to read between the lines and pay attention not only to what he does say, but also to what he doesn't say (or no longer says).

Reply to fazers_on_stun

I'm sure the high latency of the cache is partly due to the manufacturing process and also due to servicing two threads. The large amount of cache was probably to make up for latency and for other minor problems with the architecture. Even so, performance could be improved if the L2 cache had a much lower latency, even if it had to be split into two smaller 512KB caches, one for each thread. The L3 cache would also have to be reworked to provide lower latency, though it seems odd to me that each module has its own L3 cache rather than a larger shared L3 the way Phenom and current Intel chips do.

Reworking the fetch and decode area as well as reworking the cache is going to be needed to raise IPC, but even then it may not be enough, and that's a pretty tall order for them. I can't even imagine what would have happened if Dirk hadn't stopped the 45nm Bulldozer parts, as they would have performed even worse and set AMD back even further.

------------------------------ Play Brutal Legend Phenom II X4 955 @3.6GHz | GIGABYTE GA-MA790X-DS4 | 4GB Mushkin DDR2 1066 | Plextor 760A | Lite-On BluRay | CF Gigabyte UD 5870x2 | WD 1TB Black| Corsair 950TX | APEVIA X-Dreamer Black | Win XP 64 & Win 7 Pro 64
Reply to megamanx00

lockhrt999 wrote :

Most of the software I use can scale up to 128 cores or more. That's why I was eagerly hoping for some kick ass performance from BD, as I can't afford a multi-CPU workstation setup right now.
Software like Maya can run on a supercomputer with thousands of processors.
My needs are a little different.





Similar to mine, it sounds like. Because Maya, etc. are hardcore apps, they will get the XOP/AVX/FMAC treatment first. That's where BD shines RIGHT NOW.

After reading several posts, MEH, not going to bother with this. I just find it amazing when people become CPU gurus. Almost like overnight.

------------------------------ Why can't men be groupies? Actresses need love too!
BaronMatrix
Reply to BaronMatrix

fazers_on_stun wrote :

I don't think either one. Last year, before all the problems were known, the engineers probably assured JF that IPC would improve according to their simulations. Then as various problems surfaced, they stopped telling him that. Unfortunately, JF is limited as to what he can tell us, so I think he sorta hinted that maybe IPC wouldn't improve by discounting its importance, and instead emphasizing multi-core performance. After all, he is director of server marketing and thus he is not going to emphasize problems but advantages as much as he can. You just have to read between the lines and pay attention not only to what he does say, but also to what he doesn't say (or no longer says).





JF knows crap all about desktop. He was always talking about SERVER.

------------------------------ Why can't men be groupies? Actresses need love too!
BaronMatrix
Reply to BaronMatrix

megamanx00 wrote :

I'm sure the high latency of the cache is partly due to the manufacturing process and also due to servicing two threads. The large amount of cache was probably to make up for latency and for other minor problems with the architecture. Even so, performance could be improved if the L2 cache had a much lower latency, even if it had to be split into two smaller 512KB caches, one for each thread. The L3 cache would also have to be reworked to provide lower latency, though it seems odd to me that each module has its own L3 cache rather than a larger shared L3 the way Phenom and current Intel chips do.

Reworking the fetch and decode area as well as reworking the cache is going to be needed to raise IPC, but even then it may not be enough, and that's a pretty tall order for them. I can't even imagine what would have happened if Dirk hadn't stopped the 45nm Bulldozer parts, as they would have performed even worse and set AMD back even further.


I looked that up, and I can't believe that each module can only use 2MB of the L3. The L3 is totally inclusive of the L2 to prevent major snoop traffic in the cores, so it helps highly threaded workloads but sucks on lightly threaded workloads. AMD needs to rethink their cache.

Perhaps the 4150 will scale a little better if each module can use 4MB then.

Dirk was wise to stop this from being released earlier; it's still not ready. Now I'm thinking even Piledriver won't fix enough flaws to make it compete against Ivy Bridge; the foundation is rotten...

I see why Cray bought up the chips though; it really is a good HPC chip meant for very parallel workloads.

CPUs aren't as good as GPUs for parallel though. They should stick with serial strength on the CPU and parallel strength on the GPU with their APU strategy.

Reply to Haserath

BaronMatrix wrote :

Similar to mine, it sounds like. Because Maya, etc. are hardcore apps, they will get the XOP/AVX/FMAC treatment first. That's where BD shines RIGHT NOW.

After reading several posts, MEH, not going to bother with this. I just find it amazing when people become CPU gurus. Almost like overnight.


AMD will definitely pull ahead when their new instructions are used, but for most of us, they won't be used much. I would love to see real-world usage with the FMAC instructions on an 8150 against a 2600K; I would bet the 8150 is at least 30% faster then.

Reply to Haserath

This discussion of the BD failings is all well and good; however, is there a realistic way to improve upon this BD architecture and have a viable desktop competitor to SB in the Windows 7, 8 market?

Reply to guskline

guskline wrote :

This discussion of the BD failings is all well and good; however, is there a realistic way to improve upon this BD architecture and have a viable desktop competitor to SB in the Windows 7, 8 market?



Of course there is. As the article so eloquently stated:

"The difference between improving IPC a bit and losing double digit percentages is a very fine line, and the devil is in the details"

That's pretty much true of the whole arch. It has real potential.

Reply to FALC0N

I noticed in another blog the announcement of a B3 revision linked to an errata page. Since the BD B2 was just released, how long will it take for a B3 release? Or will this follow the path of the original Phenom and then the Phenom II?

Reply to guskline

guskline wrote :

This discussion of the BD failings is all well and good; however, is there a realistic way to improve upon this BD architecture and have a viable desktop competitor to SB in the Windows 7, 8 market?


Quote :

Of course there is. As the article so eloquently stated:

"The difference between improving IPC a bit and losing double digit percentages is a very fine line, and the devil is in the details"

That's pretty much true of the whole arch. It has real potential.


Agreed, there are quite a few little mishaps that are in the first iteration of Bulldozer's architecture. Fix those and it should be improved quite a bit over this first gen version.

Quote :

I noticed in another blog the announcement of a B3 revision linked to an errata page. Since the BD B2 was just released, how long will it take for a B3 release? Or will this follow the path of the original Phenom and then the Phenom II?


This is most likely the fix we've been hearing about for Bulldozer. B2 has been around for some time, so B3 might be able to come out in late Q4. I would expect it to be more of an early Q1 release before Ivy Bridge comes out.

Reply to Haserath

BaronMatrix wrote :

Similar to mine, it sounds like. Because Maya, etc. are hardcore apps, they will get the XOP/AVX/FMAC treatment first. That's where BD shines RIGHT NOW.

After reading several posts, MEH, not going to bother with this. I just find it amazing when people become CPU gurus. Almost like overnight.


I would actually like comments from everybody on this if you don't mind. ;)

What is the difference between the XOP and AVX code paths for Bulldozer? I've heard that they perform just about the same, but I don't know much about XOP.

Reply to Haserath