
Bulldozer's IPC Death - "1000 little cuts"

October 18, 2011 1:42:47 AM

Now that Bulldozer is out in the wild, and we've had a few days to digest what it is, this thread will be about Bulldozer's foundation and what it might lead to.

I have two articles for everyone to read before they start the flamewars ;)  :
http://semiaccurate.com/2011/10/17/why-did-bulldozer-un...
http://semiaccurate.com/2011/10/17/bulldozer-doesnt-hav...
---------------
It seems that Bulldozer is a very unfinished product, as many forum members have been saying.

There are many quirks in Bulldozer that I do not like. Charlie seems to have the same ideas on a few things as I do.

The decode stage is up to 4 decoders to feed the cores, but decode width per core is down: it was 3 in Phenom II and is effectively 2 in Bulldozer. I thought this would be no big deal thanks to improvements elsewhere, but that didn't happen. And only two integer math instructions can execute at one time in a core, compared to a max of 3 in Phenom. Usually this isn't a big deal - the program might not use that third out-of-order integer pipeline - but it can affect performance in situations that do use it.
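To make the width point concrete, here's a contrived C sketch (my own toy example, nothing from AMD): the three accumulations in the loop body are independent of each other, so a core with three integer ALUs can in principle issue all of them each cycle, while a two-ALU Bulldozer core needs an extra cycle. Real code rarely maps this cleanly, so treat it as an illustration only.

/* Toy illustration of integer issue width. The three accumulators
 * below have no dependencies on each other, so a wider core can
 * execute more of them per cycle. Loop overhead (increment, compare,
 * branch) muddies the picture on real hardware. */
#include <stdio.h>

int main(void) {
    long long a = 0, b = 0, c = 0;
    for (long long i = 0; i < 100000000LL; i++) {
        a += i;        /* independent of b and c */
        b += i ^ 5;    /* independent of a and c */
        c += i >> 1;   /* independent of a and b */
    }
    printf("%lld %lld %lld\n", a, b, c);  /* keep the work observable */
    return 0;
}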

L1 and L2 cache latency is up, and not just a little. L1 is up 1 cycle; that is a 33% increase in latency for a smaller L1 cache. L2 is around 25-27 cycles, a near 70% increase over Phenom! What were they thinking, allowing these terrible latencies in Bulldozer's design? I agree with Charlie: they should have made the L2 cache 1MB with lower latency. They could then have used that smaller L2 with the L3 as spillover for anything extra that's needed, much like Sandy Bridge.
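If anyone wants to see those latencies for themselves, here's a rough pointer-chasing sketch of the kind reviewers use to estimate load-to-use latency. Everything in it is a made-up parameter: size WORKING_SET so the chain sits in the L1, L2, or L3, then convert the printed ns per load to cycles using your core clock.

/* Rough pointer-chasing latency sketch. Each load depends on the
 * previous one, and the chain is shuffled so the prefetcher cannot
 * guess the next address; time per iteration approximates the latency
 * of whichever cache level WORKING_SET lands in. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define WORKING_SET (2 * 1024 * 1024)   /* bytes; ~2MB targets BD's L2 */
#define ITERS       100000000L

int main(void) {
    size_t n = WORKING_SET / sizeof(void *);
    void **ring = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));

    /* Build a random circular permutation of the slots. */
    for (size_t i = 0; i < n; i++) idx[i] = i;
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        ring[idx[i]] = &ring[idx[(i + 1) % n]];

    void **p = &ring[idx[0]];
    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++)
        p = (void **)*p;                 /* serialized dependent loads */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("%.2f ns per load (sink=%p)\n", secs * 1e9 / ITERS, (void *)p);
    free(ring); free(idx);
    return 0;
}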

It also appears that GloFo isn't totally to blame for Bulldozer's woes on clock speed and power. There needs to be some fine-tuning of the architecture before we see what Bulldozer was really meant to be speed-wise; Piledriver might just be that finely tuned Bulldozer.
------
What does everyone think about Bulldozer?

I would like to talk about the architecture if anyone wants to get down and dirty with me. :sol: 
October 18, 2011 1:55:44 AM

If things designed in an arch were made to do their job and shut down, but instead have to work harder, temps and clocks can be affected.
The cache cycles don't look good, and it's slow anyway. Both Llano and BD have the same number of cycles on the L2, both have reached their power/perf curve, and it's a lot of uphill from there for power and heat as the perf curve drops off.
October 18, 2011 1:59:44 AM

I will say this once: keep it clean, people. No name calling, attacks, or majorly off-topic posts. I am not too worried about off topic; it's the other stuff that annoys me. We are all adults (well, maybe not all of us, but we should act like it). Tired of threads getting closed. I will try to stay on top of it, but so help me Jebus......

With that out of the way, I agree. For some reason AMD wants a large L2 cache that's slower. Intel dropped the large L2 cache with Nehalem, as they have a large L3 cache instead. And I don't think they would need a spillover L3; rather, they could easily do what Intel does and use the larger L3 as a buffer for instructions so the core doesn't have to recall them from the much slower system RAM if it needs them again.

Overall, BD is not impressive. That's because of the hype. But I can see the potential in the idea if AMD can pull a "Deneb" with it. Still, I fear it won't push Intel enough.

No, GF is not the only one to blame, as they sort of "inherited" AMD's 32nm process. I also think that Piledriver will be the BD that was meant to be; too bad it will have to face SB-E and IB instead of SB.
October 18, 2011 2:06:20 AM

Nice assessment
There are differences between the two chips, and that is probably why the cache size/hierarchy is different.
A lot can be done, I agree, and remember, this is first-gen 32nm and a whole new chip, a whole new approach.
I think they'll find that what looks good in sims is different from real usage, and those adjustments will come.
October 18, 2011 2:08:31 AM

JAYDEEJOHN said:
If things designed in an arch were made to do their job and shut down, but instead have to work harder, temps and clocks can be affected.
The cache cycles don't look good, and it's slow anyway. Both Llano and BD have the same number of cycles on the L2, both have reached their power/perf curve, and it's a lot of uphill from there for power and heat as the perf curve drops off.

I didn't know Llano had that much of an increase in L2 latency over Phenom II. Llano barely outperforms Athlon II despite the huge increase in transistors. What is the point of doubling the cache if you also nearly double the latency? :pfff: 
October 18, 2011 2:33:59 AM

jimmysmitty said:
I will say this once: keep it clean, people. No name calling, attacks, or majorly off-topic posts. I am not too worried about off topic; it's the other stuff that annoys me. We are all adults (well, maybe not all of us, but we should act like it). Tired of threads getting closed. I will try to stay on top of it, but so help me Jebus......

With that out of the way, I agree. For some reason AMD wants a large L2 cache that's slower. Intel dropped the large L2 cache with Nehalem, as they have a large L3 cache instead. And I don't think they would need a spillover L3; rather, they could easily do what Intel does and use the larger L3 as a buffer for instructions so the core doesn't have to recall them from the much slower system RAM if it needs them again.

Overall, BD is not impressive. That's because of the hype. But I can see the potential in the idea if AMD can pull a "Deneb" with it. Still, I fear it won't push Intel enough.

No, GF is not the only one to blame, as they sort of "inherited" AMD's 32nm process. I also think that Piledriver will be the BD that was meant to be; too bad it will have to face SB-E and IB instead of SB.

Sandy Bridge's L3 cache latency is 26-31 cycles vs. Bulldozer's L2 cache latency of 25-27 cycles. Sure, it's nearly the same latency, but Sandy Bridge's cache is one level higher!

So a single core has to go through the faster L1 and the faster L2 before hitting the L3 cache in Sandy Bridge. Bulldozer only hits its fast L1 before hitting a slow L2; they might as well shoot themselves in the foot right there.

I'm wondering why they made the L2 so big. The core would need good prediction to get high hit rates out of small caches. I have seen what branching can do to Bulldozer's performance, and it's not pretty. I don't even think better prediction would increase performance all that much unless it was excellent. The only real increase in per-core performance could come from predicting instructions into the L1 before they're needed, but the L1 is so small that it would be tough to keep getting correct predictions. Once you're out at the L2, you're already done; no performance can be gained through prediction.
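For anyone who hasn't watched what branching does, here's the classic demo (toy code, nothing BD-specific): the exact same loop runs far faster once the branch becomes predictable (sorted data) than when it is effectively random (unsorted data).

/* Classic branch-predictability demo: the condition is taken ~50% at
 * random on unsorted data (worst case for a predictor), but becomes a
 * long predictable run once the array is sorted. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    65536
#define REPS 2000

static long long sum_above(const int *a, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= 128)   /* hard-to-predict branch on random data */
            s += a[i];
    return s;
}

static int cmp_int(const void *x, const void *y) {
    return *(const int *)x - *(const int *)y;
}

int main(void) {
    int *a = malloc(N * sizeof(int));
    srand(1);
    for (int i = 0; i < N; i++) a[i] = rand() % 256;

    long long s = 0;
    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++) s += sum_above(a, N);
    double unsorted = (double)(clock() - t0) / CLOCKS_PER_SEC;

    qsort(a, N, sizeof(int), cmp_int);   /* branch now predictable */
    t0 = clock();
    for (int r = 0; r < REPS; r++) s += sum_above(a, N);
    double sorted = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("unsorted %.2fs, sorted %.2fs (sink=%lld)\n",
           unsorted, sorted, s);
    free(a);
    return 0;
}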

I don't think Bulldozer's foundation is strong enough to push Intel. They went for faster frequencies, but with a power/heat wall in their way.

Edit: grammar nazi :) 
October 18, 2011 2:46:43 AM

Good analysis. My thoughts are that they should scrap Bulldozer and go back to updating their Phenom II core. Add faster cache and new SSE instructions, add a couple more cores if they wanted to, and make it on the smaller 32nm process, and you would have a better CPU than Bulldozer. They could then keep working on the Bulldozer core in the background instead of selling the public engineering samples.
October 18, 2011 2:59:23 AM

iam2thecrowe said:
Good analysis. My thoughts are that they should scrap Bulldozer and go back to updating their Phenom II core. Add faster cache and new SSE instructions, add a couple more cores if they wanted to, and make it on the smaller 32nm process, and you would have a better CPU than Bulldozer. They could then keep working on the Bulldozer core in the background instead of selling the public engineering samples.

That would in fact improve performance over Bulldozer.

Increase the L3 size while decreasing the cache latencies. Improve the core's prediction. Make the cache inclusive instead of exclusive. Increase the core count to 8. Bam, you've just made a Phenom III that outperforms Bulldozer.
October 18, 2011 3:15:09 AM

Haserath said:
That would in fact improve performance over Bulldozer.

Increase the L3 size while decreasing the cache latencies. Improve the core's prediction. Make the cache inclusive instead of exclusive. Increase the core count to 8. Bam, you've just made a Phenom III that outperforms Bulldozer.
The question in my mind is why did the cache latencies increase so much? It must be something to do with the fabrication technology, because logically you wouldn't want to design a chip for high frequencies, then hobble it by increasing cache latencies. So they must have had no choice???
October 18, 2011 3:17:33 AM

When do people think Piledriver variants will be released?
October 18, 2011 4:13:53 AM

Chad Boga said:
When do people think Piledriver variants will be released?

Trinity is supposed to be out in Q1 2012. Hopefully it's released then and hopefully it performs better.
Quote:
The question in my mind is why did the cache latencies increase so much? It must be something to do with the fabrication technology, because logically you wouldn't want to design a chip for high frequencies, then hobble it by increasing cache latencies. So they must have had no choice???

I'm not exactly sure why. It doesn't make sense to me at all. The Pentium 4 was left behind due to the flawed concept that higher frequency was better. Sacrificing IPC for frequency means sacrificing efficiency.
October 18, 2011 4:40:05 AM

Do you guys think that cache latency was increased due to the increase in die size?
In SB, due to its smaller die size, everything's crammed together, and maybe that's why the latency is low. (?)

I'm no pro, but I think having only 4 floating-point modules is what's to blame. I use resource-heavy applications like Maya and 3ds Max, and these applications use the FP units for almost everything. Even media encoders use them (I'm guessing).
So technically, for these applications, it's like having only four cores.
For such applications BD is nothing but a four-core hyper-threaded (4c/8t) CPU.

Anyway,
1. The six-core BD could outperform the eight-core BD if they physically removed the disabled cores and put another 2 FP modules there. I know it can't be done overnight, but it's still food for thought. :D 

2. AMD, most of the time, stays ahead of Intel when it comes to introducing new stuff in CPUs, like 64-bit architecture, a good dual-core implementation, etc. But was it really necessary to have 128-bit FP modules? How many applications go that far as to use 128-bit operations? Are SSE2 and SSE3 capable of using that? (I'm asking because I've no idea.) I think they sacrificed too much of today's CPU foundation just to make it future-ready.
October 18, 2011 5:25:10 AM

iam2thecrowe said:
My thoughts are that they should scrap Bulldozer and go back to updating their Phenom II core.

Too late to do that now. BD is here. And after so many delays, the pressure to release something, anything, has got to be incredible.
October 18, 2011 5:36:23 AM

jsc said:
Too late to do that now. BD is here. And after so many delays, the pressure to release something, anything, has got to be incredible.


AMD's Bulldozer is like the original motor cars: unreliable and not clearly better than the horse, but the motor car (Bulldozer) can be tuned into something that beats the horse (K10.5).
October 18, 2011 5:39:39 AM

Haserath said:
I didn't know Llano had that much of an increase in L2 latency over Phenom II. Llano barely outperforms Athlon II despite the huge increase in transistors. What is the point of doubling the cache if you also nearly double the latency? :pfff: 


Because dies with complex GPUs don't scale well?

Haserath said:
Sandy Bridge's L3 cache latency is 26-31 cycles vs. Bulldozer's L2 cache latency of 25-27 cycles. Sure, it's nearly the same latency, but Sandy Bridge's cache is one level higher!

So a single core has to go through the faster L1 and the faster L2 before hitting the L3 cache in Sandy Bridge. Bulldozer only hits its fast L1 before hitting a slow L2; they might as well shoot themselves in the foot right there.

I'm wondering why they made the L2 so big. The core would need good prediction to get high hit rates out of small caches. I have seen what branching can do to Bulldozer's performance, and it's not pretty. I don't even think better prediction would increase performance all that much unless it was excellent. The only real increase in per-core performance could come from predicting instructions into the L1 before they're needed, but the L1 is so small that it would be tough to keep getting correct predictions. Once you're out at the L2, you're already done; no performance can be gained through prediction.

I don't think Bulldozer's foundation is strong enough to push Intel. They went for faster frequencies, but with a power/heat wall in their way.

Edit: grammar nazi :) 


Yes, it is a real shocker. Probably to accommodate the predictor.

AMD never had a finely tuned one in comparison to Intel. For the Pentium 4, it was essential to survival; Intel's inheritance of such a good predictor makes the Core arch a brilliant feat of tech.
October 18, 2011 6:07:27 AM

Haserath said:
Trinity is supposed to be out in Q1 2012. Hopefully it's released then and hopefully it performs better.
Quote:
The question in my mind is why did the cache latencies increase so much? It must be something to do with the fabrication technology, because logically you wouldn't want to design a chip for high frequencies, then hobble it by increasing cache latencies. So they must have had no choice???

I'm not exactly sure why. It doesn't make sense to me at all. The Pentium 4 was left behind due to the flawed concept that higher frequency was better. Sacrificing IPC for frequency means sacrificing efficiency.


I have heard Q3 2012 somewhere. Either way, Trinity and Piledriver will meet IB head on; not sure performance-wise though.

I think the higher latencies actually stem from the longer pipeline. Of course, I am no expert; I have been through many tech classes that talk about this, but the math always escaped me.

As for the Pentium 4, don't shun it too much. Intel took some things from NetBurst and put them towards the Core uarch and even SB (the Pentium 4 had a trace cache, much like the micro-op buffer SB has).

amdfangirl said:
Because dies with complex GPUs don't scale well?

Yes, it is a real shocker. Probably to accommodate the predictor.

AMD never had a finely tuned one in comparison to Intel. For the Pentium 4, it was essential to survival; Intel's inheritance of such a good predictor makes the Core arch a brilliant feat of tech.


I would say K8 was finely tuned. Not sure they could have sucked more out of it if they tried.
October 18, 2011 6:22:06 AM

As bad as it may sound, this has similarities to the TLB bug (to me anyway). In 6 months they may have it fixed (just like the TLB bug); it seems to me that they rushed it out before they could fix the pipeline/latency/scaling issues. They are fighting and rushing their own hype (overhyped BD, then a rushed release, just like Phenom I). BD is a great concept from AMD, but there was so much more they could have done before release.

I'd rather have the proc delayed and gotten right than rushed and turned into a CF.
October 18, 2011 7:03:54 AM

lockhrt999 said:
Do you guys think that cache latency was increased due to the increase in die size?
In SB, due to its smaller die size, everything's crammed together, and maybe that's why the latency is low. (?)

I'm no pro, but I think having only 4 floating-point modules is what's to blame. I use resource-heavy applications like Maya and 3ds Max, and these applications use the FP units for almost everything. Even media encoders use them (I'm guessing).
So technically, for these applications, it's like having only four cores.
For such applications BD is nothing but a four-core hyper-threaded (4c/8t) CPU.

Anyway,
1. The six-core BD could outperform the eight-core BD if they physically removed the disabled cores and put another 2 FP modules there. I know it can't be done overnight, but it's still food for thought. :D 

2. AMD, most of the time, stays ahead of Intel when it comes to introducing new stuff in CPUs, like 64-bit architecture, a good dual-core implementation, etc. But was it really necessary to have 128-bit FP modules? How many applications go that far as to use 128-bit operations? Are SSE2 and SSE3 capable of using that? (I'm asking because I've no idea.) I think they sacrificed too much of today's CPU foundation just to make it future-ready.

I believe it's due to the increase in pipeline depth and the increased size of the caches.

1. Integer perf would go below a Thuban's though!

2. 128-bit is actually used quite often in SSE-enabled programs. I think it mostly gangs together 32-bit or 64-bit FP operations and executes them at the same time. It's AVX's 256-bit operations that won't be used for quite some time.
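To show what I mean by ganging FP operations together, here's a rough sketch (compile with -msse on x86): a single 128-bit SSE instruction performs four 32-bit float adds at once.

/* One 128-bit SSE add: four packed 32-bit floats per instruction. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     /* packs 4 floats */
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);   /* four adds in one instruction */

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}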
Quote:
Because dies with complex GPUs don't scale well?

Maybe... but even with the GPU off it performs just slightly above the Athlon II. It would've been about the same if they had just left it as an Athlon II.
Quote:
Yes, it is a real shocker. Probably to accommodate the predictor.

AMD never had a finely tuned one in comparison to Intel. For the Pentium 4, it was essential to survival; Intel's inheritance of such a good predictor makes the Core arch a brilliant feat of tech.

Oh yes, the prediction unit really catapults Intel ahead. Prediction also needs a finely tuned arch to go with it or it is just a waste - the Pentium 4 shows that.
October 18, 2011 7:06:48 AM

That explains the pipeline stalls quite well. I don't think OS/BIOS patches are going to help Bulldozer there. Wait for Trinity!
October 18, 2011 10:05:07 AM

Cazalan said:
That explains the pipeline stalls quite well. I don't think OS/BIOS patches are going to help Bulldozer there. Wait for Trinity!

WAIT! How could anyone still be hoping to WAIT for something good from them? At best it's going to be 13% faster, according to AMD. Intel's CPUs are already more than 13% faster while only using 4 cores, and they have new CPUs launching any time now. The Bulldozer architecture is a flop; I don't think any amount of tweaking can help it. They should just throw it away, start again, and sell higher-clocked Phenom IIs for cheap.
October 18, 2011 10:15:41 AM

The only thing Trinity has going for it is an IGP that doesn't suck.

With the possibility it might actually be worse than Llano in performance, I'd safely invest in a Sandy Bridge CPU.

Unless you're getting a laptop. Then wait for Ivy Bridge or Trinity.
October 18, 2011 1:20:48 PM

iam2thecrowe said:
Good analysis. My thoughts are that they should scrap Bulldozer and go back to updating their Phenom II core. Add faster cache and new SSE instructions, add a couple more cores if they wanted to, and make it on the smaller 32nm process, and you would have a better CPU than Bulldozer. They could then keep working on the Bulldozer core in the background instead of selling the public engineering samples.


I sorta think AMD should try to combine the best features of P2 and BD - 8 full cores, each with a faster 1MB L2 cache, 4-issue decoders, etc. Yeah, it would be bigger than 315mm^2, but it seems AMD is used to making huge-die CPUs anyway. Or else hurry up with 22nm.
October 18, 2011 1:43:41 PM

Assuming S/A is accurate, what I find surprising is the lack of adequate modeling tools to predict CPU performance under various loads and debug the design before committing to silicon. The second part of Demerjian's article mentions simulations only able to model part of a CPU for a few clock cycles. So this explains why the engineers first told JF that IPC would be higher according to their simulations, then apparently backed off that statement later.

I suspect this was also the case for the infamous "Barcelona up to 40% faster than Core2 in a wide variety of workloads" statement back in 2007.

I also suspect Intel has better modeling tools available - they seem to hit their CPU performance predictions more often, ever since Conroe anyway.

October 18, 2011 2:35:25 PM

I really do think that BD is unfinished. AMD got pressured too much for a release of any CPU, and they had to release BD before it was even ready. I am not sure that BD will be a competitive CPU, but I am sure that the next lineup will be much better. Only time will tell, unless there is an AMD employee around here who won't mind sharing some details with us. ;) 
October 18, 2011 4:43:07 PM

There WAS an AMD rep here, but ever since the release of Bulldozer he's been MIA.
October 18, 2011 5:04:44 PM

phatbuddha79 said:
There WAS an AMD rep here, but ever since the release of Bulldozer he's been MIA.


That was JF-AMD. He recently posted over in the AnandTech forums about all the hate mail and flames he has been getting, so he has given up posting for the time being; he does it on his own time and can only post info that has been approved by the engineering team. Which is why he posted many months ago that IPC would be better on BD than on the prior architecture (P2).

October 18, 2011 6:07:08 PM

fazers_on_stun said:
Assuming S/A is accurate, what I find surprising is the lack of adequate modeling tools to predict CPU performance under various loads and debug the design before committing to silicon. The second part of Demerjian's article mentions simulations only able to model part of a CPU for a few clock cycles. So this explains why the engineers first told JF that IPC would be higher according to their simulations, then apparently backed off that statement later.

I suspect this was also the case for the infamous "Barcelona up to 40% faster than Core2 in a wide variety of workloads" statement back in 2007.

I also suspect Intel has better modeling tools available - they seem to hit their CPU performance predictions more often, ever since Conroe anyway.


They probably didn't expect power draw to be as big a factor as it was. Putting everything else aside, if power draw were lower and BD was clocked about a gig faster, we wouldn't be having this discussion.

I'll say it again: power and heat are the two major limiting factors on speed.
October 18, 2011 6:16:05 PM

fazers_on_stun said:
That was JF-AMD. He recently posted over in the Anandtech forums about all the hate mail and flames he has been getting, so he has given up posting for the time being anyway, as he does it on his own time and can only post info that has been approved by the engineering team. Which is why he posted many months ago that IPC would be better on BD than prior architecture (P2).


2 possible things happened. Either he lied to us, OR he was lied to.
October 18, 2011 6:38:20 PM

IH8U said:
As bad as it may sound, this has similarities to the TLB bug (to me anyway). In 6 months they may have it fixed (just like the TLB bug); it seems to me that they rushed it out before they could fix the pipeline/latency/scaling issues. They are fighting and rushing their own hype (overhyped BD, then a rushed release, just like Phenom I). BD is a great concept from AMD, but there was so much more they could have done before release.

I'd rather have the proc delayed and gotten right than rushed and turned into a CF.

There are striking similarities between Barcelona and Bulldozer at launch. I mentioned this very point just last night.

The 9600 Black Edition was a horrible part, and it launched at a price higher than the Q6600. It wouldn't clock, it was power hungry, and it was 20 to 25% slower per clock with the TLB patch.

A lot more like Bulldozer than I think most people remember. And it matured into a quality part.

Personally, I would have preferred a die-shrunk, 8-core Thuban while Bulldozer matured. But that's water under the bridge.
October 18, 2011 6:39:52 PM

iam2thecrowe said:
WAIT! How could anyone still be hoping to WAIT for something good from them? At best it's going to be 13% faster, according to AMD. Intel's CPUs are already more than 13% faster while only using 4 cores, and they have new CPUs launching any time now. The Bulldozer architecture is a flop; I don't think any amount of tweaking can help it. They should just throw it away, start again, and sell higher-clocked Phenom IIs for cheap.


Once again, Barcelona and the X4 9600 Black Edition. Those who don't learn from history............
October 18, 2011 6:43:33 PM

fazers_on_stun said:


I also suspect Intel has better modeling tools available - they seem to hit their CPU performance predictions more often, ever since Conroe anyway.


Remember that Intel has been using the same basic execution core since Conroe. Lots of tweaks and minor improvements, along with the on-die memory controller. It's a lot easier to do that than to redesign the core. AMD had a much more difficult task.

October 18, 2011 6:59:01 PM

What does it take to make a desktop CPU scalable? If they could just design a chipset which could accommodate dual Bulldozers, only then would it be worthwhile to invest in them. I know it's not gonna happen.

Workstation CPUs are very expensive. :( 
October 18, 2011 7:16:55 PM

phatbuddha79 said:
2 possible things happened. Either he lied to us, OR he was lied to.


I disagree. I'm sure they thought they would have the IPC up beyond Phenom II at launch. They just didn't succeed.
October 18, 2011 7:19:45 PM

Quote:
What does it take to make a desktop CPU scalable?


The CPU isn't the problem; software is.

Until you get a programming language that actually lends itself to parallel applications implicitly, and an OS written in said language, you will almost never reach the dream of decent scalability.

I think you are going to see GPUs come more to the forefront, simply because they are designed to handle massively parallel applications, unlike CPUs, which are still best at handling a few things at one time, but at a very fast speed. They complement each other.
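A tiny sketch of the point about nothing being implicit (toy code of mine; build with -pthread): even a plain sum has to be split, launched, and joined by hand in C, which is exactly the burden an implicitly parallel language would lift.

/* Explicit parallelism in C: the programmer partitions the data,
 * spawns the threads, and joins them by hand. Nothing here happens
 * implicitly. */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N (1 << 22)

static double data[N];

struct slice { int begin, end; double partial; };

static void *sum_slice(void *arg) {
    struct slice *s = arg;
    double acc = 0.0;
    for (int i = s->begin; i < s->end; i++)
        acc += data[i];
    s->partial = acc;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[N_THREADS];
    struct slice sl[N_THREADS];
    int chunk = N / N_THREADS;

    for (int t = 0; t < N_THREADS; t++) {
        sl[t].begin = t * chunk;
        sl[t].end = (t == N_THREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, sum_slice, &sl[t]);
    }

    double total = 0.0;
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);     /* explicit synchronization */
        total += sl[t].partial;
    }
    printf("total = %.0f\n", total);    /* expect 4194304 */
    return 0;
}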
October 18, 2011 7:34:13 PM

I still think hyperthreading is better for the desktop. The processor can use all its resources for one thread or share them between two threads.

Intel has four decoders for one core, while AMD has four decoders for one module.

Intel has 3 ALUs, AGUs capable of 3 stores and 2 loads, 3 SSE units, and 3 FPUs for one of their cores. Bulldozer has 2 ALUs, 2 AGUs, and 2 FPU pipes for each core when they aren't sharing.

Just from looking at that, you can see Intel has the upper hand on resources for a single thread. It's when all of Bulldozer's integer cores are being used that it can actually come near the 2600K. The only problem is that Intel's arch is much more efficient than Bulldozer's, keeping Sandy Bridge within reach of Bulldozer even with fewer total resources.
October 18, 2011 7:35:04 PM

My take is that there are too many little problems, like Charlie suggests, which if left unattended can cause a lot of power and heat problems.
As for JF, things changed on him, as you could tell by his later posts, where the IPC issue simply wasn't addressed by him. By the time the chip was truly ready, he knew he had said too much about planned ideals, and surely couldn't back off and change what he'd said earlier; not his fault.
It seems both Intel and AMD do extremely well with a known arch; Intel starts and finishes a tad longer than AMD, and has more resources to throw at it as well.
That being the case, the new CEO is forming a tick-tock of sorts at AMD. Will it succeed? Well, the good thing is that the BD arch is a good one, with much to build upon, which allows for familiarity, and that is something AMD does well.

The sim tools and time aren't AMD's friends, plus a slightly harder crunch coming from GF with its early problems: a snowball effect.
My 2c
October 18, 2011 8:19:42 PM

Haserath said:
I still think hyperthreading is better for the desktop. The processor can use all its resources for one thread or share them between two threads.


As can Bulldozer, when the loads are properly scheduled.

Remember, hyperthreading has to be scheduled to work properly too.
October 18, 2011 8:32:11 PM

gamerk316 said:
Quote:
What does it take to make a desktop CPU scalable?

The CPU isn't the problem; software is.
Until you get a programming language that actually lends itself to parallel applications implicitly, and an OS written in said language, you will almost never reach the dream of decent scalability.
I think you are going to see GPUs come more to the forefront, simply because they are designed to handle massively parallel applications, unlike CPUs, which are still best at handling a few things at one time, but at a very fast speed. They complement each other.


Most of the software I use can scale up to 128 cores or more. That's why I was eagerly hoping for some kick-ass performance from BD, as I can't afford a multi-CPU workstation setup right now.
Software like Maya can run on a supercomputer with thousands of processors.
My needs are a little different.
October 18, 2011 10:21:43 PM

gamerk316 said:
They probably didn't expect power draw to be as big a factor as it was. Putting everything else aside, if power draw were lower and BD was clocked about a gig faster, we wouldn't be having this discussion.

I'll say it again: power and heat are the two major limiting factors on speed.


I agree. Hopefully AMD can improve the tools and get PD up to speed.
October 18, 2011 10:26:39 PM

phatbuddha79 said:
2 possible things happened. Either he lied to us, OR he was lied to.


I don't think either one. Last year, before all the problems were known, the engineers probably assured JF that IPC would improve, according to their simulations. Then, as various problems surfaced, they stopped telling him that. Unfortunately, JF is limited as to what he can tell us, so I think he sorta hinted that maybe IPC wouldn't improve by discounting its importance and instead emphasizing multi-core performance. After all, he is director of server marketing, and thus he is not going to emphasize problems but advantages as much as he can. You just have to read between the lines and pay attention not only to what he does say, but to what he doesn't (or no longer) say.
October 18, 2011 10:31:52 PM

I'm sure the high latency of the cache is partly due to the manufacturing process and also due to servicing two threads. The large amount of cache was probably to make up for the latency and for other minor problems with the architecture. Even so, performance could be improved if the L2 cache had a much lower latency, even if it had to be split into two smaller 512KB caches, one for each thread. The L3 cache would also have to be reworked to provide lower latency, though it seems odd to me that each module has its own L3 cache rather than a larger shared L3 the way Phenom and current Intel chips do.

Reworking the fetch and decode area as well as the cache is going to be needed to raise IPC, but even then it may not be enough, and that's a pretty tall order for them. I can't even imagine what would have happened if Dirk hadn't stopped the 45nm Bulldozer parts, as they would have performed even worse and set AMD back even farther.
October 18, 2011 11:05:32 PM

lockhrt999 said:
Most of the software I use can scale up to 128 cores or more. That's why I was eagerly hoping for some kick-ass performance from BD, as I can't afford a multi-CPU workstation setup right now.
Software like Maya can run on a supercomputer with thousands of processors.
My needs are a little different.




Similar to mine, it sounds like. Because Maya etc. are hardcore apps, they will get the XOP/AVX/FMAC treatment first. That's where BD shines RIGHT NOW.

After reading several posts, MEH, not going to bother with this. I just find it amazing when people become CPU gurus. Almost like overnight.
October 18, 2011 11:06:19 PM

fazers_on_stun said:
I don't think either one. Last year, before all the problems were known, the engineers probably assured JF that IPC would improve, according to their simulations. Then, as various problems surfaced, they stopped telling him that. Unfortunately, JF is limited as to what he can tell us, so I think he sorta hinted that maybe IPC wouldn't improve by discounting its importance and instead emphasizing multi-core performance. After all, he is director of server marketing, and thus he is not going to emphasize problems but advantages as much as he can. You just have to read between the lines and pay attention not only to what he does say, but to what he doesn't (or no longer) say.




JF knows crap all about desktop. He was always talking about SERVER.
October 18, 2011 11:15:11 PM

megamanx00 said:
I'm sure the high latency of the cache is partly due to the manufacturing process and also due to servicing two threads. The large amount of cache was probably to make up for the latency and for other minor problems with the architecture. Even so, performance could be improved if the L2 cache had a much lower latency, even if it had to be split into two smaller 512KB caches, one for each thread. The L3 cache would also have to be reworked to provide lower latency, though it seems odd to me that each module has its own L3 cache rather than a larger shared L3 the way Phenom and current Intel chips do.

Reworking the fetch and decode area as well as the cache is going to be needed to raise IPC, but even then it may not be enough, and that's a pretty tall order for them. I can't even imagine what would have happened if Dirk hadn't stopped the 45nm Bulldozer parts, as they would have performed even worse and set AMD back even farther.

I looked that up, and I can't believe that each module can only use 2MB of the L3. The L3 is totally inclusive of the L2 to prevent major snoop traffic between the cores, so it helps highly threaded workloads but hurts lightly threaded ones. AMD needs to rethink their cache.

Perhaps the 4150 will scale a little better if each module can use 4MB then.

Dirk was wise to stop this from being released earlier; it's still not ready. Now I'm thinking even Piledriver won't fix enough flaws to make it compete against Ivy Bridge; the foundation is rotten...

I see why Cray bought up the chips though; it really is a good HPC chip meant for very parallel workloads.

CPUs aren't as good as GPUs at parallel work, though. They should stick with serial strength on the CPU and parallel strength on the GPU with their APU strategy.
October 18, 2011 11:21:19 PM

BaronMatrix said:
Similar to mine, it sounds like. Because Maya etc. are hardcore apps, they will get the XOP/AVX/FMAC treatment first. That's where BD shines RIGHT NOW.

After reading several posts, MEH, not going to bother with this. I just find it amazing when people become CPU gurus. Almost like overnight.

AMD will definitely pull ahead when their new instructions are used, but for most of us, they won't be used much. I would love to see real-world usage of the FMAC instructions on an 8150 against a 2600K; I would bet the 8150 is at least 30% faster then.
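For anyone curious what an FMAC actually buys you, here's a rough sketch (my assumption is GCC with -mfma4 on a Bulldozer chip; it falls back to plain C elsewhere): a*b + c fused into a single operation per 32-bit lane.

/* FMA4 sketch: one fused multiply-add (a*b + c) per 32-bit lane.
 * Guarded so it still builds without FMA4 support. */
#include <stdio.h>
#ifdef __FMA4__
#include <x86intrin.h>   /* _mm_macc_ps lives here with -mfma4 */
#endif

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 10.0f, 10.0f, 10.0f};
    float c[4] = {0.5f, 0.5f, 0.5f, 0.5f};
    float r[4];

#ifdef __FMA4__
    __m128 vr = _mm_macc_ps(_mm_loadu_ps(a), _mm_loadu_ps(b),
                            _mm_loadu_ps(c));   /* fused a*b + c */
    _mm_storeu_ps(r, vr);
#else
    for (int i = 0; i < 4; i++)
        r[i] = a[i] * b[i] + c[i];              /* unfused fallback */
#endif

    printf("%.1f %.1f %.1f %.1f\n", r[0], r[1], r[2], r[3]);
    return 0;
}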
October 18, 2011 11:22:24 PM

This discussion of the BD failings is all well and good; however, is there a realistic way to improve upon this BD architecture and have a viable desktop competitor to SB in the Windows 7/8 market?
October 18, 2011 11:36:57 PM

guskline said:
This discussion of the BD failings is all well and good; however, is there a realistic way to improve upon this BD architecture and have a viable desktop competitor to SB in the Windows 7/8 market?


Of course there is. As the article so eloquently stated:

"The difference between improving IPC a bit and losing double digit percentages is a very fine line, and the devil is in the details "

That's pretty much true of the whole arch. It has real potential.

October 18, 2011 11:51:03 PM

I noticed in another blog the announcement of a B3 revision linked to an errata page. Since BD B2 was just released, how long will it take for a B3 release? Or will this follow the path of the original Phenom and then the Phenom II?
October 18, 2011 11:58:51 PM

guskline said:
This discussion of the BD failings is all well and good; however, is there a realistic way to improve upon this BD architecture and have a viable desktop competitor to SB in the Windows 7/8 market?

Quote:
Of course there is. As the article so eloquently stated:

"The difference between improving IPC a bit and losing double digit percentages is a very fine line, and the devil is in the details "

That's pretty much true of the whole arch. It has real potential.

Agreed, there are quite a few little mishaps in this first iteration of Bulldozer's architecture. Fix those and it should be improved quite a bit over this first-gen version.
Quote:
I noticed in another blog the announcement of a B3 revision linked to an errata page. Since BD B2 was just released, how long will it take for a B3 release? Or will this follow the path of the original Phenom and then the Phenom II?

This is most likely the fix we've been hearing about for Bulldozer. B2 has been around for some time, so B3 might be able to come out in late Q4. I would expect it to be more of an early Q1 release, before Ivy Bridge comes out.
October 19, 2011 1:10:02 AM

BaronMatrix said:
Similar to mine it sounds like. Because Maya, etc are hardcore apps, they will get the XOP\AVX\FMAC treatment first. That's where BD shines RIGHT NOW.

After reading several posts, MEH, not going to bother with this. I just find it amazing when people become CPU gurus. Almost like over night.

I would actually like comments from everybody on this if you don't mind. ;) 

What is the difference between the XOP and AVX code paths for Bulldozer? I've heard that it performs just about the same, but I don't know much about XOP.