What improvements can boost the K8 m-arch performance?

gOJDO

Distinguished
Mar 16, 2006
2,309
1
19,780
There is no information from reliable sources about the next AMD architecture, but there are rumors that it will be an improved version of the old K8. Some of the K8 design engineers say the architecture still has room to improve and that its evolution looks promising. I have been thinking about it, but I don't have any new ideas that haven't been mentioned before. So I was wondering about some weak points of the K8, how they could be improved, and how that would affect K8 performance.
The new DDR2 IMC on the pre-release sAM2 K8 proved ineffective because of the high latency of DDR2 memory modules.
The single-core K8s with the 128-bit DDR IMC are not starved for bandwidth, but I think that's not the case for the dual-cores. I guess they would take advantage of more bandwidth, but only with memory access latency as low as it is with DDR. The ratio of L2 cache frequency to RAM frequency is much better with DDR2-800 than with DDR-400, but the higher latency puts the double-clocked DDR2 at a disadvantage, and overall performance remains almost the same.
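To put rough numbers on the bandwidth-versus-latency trade-off described above, here is a minimal sketch. The CAS latencies are assumed example timings, not measurements of any specific module, and the function names are made up for illustration. It only covers peak transfer rate and the CAS delay, not the full access latency seen by the core.

```python
# Illustrative arithmetic only: the CAS values below are assumed example
# timings, not measurements of any particular module.

def peak_bandwidth_gbs(bus_bits, mega_transfers_per_sec):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mega_transfers_per_sec * 1e6 / 1e9


def cas_latency_ns(cas_cycles, io_clock_mhz):
    """Time taken by the CAS delay alone, in nanoseconds."""
    return cas_cycles / io_clock_mhz * 1000.0


# name: (MT/s, I/O clock in MHz, assumed CAS latency in cycles)
MODULES = {
    "DDR-400 CL2": (400, 200, 2),
    "DDR2-800 CL5": (800, 400, 5),
    "DDR2-800 CL4": (800, 400, 4),
}

for name, (mts, clock_mhz, cl) in MODULES.items():
    bw = peak_bandwidth_gbs(128, mts)      # 128-bit (dual-channel) IMC
    lat = cas_latency_ns(cl, clock_mhz)
    print(f"{name}: {bw:.1f} GB/s peak, {lat:.1f} ns CAS")
```

Under these assumed timings, DDR2-800 doubles the peak bandwidth, but its CAS delay in nanoseconds is no shorter than that of DDR-400 at CL2, which is consistent with overall performance staying about the same.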
I think a LOAD/STORE reorder scheduler (like Smart Memory Access on the future Intel Core arch chips) would improve the efficiency of the DDR2 IMC on the K8. Maybe there will be DDR2 modules with lower latency when DRAM producers move to 65nm DRAM cells, but I think the scheduler would improve performance in that case too (unlike the Core arch chips, which access memory via the FSB and the northbridge and get almost no advantage from lower-latency DDR2).
A shared L3 cache built with Z-RAM is another rumor that seems possible. If this happens, then a large L3 will boost K8L performance in cache-sensitive apps, but that is not the spirit of the K8. K8 chips are very fast for multimedia and gaming because this kind of software does not need a large on-chip cache, but does need faster memory access and more memory bandwidth. With the scheduler I have in mind and a large shared L3, it would be a data-streaming monster.
There are rumors that the improved K8 (K8L or K10) will have wider-issue superscalar cores. That means reworking almost the whole K8 architecture: a new fetcher, decoder, and branch predictor, wider on-chip buses, and so on. I guess this will help too, but I am not sure how effective this improvement will be, considering the number of extra transistors involved.
If K8L (or K10) is wider-issue, then there will be more execution units. I guess they will be 128-bit, so a 128-bit SIMD instruction could be completed each cycle. I wonder how reducing the number of FP execution stages would affect K8 FP performance.
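A rough way to see what 128-bit execution units would buy for SIMD is the throughput model below. It is a simplified sketch under stated assumptions: two FP units (one add, one multiply), a unit narrower than the vector needs several passes per instruction, and everything else (decode, scheduling, memory) is ignored. The function names are hypothetical.

```python
import math


def passes_per_vector_op(unit_width_bits, vector_bits=128):
    """How many passes a unit needs to process one full vector instruction."""
    return math.ceil(vector_bits / unit_width_bits)


def peak_sp_flops_per_cycle(unit_width_bits, n_units=2,
                            vector_bits=128, lane_bits=32):
    """Peak single-precision FLOPs per cycle in this simplified model."""
    lanes = vector_bits // lane_bits            # 4 floats in a 128-bit vector
    passes = passes_per_vector_op(unit_width_bits, vector_bits)
    return n_units * lanes / passes


# Assumed configurations: 64-bit wide units versus 128-bit wide units.
for width in (64, 128):
    print(width, "bit units:", peak_sp_flops_per_cycle(width), "SP FLOPs/cycle")
```

In this model, widening the units from 64 to 128 bits doubles the peak, but only code that actually keeps both units fed with packed SSE instructions would see anything close to that.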
So, what do you think about my reasoning? Am I on the right track?
And what else might boost current K8 performance?
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
There are a few things that will speed up K8.

First, adding new fpinteger registers. (on tap)
Adding more Out of order buffers. (?)
Better prefetch logic (?)
Increasing clock speed (on tap); this may be a double-edged sword if the next gen is clocked lower.
Doubling the width of the pipeline (?) would produce 6 IPC
Adding more complex decoders
Adding more L2 (this would enable more efficient BW usage)
Increasing HT speed (this would be a future proofing mechanism as PCIe will get faster)

In other words K8 isn't dead yet and AMD does have the man responsible for the Alpha design. At 65nm they will have the momentum to release more aggressive designs at the same cost.

The big thing to remember is that no one really needs faster PCs. Game tech is actually driving innovation, not productivity, but then how interesting can a browser be.
 

shabodah

Distinguished
Apr 10, 2006
747
0
18,980
Three things I see as easy to do that should make a large impact on performance are:

1) Improving SSE efficiency: if there is a way to do an SSE instruction in one clock like Conroe does, it will be huge.
2) L2 cache: a Z-RAM shared cache would be extremely versatile.
3) Raising clock speed: the obvious answer.
 

spud

Distinguished
Feb 17, 2001
3,406
0
20,780
1. Floating-point integer registers? I won't laugh, but that's very silly.
2. Increasing the buffer sizes would help; adding additional buffers to the K8 at this point would be wasted IC.
3. Sure, but it's tough to beat Intel at prefetching.
4. Clock skew is crippling clock speed increases; I dare say it's almost time to add additional pipeline stages.
5. Currently the entire pipeline of the K8 is 128-bit; doubling that to 256-bit would do nothing other than add IC. Additionally, IPC increases come from additional decoders (more instructions in, more instructions out).
6. Nah, a waste of IC space in my opinion, but all the power to them.
7. Bandwidth efficiency isn't something the K8 has to worry about, so additional L2 doesn't really seem to help, at least for the K8.
8. HT speed right now is pretty much overkill, but more speed isn't a bad thing.

They used to have Jim Keller, and last I checked Dirk Meyer was gunning for president of AMD. I kind of doubt he has the same time to apply his skill to AMD's IC design team.
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
The funny thing is I wrote this before I saw the AMD stuff from the Spring Forum. Everything I mentioned is in that article.

Except the decoders. Obviously they're as dumb as me.

BTW, widening is NOT lengthening. 2 x 128 is wider. Adding L3 will also increase PERFORMANCE by using bandwidth more efficiently. CPU perf is not just IPC.
 

spud

Distinguished
Feb 17, 2001
3,406
0
20,780
I never said you were dumb, nor did I insinuate it, but if you thought I did, I apologize; that was not my intention. Sometimes it's how it's said, not what is said, and on occasion I tend not to think about how others may interpret it.

It may help the K8, but from what has been seen the K8 is not bandwidth starved.
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
In terms of bandwidth from main memory, no they aren't yet, but by increasing the bandwidth of the cache by adding L3 they will start to be. Even with just the other improvements they may be able to complete decodes faster. The actual clock pulse is from a separate mechanism so if they can have more work done between clocks then that will produce higher IPC. I can't do any diagrams so if you think about it like DDR maybe you'll understand what I mean.
 

gOJDO

Distinguished
Mar 16, 2006
2,309
1
19,780
1. No, adding L3 may just reduce the number of accesses and operations to RAM. With it, the RAM will be used less, so more bandwidth will be available.
I agree with Spud that the dual-core K8 does not need more bandwidth (the transition from s754 to s939 showed that the single-core K8 did not need additional bandwidth). But I think the quad-core K8 will need the extra bandwidth.
2. So, you agree with me. More IPC (instructions per clock) means better performance.
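The point about L3 cutting down RAM accesses can be sketched with the usual average-memory-access-time calculation. All latencies and hit rates below are hypothetical round numbers chosen for illustration, not K8 measurements, and the function name is made up.

```python
def amat_and_dram_fraction(levels, memory_latency):
    """Average memory access time (cycles) and fraction of accesses hitting DRAM.

    levels: list of (lookup_latency_cycles, hit_rate), nearest cache first.
    Every access that reaches a level pays that level's lookup latency.
    """
    total = 0.0
    reach = 1.0                      # fraction of accesses reaching this level
    for latency, hit_rate in levels:
        total += reach * latency
        reach *= (1.0 - hit_rate)    # misses continue to the next level
    return total + reach * memory_latency, reach


# Hypothetical hierarchy: (latency in cycles, hit rate)
L1 = (3, 0.95)
L2 = (12, 0.80)
L3 = (40, 0.50)     # assumed large shared L3
DRAM_CYCLES = 200   # assumed main-memory latency

without_l3 = amat_and_dram_fraction([L1, L2], DRAM_CYCLES)
with_l3 = amat_and_dram_fraction([L1, L2, L3], DRAM_CYCLES)
print("no L3:   AMAT %.2f cycles, %.2f%% of accesses go to DRAM"
      % (without_l3[0], without_l3[1] * 100))
print("with L3: AMAT %.2f cycles, %.2f%% of accesses go to DRAM"
      % (with_l3[0], with_l3[1] * 100))
```

With these made-up figures the L3 halves the share of accesses that reach DRAM, which frees memory bandwidth without the memory itself getting any faster.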
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
1. That's one way to increase bandwidth. If L3 is faster and prefetch is better, that means more data can be pulled from main memory.
2. IPC is one way to increase perf. Imagine a 4-issue core that has to wait 1000 cycles every 100,000 cycles because of RAM (L1, L2, L3 or main memory). Do you think a 4-issue core that only waits 100 cycles every 100,000 will be faster?
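The stall question above is easy to work through. This is a minimal sketch that assumes the core sustains its full issue width whenever it is not waiting on memory, which real code rarely does; the second scenario uses hypothetical numbers to show a memory-bound case.

```python
def effective_ipc(peak_ipc, window_cycles, stall_cycles):
    """IPC averaged over a window in which some cycles are spent stalled."""
    busy_cycles = window_cycles - stall_cycles
    return peak_ipc * busy_cycles / window_cycles


# Numbers from the post: 4-issue core, 1000 vs 100 stall cycles per 100,000.
slow = effective_ipc(4, 100_000, 1_000)
fast = effective_ipc(4, 100_000, 100)
print("post's example:", slow, "->", fast, "speedup", fast / slow)

# Hypothetical memory-bound case: half the window is stalls, then cut tenfold.
slow_mb = effective_ipc(4, 100_000, 50_000)
fast_mb = effective_ipc(4, 100_000, 5_000)
print("memory-bound:  ", slow_mb, "->", fast_mb, "speedup", fast_mb / slow_mb)
```

So the answer is yes, it will be faster, but with the exact numbers in the post the gain is under 1%. Cutting stalls only pays off in a big way when stalls are a large share of the total cycles.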
 

1Tanker

Splendid
Apr 28, 2006
4,645
1
22,780
The best thing AMD can do to really boost K8 is buy Conroe from Intel, sans IHS, and throw an AMD IHS on it. Then tell the media that LGA775 is the future platform. :eek:

J/K J/K. I foresee the dander rising on many "fans".

Sorry all... Couldn't resist. That was a golden opportunity to tease.
 

gOJDO

Distinguished
Mar 16, 2006
2,309
1
19,780
1. Yes, better prefetch means more data fetched, but that implies a wider-issue superscalar core, a new decoder, more execution units, and so on.
So bandwidth will be needed, but as long as the K8 has the same number of pipeline stages, it will be starved by high-latency RAM. L3 can fix this by reducing RAM operations and buying back CPU time otherwise lost to memory-access cycles.
2. IPC is the overall number of instructions per clock completed across all of the execution units. So a 6-issue superscalar will achieve more IPC than a 3-issue one on the same architecture, but not always twice as much. The architecture and other optimizations matter more for achieving IPC. For example, both Pentium M (Dothan) and Pentium D (Prescott) are 3-issue superscalars, but Dothan, at a much lower clock, achieves more IPC than Prescott. It is more efficient. Doubling the resources doesn't mean doubling the performance. The K8L promises so much, but it has to be built first and then benchmarked before we can draw any conclusions, as we've done for Conroe.
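The relationship behind this argument is simply throughput = IPC x clock. The sketch below uses purely hypothetical IPC and clock figures for two made-up chips; they are not measured values for Dothan, Prescott, or any other real part.

```python
def instructions_per_second(ipc, clock_ghz):
    """Sustained instruction throughput for a given average IPC and clock."""
    return ipc * clock_ghz * 1e9


# Hypothetical chips with the same issue width but different efficiency.
efficient_chip = instructions_per_second(ipc=1.5, clock_ghz=2.0)
high_clock_chip = instructions_per_second(ipc=0.9, clock_ghz=3.2)

print("efficient, lower clock: %.2e instr/s" % efficient_chip)
print("less efficient, higher clock: %.2e instr/s" % high_clock_chip)
```

In this made-up case the lower-clocked chip still comes out ahead, because its higher achieved IPC more than covers the clock deficit. Issue width only sets the ceiling on IPC; what the design sustains in practice is what counts.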
 

islammanjurul

Distinguished
May 18, 2006
21
0
18,510
But will AMD be able to beat Intel's Conroe if it adopts the improvements suggested above? I feel very sorry for AMD, about to be beaten by Intel again... maybe it will happen in June!
 

gOJDO

Distinguished
Mar 16, 2006
2,309
1
19,780
They will be able to beat Conroe, but not now. Later, they will come out with a new architecture that is better than Conroe. Then Intel will come out with Core3 Quadro, or whatever they name their next m-arch, then AMD will come out with something better than Core3 Quadro, and so on. The race never ends; they keep producing chips a little faster than the ones before and selling them to us for a lot of money. That's business.
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
AMD is saying the same things about the improvements I suggested, so what are you arguing with me for? You just end up saying the same thing I said. Most modern processors have superscalar "potential." It's a matter of routing and registers.

BTW, why did you feel the need to explain what IPC is?
Also, you can't compare Conroe and K8L because K8 has been running circles around everything, so if they are keeping the architecture and improving it by 80%+ based on KNOWN techniques, we can easily infer that it will be 50% faster clock for clock (random number comparison).