Collection of AMD K10 data

February 9, 2007 3:07:11 PM



Barcelona ES:

Barcelona die (11 metal layers, 283mm^2, 463M transistors):

Barcelona wafer:


Macro/micro-architectural improvements over K8:

Quad-core
- Native quad-core design
- Redesigned and improved crossbar(northbridge)
- Improved power management
- New level of cache added, L3 VICTIM
Power management - DICE(Dynamic Independent Core Engagement)
- PLLs for each core, so each core is clocked independently and varies its clock speed depending on usage.
- ODMC power management: ability to shut down read channels if memory traffic is write-only, and vice versa:
* Reduces the power consumption of the memory controller by up to 80% on "many" workloads.
- Aggressive fine-grained clock gating
- Power management state invariant time stamp counter (TSC)
- Enhanced AMD PowerNow! - works independently, without OS driver support
Virtualization improvements
- Nested Paging(NP):
* Guest and host page tables both exist in memory. (The CPU walks both page tables)
* A nested walk can take up to 24 memory accesses! (Hardware caching accelerates the walk; a rough count is sketched after this list)
* "Wire-to-wire" translations are cached in TLBs
* NP eliminates Hypervisor cycles spent managing shadow pages (as much as 75% of Hypervisor time)
- Reduced world-switch time by 25%:
* World-switch time: round-trip to the Hypervisor and back
Dedicated L1 cache
- 256bit 128kB (64kB instruction/64kB data), 2-way associative
- 2 x 128bit loads/cycle
- lowest latency
Dedicated L2 cache
- 128bit 512kB, 16-way associative
- 128bit bus to northbridge
- reduced latency
- eliminates conflicts common in shared caches - better for virtualization
Shared L3 cache
- 128bit 2MB, 32-way associative
- Victim-cache architecture maximizes efficiency of cache hierarchy
- Fills from L3 leave likely shared lines in the L3
- Sharing-aware replacement policy
- Expandable
Independent DRAM controllers
- Concurrency
- More DRAM banks reduces page conflicts
- Longer burst length improves command efficiency
- Dual channel unbuffered 1066 support(applies to socket AM2+ and s1207+ QFX only)
- Channel Interleaving
Optimized DRAM paging
- Increase page hits
- Decrease page conflicts
Re-architect northbridge for higher bandwidth
- Increase buffer sizes
- Optimize schedulers
- Ready to support future DRAM technologies
Write bursting
- Minimize Rd/Wr Turnaround
DRAM prefetcher
- Track positive and negative, unit and non-unit strides
- Dedicated buffer for prefetched data
- Aggressively fill idle DRAM cycles
Core prefetchers
- DC Prefetcher fills directly to L1 Cache
- IC Prefetcher more flexible
* 2 outstanding requests to any address
HyperTransport 3
- Up to three 16bit cHT links
- Up to 5200MT/s per link
- Un-ganging mode: each 16bit HT link can be divided into two 8bit virtual links
- Can dynamically adjust frequency and bit width to save power
- AC mode (higher latency mode) to allow longer communications distances
- Hot pluggable
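
A quick aside on the "up to 24 memory accesses" figure in the Nested Paging bullet above (my own back-of-the-envelope count, not something from AMD's slides): with the usual 4-level x86-64 page tables on both the guest and host side, each of the four guest table pointers is a guest-physical address that needs its own host walk, and the final guest-physical data address needs one more.

Code:
#include <stdio.h>

int main(void)
{
    int guest_levels = 4, host_levels = 4;   /* x86-64 long-mode paging on both sides */
    /* (guest_levels + 1) host walks of host_levels steps each
       (one per guest table pointer plus one for the final data address),
       plus reading the guest_levels guest table entries themselves */
    int accesses = (guest_levels + 1) * host_levels + guest_levels;
    printf("worst-case nested walk: %d memory accesses\n", accesses);  /* prints 24 */
    return 0;
}

That is the worst case with completely cold walk caches; the hardware caching mentioned in the bullet exists precisely to cut that number down.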

Barcelona pipeline (12/18 ALU/FPU stages):

CPU Core IPC Enhancements:
Advanced branch prediction
- Dedicated 512-entry Indirect Predictor
- Doubled return stack size
- More branch history bits and improved branch hashing
History-based pattern predictor
32B instruction fetch
- Benefits integer code too
- Reduced split-fetch instruction cases
Sideband Stack Optimizer
- Perform stack adjustments for PUSH/POP operations “on the side”
- Stack adjustments don’t occupy functional unit bandwidth
- Breaks serial dependence chains for consecutive PUSH/POPs
Out-of-order load execution
- New technology allows load instructions to bypass:
* Other loads
* Other stores which are known not to alias with the load
- Significantly mitigates L2 cache latency
TLB Optimisations
- Support for 1G pages
- 48bit physical address (256TB)
- Larger TLBs key for:
* Virtualized workloads
* Large-footprint databases and
* transaction processing
- DTLB:
* Fully-associative 48-entry TLB (4K, 2M, 1G)
* Backed by L2 TLBs: 512 x 4K, 128 x 2M
- ITLB:
* 16 x 2M entries
Data-dependent divide latency
Additional fastpath instructions
- CALL and RET-Imm instructions
- Data movement between FP & INT
Bit Manipulation extensions
- LZCNT/POPCNT
SSE extensions
- EXTRQ/INSERTQ (SSE4A)
- MOVNTSD/MOVNTSS (SSE4A)
- MWAIT/MONITOR (SSE3)
Comprehensive Upgrades for SSE
- Dual 128-bit SSE dataflow
- Up to 4 dual precision FP OPS/cycle
- Dual 128-bit loads per cycle
- New vector code, SSE128
- Can perform SSE MOVs in the FP “store” pipe
- Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
- FP scheduler can hold 36 dedicated 128-bit ops
- SSE Unaligned Load-Execute mode:
* Removes alignment requirements for SSE ld-op instructions
* Eliminates awkward pairs of separate load and compute instructions
* Improves instruction packing and decoding efficiency (a minimal sketch follows this list)
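
To make that last bullet a bit more concrete, here is a minimal sketch (my own example in C intrinsics, not AMD code) of the separate load + compute pair that the unaligned load-execute mode is meant to collapse into a single ld-op instruction:

Code:
#include <xmmintrin.h>   /* SSE intrinsics */

/* Add four floats from a possibly unaligned pointer into an accumulator.
   Classic SSE:                 movups xmm0, [p]    ; separate unaligned load
                                addps  xmm1, xmm0   ; register-register compute
   With unaligned load-execute: addps  xmm1, [p]    ; load folded into the op,
                                alignment no longer required on the memory operand */
static __m128 accumulate4(__m128 acc, const float *p)
{
    __m128 v = _mm_loadu_ps(p);     /* explicit unaligned load */
    return _mm_add_ps(acc, v);      /* compute on registers */
}

int main(void)
{
    float data[5] = { 1, 2, 3, 4, 5 };
    __m128 acc = _mm_setzero_ps();
    acc = accumulate4(acc, data + 1);   /* data + 1 is generally not 16-byte aligned */
    float out[4];
    _mm_storeu_ps(out, acc);
    return (int)out[0];                 /* 2 */
}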

Most of the information is from Ben Sander's presentation at AMD FPF 2006, with additional details gathered from various internet sites.

AMD Software Optimization Guide for K10

Educational articles about K8L (K10):
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2939&p=1
http://www.xbitlabs.com/articles/cpu/display/amd-k8l.html
http://www.extremetech.com/article2/0,1697,2027644,00.asp
http://www.eetimes.com/news/semi/showArticle.jhtml;jsessionid=RI1Y1J5K4CYB2QSNDLRSKHSCJUNN2JVN?articleID=193200399
http://www.channelinsider.com/article/AMD+Unveils+Barcelona+QuadCore+Details/191008_1.aspx
http://www.realworldtech.com/page.cfm?ArticleID=RWT060206035626
http://www.theregister.co.uk/2007/02/11/amd_enhanced_powernow/
http://www.tgdaily.com/2007/02/11/amd_barcelona/

AMD official statements and public presentations:
Syndrome-oc interview with Giuseppe Amato & Philip G. Eisler
HEXUS interview with Patrick Patla, Director of AMD Server Workstation Division
Interview with Randy Allen, AMD's corporate vice president for servers and workstations
AMD Developer Day, London Dec/06/2006 presentation
Game Developers Conference 2007, Justin Boggs AMD

Roadmap (speculative):

Server:
Opteron 8272SE 2.6GHz 120W TDP, socket F, 3 cHT links, 3600MT/s, DDR2-667, Q2 2008
Opteron 8270SE 2.5GHz 95W TDP, socket F, 3 cHT links, 3400MT/s, DDR2-667, Q3 2007
Opteron 8268SE 2.4GHz 89W TDP, socket F, 3 cHT links, 3400MT/s, DDR2-667, Q3 2007
Opteron 8266 2.3GHz 89W TDP, socket F, 3 cHT links, 3200MT/s, DDR2-667, Q3 2007
Opteron 8264 2.2GHz 89W TDP, socket F, 3 cHT links, 3200MT/s, DDR2-667, Q3 2007
Opteron 8262 2.1GHz 89W TDP, socket F, 3 cHT links, 3000MT/s, DDR2-667, Q3 2007
Opteron 8260HE 2.0GHz 68W TDP, socket F, 3 cHT links, 3000MT/s, DDR2-667, Q4 2007
Opteron 8258HE 1.9GHz 68W TDP, socket F, 3 cHT links, 2800MT/s, DDR2-667, Q4 2007
Opteron 2272SE 2.6GHz 120W TDP, socket F, 2 cHT links, 3600MT/s, DDR2-667, Q2 2008
Opteron 2270SE 2.5GHz 95W TDP, socket F, 2 cHT links, 3400MT/s, DDR2-667, Q3 2007
Opteron 2268SE 2.4GHz 89W TDP, socket F, 2 cHT links, 3400MT/s, DDR2-667, Q3 2007
Opteron 2266 2.3GHz 89W TDP, socket F, 2 cHT links, 3200MT/s, DDR2-667, Q3 2007
Opteron 2264 2.2GHz 89W TDP, socket F, 2 cHT links, 3200MT/s, DDR2-667, Q3 2007
Opteron 2262 2.1GHz 89W TDP, socket F, 2 cHT links, 3000MT/s, DDR2-667, Q3 2007
Opteron 2260HE 2.0GHz 68W TDP, socket F, 2 cHT links, 3000MT/s, DDR2-667, Q4 2007
Opteron 2258HE 1.9GHz 68W TDP, socket F, 2 cHT links, 2800MT/s, DDR2-667, Q4 2007
Opteron 1370SE 2.5GHz 95W TDP, socket AM2+, 1 cHT link, 3400MT/s, DDR2-1067, 2008
Opteron 1368SE 2.4GHz 89W TDP, socket AM2+, 1 cHT link, 3400MT/s, DDR2-1067, 2008
Opteron 1366 2.3GHz 89W TDP, socket AM2+, 1 cHT link, 3200MT/s, DDR2-1067, 2008
Opteron 1364 2.2GHz 89W TDP, socket AM2+, 1 cHT link, 3200MT/s, DDR2-1067, 2008
Opteron 1362 2.1GHz 89W TDP, socket AM2+, 1 cHT link, 3000MT/s, DDR2-1067, 2008

Desktop:
AgenaFX 2.6GHz, unknown TDP, socket 1207+ (Quad FX), 2 cHT links, 3600MT/s, DDR2-1067, Q3 2007
AgenaFX 2.4GHz, unknown TDP, socket 1207+ (Quad FX), 2 cHT links, 3600MT/s, DDR2-1067, Q3 2007
AgenaFX 2.4GHz, unknown TDP, socket 1207+ (Quad FX), 2 cHT links, 3200MT/s, DDR2-1067, Q3 2007
AgenaFX 2.2GHz, unknown TDP, socket 1207+ (Quad FX), 2 cHT links, 3200MT/s, DDR2-1067, Q3 2007
AgenaFX 2.4GHz, unknown TDP, socket AM2+, 2 cHT links, 3200MT/s, DDR2-1067, Q3 2007
AgenaFX 2.2GHz, unknown TDP, socket AM2+, 2 cHT links, 3200MT/s, DDR2-1067, Q3 2007
Agena 2.4GHz, 89W TDP, socket AM2+, 1 cHT link, 3600MT/s, DDR2-1067, Q4 2007
Agena 2.2GHz, 89W TDP, socket AM2+, 1 cHT link, 3200MT/s, DDR2-1067, Q4 2007
Kuma(dualcore) 2.8GHz, 89W TDP, socket AM2+, 1 cHT link, 4200MT/s, DDR2-1067, Q4 2007
Kuma(dualcore) 2.6GHz, 65W TDP, socket AM2+, 1 cHT link, 3800MT/s, DDR2-1067, Q4 2007
Kuma(dualcore) 2.4GHz, 65W TDP, socket AM2+, 1 cHT link, 3600MT/s, DDR2-1067, Q4 2007
Kuma(dualcore) LP 2.3GHz, 45W TDP, socket AM2+, 1 cHT link, 3400MT/s, DDR2-1067, Q1 2008
Kuma(dualcore) LP 2.1GHz, 45W TDP, socket AM2+, 1 cHT link, 3000MT/s, DDR2-1067, Q1 2008
Kuma(dualcore) LP 1.9GHz, 45W TDP, socket AM2+, 1 cHT link, 2800MT/s, DDR2-1067, Q1 2008


Roadmap sources:
http://www.dailytech.com/Final+AMD+Stars+Models+Unveiled+/article7157.htm?www.reghardware.co.uk
http://www.dailytech.com/More+Details+on+AMD+Stars+Chipsets/article7147.htm (2007-05-03)
http://www.hkepc.com/bbs/itnews.php?tid=746541&starttime=0&endtime=0 (2007-04-15)
http://www.cpilive.net/v3/inside.aspx?scr=n&NID=1320 (2007-04-15)
http://trackingamd.blogspot.com/2007/02/barcelona-model-numbers-revealed.html (2007-02-23)
http://www.hkepc.com/bbs/itnews.php?tid=709944
http://www.dailytech.com/AMD+Quadcore+Opteron+Models+Unveiled/article5992.htm


P.S. Any additional data or information will be highly appreciated
February 9, 2007 3:19:40 PM

Comprehensive list, good job.
February 9, 2007 3:54:40 PM

yeah, thanks.
February 9, 2007 4:11:11 PM

Nice compilation gOJDO :!:
I don't pretend to understand what all that will mean in terms of processing power.

Your opinion if you please, do you think it will be the monster they claim ?

Also curious: expandable L3... at the fab or could a user plug in a microSD card next to the cpu socket 8O ?
February 9, 2007 4:11:36 PM

the info is great! thanx alot!
February 9, 2007 4:28:35 PM

Quote:
Nice compilation gOJDO :!:
I don't pretend to understand what all that will mean in terms of processing power.

Your opinion if you please, do you think it will be the monster they claim ?

Yes, it will be a monster, for sure. It is a quadcore after all. How it will compete against the current Core2 Quad depends on its frequency. IMO, clock for clock both will perform similarly. One will be marginally faster than the other. K8L will be an FPU monster and will spank Core2 in FP calculations, but it will lose in ALU calculations. Both will have roughly the same SSE performance.

Quote:
Also curious: expandable L3... at the fab or could a user plug in a microSD card next to the cpu socket 8O ?
The design of the L3 cache logic supports more cache. In 2008, the 45nm K8L shrink will have 6MB of L3 cache.
February 9, 2007 5:02:58 PM

I am already adding these to my upgrade options for 2010 when they will be <$200 :) 
February 9, 2007 5:39:01 PM

Mmm... very thorough list there... and from the looks of things, maybe K10 will be quite something when it is released. Let's just hope AMD doesn't botch the release though...
February 9, 2007 5:58:24 PM

Quote:

Dedicated L1 cache
- 256bit 64kB (32kB instruction/32kB data)
- 2 x 128bit loads/cycle
- lowest latency
Dedicated L2 cache
- 128bit 512kB
- 128bit bus to northbridge
- reduced latency
- eliminates conflicts common in shared caches - better for virtualization
Shared L3 cache
- 128bit 2MB
- Victim-cache architecture maximizes efficiency of cache hierarchy
- Fills from L3 leave likely shared lines in the L3
- Sharing-aware replacement policy
- Expandable
Independent DRAM controllers
- Concurrency
- More DRAM banks reduces page conflicts
- Longer burst length improves command efficiency
- Dual channel unbuffered 1066 support(applies to socket AM2+ and s1207+ QFX only)
- Channel Interleaving


I'm questioning the accuracy of this list because of a few odd details:

1) There was a source which misstated the amount of L1 cache as less than in the K7 and K8 cores. An AMD spokesperson came out and said the L1 cache would most likely be the same size as before. It's been 64K data + 64K instruction since the first Athlons.

2) Referring to L1 as 256-bit and L2/L3 as 128-bit doesn't make technical sense. Exactly what points do the data busses connect? I presume that the L1-L2 bus used to be 64-bit in K7 and soon it'll be 256-bit as in the P4/C2D, a healthy improvement. I also know that since the K8, the cache-IMC bus has been 128-bit, but we all know DRAM sends this data much slower than core clock rate.

3) L3 victim cache, as far as I know, is just another way to say exclusive cache. Both the external and integrated K7 L2 SRAM were also victim caches. There are advantages and disadvantages to this setup. Of course, considering the 2 MB of L2 in quad-core, an inclusive L3 at 2 MB would be of dubious utility.

4) The dual-controller IMC is a thoughtful design that I'd like to see some confirmation on. It seems hard to design without introducing latency relative to a synchronous dual-channel controller.
February 9, 2007 6:01:16 PM

I can't wait to see single-threaded app performance with all that L3 dedicated to one core. If it all fit inside the cache, it would be game over.

That being said...

...it's a quad-core part, I can't wait to see the multithreaded performance :drool:
February 9, 2007 8:38:06 PM

Quote:
I'm questioning the accuracy of this list because of a few odd details:

1) There was a source which misstated the amount of L1 cache as less than in the K7 and K8 cores. An AMD spokesperson came out and said the L1 cache would most likely be the same size as before. It's been 64K data + 64K instruction since the first Athlons.

2) Referring to L1 as 256-bit and L2/L3 as 128-bit doesn't make technical sense. Exactly what points do the data busses connect? I presume that the L1-L2 bus used to be 64-bit in K7 and soon it'll be 256-bit as in the P4/C2D, a healthy improvement. I also know that since the K8, the cache-IMC bus has been 128-bit, but we all know DRAM sends this data much slower than core clock rate.

3) L3 victim cache, as far as I know, is just another way to say exclusive cache. Both the external and integrated K7 L2 SRAM were also victim caches. There are advantages and disadvantages to this setup. Of course, considering the 2 MB of L2 in quad-core, an inclusive L3 at 2 MB would be of dubious utility.

4) The dual-controller IMC is a thoughtful design that I'd like to see some confirmation on. It seems hard to design without introducing latency relative to a synchronous dual-channel controller.



1) It's still 64K/64K

2) The L1 is bidirectional so it can read and write 128bit blocks.

3) The L3 can dump cache anytime so it will be very useful. If the load is divided evenly across cores (or close to even) then each core can use and flush 512K without going to main memory.

4) The dual controller will help with simultaneous accesses. If core 0 needs data and core 3 needs data they can both access RAM at the same time.
February 9, 2007 8:48:58 PM

Well, all I can say is "HOLY SH!T!!!"; that picture is the end of the world,... I feel bad,... I want this thing when it comes out,... but I know I can't have it that soon :cry:
siiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiigh
February 9, 2007 9:40:28 PM

Quote:
Yes, it will be a monster, for sure. It is a quadcore after all. How it will compete against the current Core2 Quad depends on its frequency. IMO, clock for clock both will perform similarly. One will be marginally faster than the other. K8L will be an FPU monster and will spank Core2 in FP calculations, but it will lose in ALU calculations. Both will have roughly the same SSE performance.

Yes, I would agree with this. I was initially only estimating a 10% improvement in Integer IPC. The talk before was that this was more of an upgrade to K8 than a new core. However, after seeing all of the changes, it clearly is a new core as different from K8 as K8 was from K7. I'm now thinking that it will get close to C2D in Integer IPC but I'm guessing there will still be times when 4 instruction issue and Macro Ops Fusion prevails. Since K8 is faster in FP now the new hardware will only make it that much faster. I agree that SSE should be pretty close although C2D could have a small advantage for some types of carefully tuned data and SSE instructions. Generally, higher clock should prevail.

The L3 is not simply exclusive as the L2 is. The L3 includes a sharing prediction buffer and is capable of guessing whether data should be shared. I believe it is initially liberal with instructions and assumes inclusive and conservative with data and assumes exclusive.
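
Purely as an illustration of the copy-versus-move distinction (toy pseudo-logic of my own, not AMD's actual mechanism), the difference between a plain victim cache and the sharing-aware behaviour comes down to whether a line that hits in L3 is dropped from L3 when it is filled back into a core's caches:

Code:
#include <stdio.h>

struct l3_line {
    int valid;
    int predicted_shared;   /* set by some sharing-history heuristic */
};

/* Returns 1 if the line should be dropped from L3 on a fill (pure victim /
   exclusive behaviour), 0 if it should be left in place for the other cores. */
static int evict_on_fill(const struct l3_line *line)
{
    return line->predicted_shared ? 0 : 1;
}

int main(void)
{
    struct l3_line shared_line = { 1, 1 }, private_line = { 1, 0 };
    printf("shared line evicted on fill?  %d\n", evict_on_fill(&shared_line));   /* 0: copy, stays in L3 */
    printf("private line evicted on fill? %d\n", evict_on_fill(&private_line));  /* 1: move, slot reclaimed */
    return 0;
}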

Also, the Crossbar isn't exactly the same as a Northbridge since it not only handles communication with the IMC but the HT links and data transfers between (among) cores as well.

There are only a few updates that I can see for the initial list beyond the correction that the L1 is the same size as for K8.

HyperTransport 3.0
The unganged mode has two channels, not two virtual channels. Basically, a single HT controller is capable of automatically sensing that each half is connected to a separate destination and can use them as two separate channels.

Also:
- Power Management Enhancements - Can dynamically adjust frequency and bit width to save power.
- AC mode - Higher latency mode to allow longer communications distances. It is capable of auto-sensing this.
- Hot Pluggable.

My compliments on the detail and accuracy of the list.
February 9, 2007 10:21:16 PM

I thought that you were an Intel fanboy, but I guess I was wrong. Great info, it must have taken a great deal of time and effort to compile it. I'm sure a lot of people will appreciate it (if they are anything like me, eager to hear news about new technology). I hope AMD outperforms Intel, and then Intel again does the same to AMD, and we, the users, keep profiting from their competition :) 
Pozdrav!
February 9, 2007 10:35:48 PM

gOJDO,

Good job.

I like what you have done with the place :) .

Quick, Concise, Accurate all words that could be used.

I second the Sticky!!!

Of course as Jack added this would then burden you with updates as info comes in.

With the updates to the following:

Branch Prediction
Sideband Stack Optimization
OOOE
New SSE instruction support
and last but not least L3 Cache with Share support

This thing looks like it has a very real shot at taking back the Crown. They need to release as soon as possible for two reasons:

1) Intel will be close on their heels.
2) I want to play with one :) 
February 9, 2007 11:04:52 PM

Quote:
HyperTransport 3
- up to four 16bit cHT links

HyperTransport 3.0 itself doesn't have any limitation on the number of HT links; this is a function of the cpu hardware. The current design for Barcelona only has 3 HT links (DC 1.0). AMD won't have 4 HT links until it releases Direct Connect architecture 2.0 in 2008. DC 2.0 will also include firmware for keeping track of up to 32 connections. The current firmware can only keep track of 8.

Strictly speaking, all HT links on K8 are cache coherent however on socket 939 and 754 two of the HT buses are left unconnected to the socket (and presumably this will be the case on AM as well). On socket 940 and socket F all HT links are connected so the number of cache coherent links is determined by two license bits on the die. Currently the values are: none, one, and all with the extra value undefined. Single socket processors like Sempron, Athlon 64, and Opteron 1xxx have a cache coherent HT value of "none".

I believe that originally HT 3.0 and DC 2.0 were to be released at the same time however I think AMD moved up the time table for HT 3.0 so some of those benefits may arrive before DC 2.0.

With HT 1.0 or 2.0 the maximum is three 16 bit ccHT on DC 1.0. With HT 3.0 this would be three 16bit ccHT or up to six 8bit ccHT or any combination in between of twin 8's or single 16's. With DC 2.0 and HT 3.0 you can have four 16 bit ccHT or up to eight 8bit ccHT.
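
For reference, the raw per-link numbers behind those configurations are simple to work out (plain transfers-per-second times link width, per direction, using only the 5200MT/s rate quoted in the list above; nothing here beyond that arithmetic):

Code:
#include <stdio.h>

int main(void)
{
    double mts = 5200e6;          /* top HT 3.0 transfer rate per link */
    int widths[] = { 16, 8 };     /* full link vs. unganged half link */
    for (int i = 0; i < 2; i++) {
        double gbs = mts * widths[i] / 8.0 / 1e9;   /* one direction */
        printf("%2d-bit link @ 5200 MT/s: %.1f GB/s per direction (%.1f GB/s both ways)\n",
               widths[i], gbs, 2.0 * gbs);
    }
    return 0;
}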

Quote:
Nope.... gOJDO is simply one who despises misinfo. and curdled knowledge and confronts it forcibly.

Yes, I can identify with that.
February 9, 2007 11:12:21 PM

Will also be interesting to see how Vista's support for NUMA will impact this processor with its updates.

Given it gave a slight boost to the QFX it may be what is needed to help AMD over that performance crown edge..
February 9, 2007 11:15:02 PM

Quote:
Yes, it will be a monster, for sure. It is a quadcore after all. How it will compete against the current Core2 Quad depends on its frequency. IMO, clock for clock both will perform similarly. One will be marginally faster than the other. K8L will be an FPU monster and will spank Core2 in FP calculations, but it will lose in ALU calculations. Both will have roughly the same SSE performance.

Yes, I would agree with this. I was initially only estimating a 10% improvement in Integer IPC. The talk before was that this was more of an upgrade to K8 than a new core. However, after seeing all of the changes, it clearly is a new core as different from K8 as K8 was from K7. I'm now thinking that it will get close to C2D in Integer IPC but I'm guessing there will still be times when 4 instruction issue and Macro Ops Fusion prevails. Since K8 is faster in FP now the new hardware will only make it that much faster. I agree that SSE should be pretty close although C2D could have a small advantage for some types of carefully tuned data and SSE instructions. Generally, higher clock should prevail.



The L3 is not simply exclusive as the L2 is. The L3 includes a sharing prediction buffer and is capable of guessing whether data should be shared. I believe it is initially liberal with instructions and assumes inclusive and conservative with data and assumes exclusive.

Also, the Crossbar isn't exactly the same as a Northbridge since it not only handles communication with the IMC but the HT links and data transfers between (among) cores as well.

There are only a few updates that I can see for the initial list beyond the correction that the L1 is the same size as for K8.

HyperTransport 3.0
The unganged mode has two channels, not two virtual channels. Basically, a single HT controller is capable of automatically sensing that each half is connected to a separate destination and can use them as two separate channels.

Also:
- Power Management Enhancements - Can dynamically adjust frequency and bit width to save power.
- AC mode - Higher latency mode to allow longer communications distances. It is capable of auto-sensing this.
- Hot Pluggable.

My compliments on the detail and accuracy of the list.


I guess the world is coming to an end because I believe that was posted on Sharikou's blog or the Tracking AMD blog.

Still a good post and fits with what I say EVERYDAY.
February 9, 2007 11:17:46 PM

Scientia,

Thanks for being forward about who you are (in reference to your Nick).

Others have come on THG and other forums while not letting folks know who they are and what their stance is. In effect backdooring the people of these forums.

Being up front with who you are, is I think, the best thing you could have done to show respect to both yourself and to the people of THG.

Thank you.
February 9, 2007 11:21:14 PM

Quote:
Will also be interesting to see how Vista's support for NUMA will impact this processor with its updates.

Given it gave a slight boost to the QFX it may be what is needed to help AMD over that performance crown edge..



I would say that Vista gives more than a slight boost. Mainly because it actually powers down the chips properly. In order to really use QFX you need Vista Ultimate X64 and at least 4GB RAM (well, that's actually the max right now).

Then Vista's SuperFetch mechanism and ReadyBoost will be (IMHO) twice as good for keeping all processes on one bank of RAM for single threaded apps.

VooDooPC did wait for Vista before releasing their QFX system. And soon with 4cores per socket, one chip will use hardly anything most of the time.

I have Vista 32bit installed and my psychic connection is much clearer than with XP as it was with XP X64.

Aaaah, the power of the mind.
February 9, 2007 11:38:42 PM

Another interesting part of this info is the Bit Manipulation Extensions.

This could in effect help with a problem that is rampant "out there" right now.

Many devs do not take "true performance" into consideration. They know machines are faster and have larger caches of memory/HDD for use. This can make them sloppy. They in effect often do not code specific to the machine they are on even in performance oriented apps.

Adding instructions to speed bit manipulation/shifting could actually help.

Used to be that some systems did or did not bit shift well. In fact some systems could be crippled in a bit shifting/non word aligned environment (please see the DEC Alpha series of 64bit procs).

The Alphas in a word aligned application would actually (in some instances) provide a factored increase in speed of execution. The same app NOT word aligned (bit manip) would slow it so drastically that it would nearly cripple the app.

It seems that AMD has recognized the trend (who really cares about word alignment these days) and has put something in to maybe help with this.

By the way at the same time of the DEC Alphas being crippled the IBM RISC 6000s would rip through bit shifting. Although when word aligned the Alpha would steal and eat the 6000s lunch :) .

I would like to read more about what they are implementing/planning to implement here.
February 9, 2007 11:45:43 PM

Quote:
Another interesting part of this info is the Bit Manipulation Extensions.

This could in effect help with a problem that is rampant "out there" right now.

Many devs do not take "true performance" into consideration. They know machines are faster and have larger caches of memory/HDD for use. This can make them sloppy. They in effect often do not code specific to the machine they are on even in performance oriented apps.

Adding instructions to speed bit manipulation/shifting could actually help.

Used to be that some systems did or did not bit shift well. In fact some systems could be crippled in a bit shifting/non word aligned environment (please see the DEC Alpha series of 64bit procs).

The Alphas in a word aligned application would actually (in some instances) provide a factored increase in speed of execution. The same app NOT word aligned (bit manip) would slow it so drastically that it would nearly cripple the app.

It seems that AMD has recognized the trend (who really cares about word alignment these days) and has put something in to maybe help with this.

By the way at the same time of the DEC Alphas being crippled the IBM RISC 6000s would rip through bit shifting. Although when word aligned the Alpha would steal and eat the 6000s lunch :) .



Interesting, you bring up the Alpha. Word alignment was a major problem when Alpha was still available for Win2000 (perhaps one of the reasons it was canceled). And these were device drivers. I can imagine the alignment issues for RAM-intensive apps.

8 bytes CAN'T be that bad.

Anyway, I have always said Barcelona would be a beast compared to K8.
February 9, 2007 11:48:05 PM

Quote:
Scientia,

Thanks for being forward about who you are (in reference to your Nick).

Others have come on THG and other forums while not letting folks know who they are and what their stance is. In effect backdooring the people of these forums.

Being up front with who you are is I think the best thing you could have done to show respect to both yourself and to the people of THG.

Thank you.


Seconded.

You guys are giving me a warm, fuzzy feeling inside.
February 9, 2007 11:51:52 PM

Quote:

Anyway, I have always said Barcelona would be a beast compared to K8.


Problem is Baron,

Given the lead that the Conroe took, Barcelona does not need to be a beast compared to K8. It needs to be a beast compared to the current gen Core 2 Duos/Quads.

Given Intel's roadmap the Barcelona really does need to be a performer. I hope it is.

This info looks very promising.

PS..

How many devs (outside of enterprise level DB query devs) do you know that currently know anything about optimization? Many, I bet, have never even heard of O2 optimization or know what it means.
February 9, 2007 11:55:20 PM

AJ,

Scientia carries a significant presence in AMDZone. He was forthright in letting people know that.

Not like some others who have an obvious opinion but do not state their affiliation/s.

That is all I was recognizing.

Similar to JkFlipFLop who obviously does not like John Kerry :)  but also has been up front about being an Intel employee.

Edited to add: NMDante as he has been up front about being an Intel employee as well.
February 9, 2007 11:56:45 PM

Quote:
You guys are giving me a warm, fuzzy feeling inside.

Kum Bah Ya :)  Everyone is being so nice! Hopefully, this trend will be contagious! :wink:

Very informative thread! :) 
February 9, 2007 11:58:35 PM

Quote:
AJ,

Scientia carries a significant presence in AMDZone. He was forthright in letting people know that.

Not like some others who have an obvious opinion but do not state their affiliation/s.

That is all I was recognizing.

Similar to JkFlipFLop who obviously does not like John Kerry but also has been up front about being an Intel employee.


I know who Scientia is, I was just making a little joke. It's all good, 8)

I also enjoy it when people get along.

As to K10 needing to be a beast compared to C2D...it won't be. It will be on par at least, no doubt, on the desktop side at least until Intel can crank the clock speed up (and then maybe AMD will be able to do the same?). On the server side, Intel has a much, much tougher task set for it. Current Opteron systems are still competitive in the 4S+ space, Barcelona will be the new 800-pound gorilla.
February 10, 2007 12:16:48 AM

@Scientia
First of all, welcome to THG. I had very low expectations about you. I am glad that I was wrong. At least you have proven that you are a man with balls, unlike the sock puppet noobs who came only to troll and spam here. You might like AMD, but you know your stuff and I respect that. I hope that you will make THG better, balancing the C2 euphoria with your knowledge about AMD. Thank you for your input in this thread.

@Jack
I will ask Jake to make this thread sticky and I will update it with any useful data. Also I would love to see any benchmarks. Someone who has benched an ES (but can't publish the results) told me that it has wonderful power management and that it outperforms Clovertown in FP by 40%, clock for clock.

@Baron
About the L1, I found two different figures: it will be 128kB (64kB i/64kB d) or it will be 64kB (32kB i/32kB d). According to AMD slides it is 64kB total. Can you provide any data that will determine which is true?
February 10, 2007 1:12:23 AM

VERY good job gOJDO. I was getting sick of the Intel-AMD battles. 8)
February 10, 2007 1:21:19 AM

Thank you. I know that the subject of whether I've posted here before under an alias has been repeated many times but, no, this is the first time. I got fed up with anonymous posting on my blog and Sharikou's blog. Anonymous posting seems to just make it easier to flame and removes any accountability from the poster. Also, I didn't care for the spoof posting that Sharikou180 has been doing (posting with someone else's ID). Any person with integrity will always own what they say even if they say something really stupid or incorrect. You say, "Oops, I was wrong" and then move on.
------------

To be honest, I've never had a lot of faith in MS's ability to do robust load sharing and management on multi-cpu systems. It is possible that MS could get it right with Vista but I think it is more likely that versions of Unix and Linux will remain the standard on server applications.

MS has a great history of snatching knowledge from other areas as they did when they bought MSDOS from another company, and then with Xerox PARC and MacOS for Windows, and then VMS for Windows NT. The original team leader for Windows was from Xerox PARC and after he quit Windows was completed by MS's Mac applications team leader. The lead for Windows NT designed VMS which is why the initials are one letter higher V->W, M->N, and S->T. I guess they copied that from HAL. Maybe they'll get it right this time. However, I guess any improvement would be good.

I don't believe the bit instructions LZCNT/POPCNT have anything to do with alignment. I believe POPCNT counts the number of 1 bits in a word and LZCNT gives the number of leading zeros. Itanium has POPCNT which makes it much faster on certain parity benchmarks. However, several of the other hardware improvements should indeed help with alignment, including the prefetch buffer which at 32 bytes is large enough to not break long instructions in half. I recall when the Motorola 68000 had to have instructions and data aligned on word boundaries.
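
For anyone who hasn't bumped into them, this is roughly what the two instructions compute. The GCC builtins below are just convenient stand-ins (my own example, nothing AMD-specific); on hardware with the instructions they compile down to POPCNT/LZCNT, otherwise to a short bit-twiddling sequence:

Code:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t x = 0x00F0F0F0F0F0F0F0ULL;   /* must be non-zero for the clz builtin */
    printf("popcount = %d\n", __builtin_popcountll(x));  /* number of 1 bits: 28 */
    printf("lzcnt    = %d\n", __builtin_clzll(x));       /* leading zero bits: 8 */
    return 0;
}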

Yes, in terms of performance it looks like:

K8 -> K10

will be the same jump as

Yonah -> Core 2 Duo

or K7 -> K8

There is definitely some convergence going on with Intel and AMD architecture. For example it is interesting the way that Intel used a hybrid bus on Tulsa to increase performance; this is a half step to native quad. With the large number of similarities between K10 and the C2D architecture you have to wonder how similar they will be when Intel has both IMC and P2P and both are producing quad cores on 45nm. Intel and AMD also both appear to be pursuing GPU based computation.
February 10, 2007 1:53:26 AM

Quote:
Thank you for your input in this thread.

You're welcome.

Quote:
About the L1, I found two different figures: it will be 128kB (64kB i/64kB d) or it will be 64kB (32kB i/32kB d). According to AMD slides it is 64kB total. Can you provide any data that will determine which is true?

It would seem unlikely to me that AMD would suddenly reduce L1 size. It has been the same ever since K7. Also, to reduce cache size AMD would need to make it greater than 2-way associativity to maintain the same performance. And, on the die shots the L1 cache size seems the same. I suppose it is possible that they've changed it but I haven't thought of any reason that would make it worth the effort.
February 10, 2007 3:29:59 AM

Great posts guys/gals. :) 

Anyway, I posted this over at xtremesystems, but I'll post it here too. Strictly speculation though, so I don't know if it belongs.
IMO K10's frequency numbers seem low. I've read on several occasions about how well IBM's 65nm process is scaling. Of course it's a different design, and call it a hunch, but I'm thinking Barcelona is going to clock a lot higher. I don't remember seeing much documentation with an AMD stamp on it (such as the ones gOJDO posted) about clocks. I can't imagine AMD designed silicon that would only scale a few hundred MHz for their server chips.

[edit]
Unless of course those few hundred MHz make a sizable performance increase.
February 10, 2007 5:02:39 AM

I don't pretend to understand anything in here, but hopefully this thing overclocks like a dog.
February 10, 2007 5:04:42 AM

Core Duo -> C2D was actually a fairly small performance difference in a lot of benchmarks. 1% ~ mid teens percent, and that was heavy multimedia work.
February 10, 2007 5:43:36 AM

As long as Intel is on bulk silicon while AMD is on SOI, Intel chips will always have more overclocking potential. It is the nature of bulk silicon that you have to allow a bigger thermal margin while SOI can be close to the limit. Because of the larger margin Intel chips can usually be clocked higher when done carefully while AMD chips with lower margin cannot. Intel will lose this margin if they move to SOI.

Remember Flippies and HD punches? Intel's margin is the same; usually it can be used but it isn't guaranteed by the factory.
February 10, 2007 7:56:30 AM

Quote:
How it will compete against current Core2 Quad depends on its the frequency. IMO, clock for clock both will perform similar.


Superb post, gOJDO. Given the fact that the current roadmaps show that K10 will not be issued in any frequency over 2.5-2.6GHz for the next 18 months, do you believe that there is anything in the specs that can substantiate the bandied 40-70% performance increase over C2Q?
February 10, 2007 8:31:21 AM

Quote:
How it will compete against current Core2 Quad depends on its the frequency. IMO, clock for clock both will perform similar.


Superb post, gOJDO. Given the fact that the current roadmaps show that K10 will not be issued in any frequency over 2.5-2.6GHz for the next 18 months, do you believe that there is anything in the specs that can substantiate the bandied 40-70% performance increase over C2Q?
We don't know for sure what the release frequencies of Barcelona will be, because we don't have any official AMD info. According to HKEPC and DailyTech the highest clocked Barcelona (quadcore) this year will be 2.5GHz. It will have more than 40% more memory bandwidth per CPU than Clovertown. I am very certain that it will have a faster FPU, but I am not sure by how much. I have some info that clock for clock its FPU is 40% faster, but I don't know how the 40% is measured. So, it is possible that it will perform 40% faster for certain kinds of software, maybe even more if the software is bandwidth dependent. But most server software is much more INT dependent. Compared to K8 (still IMO), K8L will have a 5% to 10% faster ALU, but still slower than C2, clock for clock.
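
For a bit of context on the bandwidth side of that, the theoretical peaks of a dual-channel unbuffered DDR2 controller are straightforward arithmetic (transfers/s x 8 bytes x 2 channels; the comparison against Clovertown's FSB/FBDIMM arrangement is messier and I am not attempting it here):

Code:
#include <stdio.h>

int main(void)
{
    double rates_mts[] = { 667e6, 800e6, 1066e6 };   /* DDR2-667 / 800 / 1066 */
    for (int i = 0; i < 3; i++) {
        double gbs = rates_mts[i] * 8 /* bytes per transfer */ * 2 /* channels */ / 1e9;
        printf("DDR2-%4.0f dual channel: %4.1f GB/s peak per socket\n",
               rates_mts[i] / 1e6, gbs);
    }
    return 0;
}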
February 10, 2007 9:16:05 AM

40% could also be performance per watt. For example, it is possible that AMD is comparing an Opteron system to a dual or quad FSB Clovertown system with FBDIMM and ending up with 29% less power consumption. As I recall, AMD specifically said that the number could include performance per watt. Or it could be a combination.

BTW, I wouldn't expect higher clocks in 2007. Barcelona uses a different transistor design than Brisbane and it will take some time to bring the speed up. It is possible that they could be bumping the speed by the time they begin 45nm. Supposedly, Barcelona is designed for a 45nm shrink, not to mention that K10 will be modular by the time DC 2.0 is released. I would say that any clock increases will either be with Brisbane or not until 2008 with K10.
February 10, 2007 9:19:39 AM

Good point, gOJDO and Scientia. I guess that the only way to really find out is when they are actually physically benchmarked by an independent third party. Until then, it's all just vapourware speculation.

Don't get me wrong, I am seriously in the market for dual quads and would love nothing better than to go AMD (I'm on a San Diego 3700 right now). But price/performance ratio is my religion and I'm just gonna have to make a rational decision when these puppies are out and the initial price burst is over.
February 10, 2007 11:16:41 AM

This slide:

confuses me. I found a lot of articles stating that the L1 would be 128kB, same as on K8. So:
1. the 64kB L1 on the AMD slide is a typo or
2. the L1 is 4-way

from this thread:
Quote:
The L3 cannot be hooked to L2 via the Xbar. If it were configured like this the normal traffic between L2 and L3 would consume too much bandwidth on the Xbar.

The L3 is not connected to L2 at all. It is connected to the L1. But how?
If it is via the Xbar, then it would have to share the same bus with the data from RAM. Doesn't sound very logical to me.
I think that the L1 has a 128bit bus to L2 and a 128bit bus to L3. This also makes the 128bit (64bit per direction) L2 to L1 bus sound more logical, because otherwise the 128bit bus would be a bottleneck for the 256bit L1. The L1 can do two 128bit loads per cycle, one load and one write per cycle, or only one write per cycle. So, here goes the logic: one load from L2, the other from L3. L2 is used for storing the executed code. It is exclusive, thus the stored data can't exist in L3 (which could be inclusive, but the exclusivity of L2 prevents data duplication) at the same time. So, what do you think?
February 10, 2007 11:33:40 AM

I think the answer is in the fact that this slide is much older and less accurate than the other data you submitted.
February 10, 2007 11:56:54 AM

The slide is from the same presentation by Ben Sander, 10/10/2006, "AMD FPF 2006"
February 10, 2007 12:33:34 PM

Yes, just that; a PRESENTATION, and that diagram with thin lines has a pretty childish look, so I think the problem is there; a typo.
February 10, 2007 1:05:30 PM

Quote:
This slide:

confuses me. I found a lot of articles stating that the L1 would be 128kB, same as on K8. So:
1. the 64kB L1 on the AMD slide is a typo or
2. the L1 is 4-way

In the context of this slide, it is possible the slide is describing the L1 cache for data and not the L1 cache for instruction.
February 10, 2007 9:06:50 PM

That is basically the same diagram as Phil Hester's June Analyst Day presentation, page 14.

As far as I can tell, the diagram simply didn't have enough room to show both the L1 Instruction and L1 Data caches. I believe the diagrams were simplified to make it easier to show the 3 levels of cache. I would say that the diagram only shows one of the two L1 caches and therefore does not show total L1 size.
February 10, 2007 10:32:08 PM

What do you think about the L1 to L2 and the L1 to L3 buses?
Do you think that L2 is connected to L3 at all?
February 10, 2007 10:52:10 PM

Yes, I have a lot of respect for Hans. He was the one who refuted the notion that K8L had 4 instruction issue. It is too bad that he stopped posting technical articles on his website. Those were first rate. I know of another gentleman who did very detailed analysis of memory and cache on AMD and Intel processors but then he stopped when K8 was introduced.

That thread you linked to was brutal. It had already deteriorated into flaming by the 2nd page and just got worse from there. Yes, it does have real information in it but the flames and personal attacks really knock things down. I once described that on AMDZ as like eating ice cream with sawdust in it. This thread is going smoothly with lots of information and discussion and no flames and that makes a huge difference in how easy it is to read.

Yes, everything I've seen would suggest 64KB each for Data and Instructions. BTW, I've just seen that AMD's latest press release is Ultimate Datacenter Performance-Per-Watt. This again makes me think that the 40% number is mostly power draw and not processor speed.
February 10, 2007 11:06:23 PM

Quote:
What do you think about the L1 to L2 and the L1 to L3 buses?
Do you think that L2 is connected to L3 at all?

I can see why L3 would only need to be connected to L1 when L3 gets a hit. There wouldn't be any reason to pass this to L2. However, L3 gets updated when L2 overflows. So, it would seem that L2 would need to be connected to L3 to transfer whatever gets pushed out.

However, from the diagram it rather looks like AMD has some kind of crossbar switch on the L1 cache controller. Perhaps this is the most flexible arrangement if not the most direct.
February 10, 2007 11:10:23 PM

Quote:
However, L3 gets updated when L2 overflows. So, it would seem that L2 would need to be connected to L3 to transfer whatever gets pushed out.

Why push it out to L3? The L2 is directly connected to the ODMC, bypassing the crossbar.