
Collection of AMD K10 data

Profile: Faithful Poster

http://origin.arstechnica.com/staff/carthage.media/barcelona.gif

Barcelona ES:http://img259.imageshack.us/img259/4808/barcelonadieshotbygojdocf8.jpg

Barcelona die (11 metal layers, 283mm^2, 463M transistors):http://pc.watch.impress.co.jp/docs/2007/0214/kaigai_01l.gif

Barcelona wafer:http://img403.imageshack.us/img403/2628/barcelonawaferns2.jpg


Macro/micro-architectural improvements over K8:

Quad-core
- Native quad-core design
- Redesigned and improved crossbar(northbridge)
- Improved power management
- New level of cache added, L3 VICTIM
Power management - DICE(Dynamic Independent Core Engagement)
- A separate PLL for each core: cores are clocked independently and vary their clock speed depending on usage.
- ODMC power management: ability to shut down the read channels when memory traffic is write-only, and vice versa:
* Reduces the power consumption of the memory controller by up to 80% on "many" workloads.
- Aggressive grained clock gating
- Power management state invariant time stamp counter (TSC)
- Enhanced AMD PowerNow! - works independently, without OS driver support
Virtualization improvements
- Nested Paging(NP):
* Guest and host page tables both exist in memory (the CPU walks both page tables).
* A nested walk can take up to 24 memory accesses! (Hardware caching accelerates the walk.)
* "Wire-to-wire" translations are cached in the TLBs
* NP eliminates hypervisor cycles spent managing shadow pages (as much as 75% of hypervisor time)
- Reduced world-switch time by 25%:
* World-switch time: round trip to the hypervisor and back
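
The "up to 24 memory accesses" figure for a nested walk can be reproduced with simple counting. Every guest page-table pointer is a guest-physical address that must itself be translated through the host tables. A minimal sketch, assuming 4 guest levels and 4 host levels and no caching at all (the function name is made up for illustration, not taken from AMD's material):

```python
def nested_walk_accesses(guest_levels: int = 4, host_levels: int = 4) -> int:
    """Worst-case memory accesses for one nested page walk with no caching."""
    # Each guest page-table entry sits at a guest-physical address, so reading it
    # costs one full host walk (host_levels accesses) plus the read itself.
    per_guest_level = host_levels + 1
    # The final guest-physical address of the data needs one more host walk.
    final_translation = host_levels
    return guest_levels * per_guest_level + final_translation


print(nested_walk_accesses())  # 4 * (4 + 1) + 4 = 24
```

This is why the hardware caching of intermediate translations matters: without it, every TLB miss in a guest would pay the full count.
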
Dedicated L1 cache
- 256bit 128kB (64kB instruction/64kB data), 2-way associative
- 2 x 128bit loads/cycle
- lowest latency
Dedicated L2 cache
- 128bit 512kB, 16-way associative
- 128bit bus to northbridge
- reduced latency
- eliminates conflicts common in shared caches - better for virtualization
Shared L3 cache
- 128bit 2MB, 32-way associative
- Victim-cache architecture maximizes efficiency of cache hierarchy
- Fills from L3 leave likely shared lines in the L3
- Sharing-aware replacement policy
- Expandable
Independent DRAM controllers
- Concurrency
- More DRAM banks reduces page conflicts
- Longer burst length improves command efficiency
- Dual channel unbuffered 1066 support(applies to socket AM2+ and s1207+ QFX only)
- Channel Interleaving
Optimized DRAM paging
- Increase page hits
- Decrease page conflicts
Re-architect northbridge for higher bandwidth
- Increase buffer sizes
- Optimize schedulers
- Ready to support future DRAM technologies
Write bursting
- Minimize Rd/Wr Turnaround
DRAM prefetcher
- Track positive and negative, unit and non-unit strides
- Dedicated buffer for prefetched data
- Aggressively fill idle DRAM cycles
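
To illustrate what tracking "positive and negative, unit and non-unit strides" means, here is a toy stride detector. It is only a sketch of the general technique, not AMD's actual prefetcher: the class name, the two-confirmation threshold and the single tracked stream are all assumptions.

```python
class StrideDetector:
    """Toy stride tracker: predicts the next address once the same
    non-zero stride has been seen twice in a row."""

    def __init__(self, confirmations_needed: int = 2):
        self.last_addr = None
        self.last_stride = None
        self.confirmations = 0
        self.confirmations_needed = confirmations_needed

    def access(self, addr: int):
        """Record an access; return the predicted next address, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                self.confirmations += 1          # same stride again
            else:
                self.confirmations = 1 if stride != 0 else 0  # start over
            self.last_stride = stride
            if self.confirmations >= self.confirmations_needed:
                prediction = addr + stride       # works for negative strides too
        self.last_addr = addr
        return prediction


up = StrideDetector()
print([up.access(a) for a in (0, 64, 128, 192)])    # unit stride of one 64-byte line
down = StrideDetector()
print([down.access(a) for a in (1000, 744, 488)])   # negative, non-unit stride
```

A real prefetcher would track several streams at once and throttle itself when DRAM is busy, which this sketch ignores.
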
Core prefetchers
- DC Prefetcher fills directly to L1 Cache
- IC Prefetcher more flexible
* 2 outstanding requests to any address
HyperTransport 3
- Up to three 16bit cHT links
- Up to 5200MT/s per link
- Un-ganging mode: each 16bit HT link can be divided into two 8bit virtual links
- Can dynamically adjust frequency and bit width to save power
- AC mode (higher latency mode) to allow longer communications distances
- Hot pluggable
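
For a feel of what the HyperTransport 3 numbers above mean in bandwidth terms, peak throughput per direction is just transfers per second times link width. A small sketch using the 5200MT/s and 16bit/8bit figures from the list (the helper name is made up; these are theoretical peaks, not measured values):

```python
def ht_link_bandwidth_gb_s(transfers_per_s: float, width_bits: int) -> float:
    """Peak one-direction bandwidth of a link in GB/s (10^9 bytes per second)."""
    return transfers_per_s * width_bits / 8 / 1e9


full_link = ht_link_bandwidth_gb_s(5200e6, 16)  # one 16bit link
sub_link = ht_link_bandwidth_gb_s(5200e6, 8)    # one 8bit link in un-ganged mode
print(full_link, sub_link)
```

That works out to roughly 10.4 GB/s per direction for a full 16bit link and half of that for each 8bit link.
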

Barcelona pipeline (12/18 ALU/FPU stages): http://img162.imageshack.us/img162/6479/k8lexecutionpipelineye3.jpg

CPU Core IPC Enhancements:
Advanced branch prediction
- Dedicated 512-entry Indirect Predictor
- Return stack size doubled
- More branch history bits and improved branch hashing
History-based pattern predictor
32B instruction fetch
- Benefits integer code too
- Reduced split-fetch instruction cases
Sideband Stack Optimizer
- Perform stack adjustments for PUSH/POP operations “on the side”
- Stack adjustments don’t occupy functional unit bandwidth
- Breaks serial dependence chains for consecutive PUSH/POPs
Out-of-order load execution
- New technology allows load instructions to bypass:
* Other loads
* Other stores which are known not to alias with the load
- Significantly mitigates L2 cache latency
TLB Optimisations
- Support for 1G pages
- 48bit physical address (256TB)
- Larger TLBs key for:
* Virtualized workloads
* Large-footprint databases and
* transaction processing
- DTLB:
* Fully-associative 48-way TLB (4K, 2M, 1G)
* Backed by L2 TLBs: 512 x 4K, 128 x 2M
- ITLB:
* 16 x 2M entries
Data-dependent divide latency
Additional fastpath instructions
– CALL and RET-Imm instructions
– Data movement between FP & INT
Bit Manipulation extensions
- LZCNT/POPCNT
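
For anyone unfamiliar with the two bit-manipulation instructions, this is what they compute, written as plain reference functions. It is a sketch of the semantics only (the Python names and the `width` parameter are mine); the real instructions work on 16, 32 or 64-bit registers.

```python
def popcnt(x: int, width: int = 64) -> int:
    """Number of set bits in the low `width` bits of x."""
    return bin(x & ((1 << width) - 1)).count("1")


def lzcnt(x: int, width: int = 64) -> int:
    """Number of leading zero bits in a `width`-bit value.
    Returns `width` when x is 0, which is how LZCNT differs from BSR."""
    x &= (1 << width) - 1
    return width - x.bit_length()


print(popcnt(0b1011))   # 3 bits set
print(lzcnt(1, 32))     # 31 leading zeros in a 32-bit value
print(lzcnt(0, 64))     # 64: well defined for zero input
```
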
SSE extensions
- EXTRQ/INSERTQ (SSE4A)
- MOVNTSD/MOVNTSS (SSE4A)
- MWAIT/MONITOR (SSE3)
Comprehensive Upgrades for SSE
- Dual 128-bit SSE dataflow
- Up to 4 double-precision FP OPS/cycle
- Dual 128-bit loads per cycle
- New vector code, SSE128
- Can perform SSE MOVs in the FP “store” pipe
- Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)
- FP Scheduler can hold 36 Dedicated x 128-bit ops
- SSE Unaligned Load-Execute mode:
* Remove alignment requirements for SSE ld-op instructions
* Eliminate awkward pairs of separate load and compute instructions
* To improve instruction packing and decoding efficiency
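
As a back-of-the-envelope check on the "up to 4 FP ops per cycle" figure in the SSE list above, peak throughput is clock × cores × FLOPs per cycle. A minimal sketch, assuming the 2.6GHz top clock from the speculative roadmap further down and all four cores busy (the function name is made up for illustration, and real code rarely gets near peak):

```python
def peak_dp_gflops(clock_ghz: float, cores: int = 4, dp_flops_per_cycle: int = 4) -> float:
    """Theoretical peak double-precision GFLOPS for the whole chip."""
    return clock_ghz * cores * dp_flops_per_cycle


print(peak_dp_gflops(2.6))           # whole quad-core at the assumed 2.6GHz
print(peak_dp_gflops(2.6, cores=1))  # a single core
```

At the assumed clock that is about 41.6 GFLOPS for the chip, or about 10.4 GFLOPS per core.
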

Most of this information is from Ben Sander's presentation at AMD FPF 2006; the rest comes from various internet sites.

AMD Software Optimization Guide for K10

Educational articles about K8L(K10):
http://www.anandtech.com/cpuchipse [...] i=2939&p=1
http://www.xbitlabs.com/articles/c [...] d-k8l.html
http://www.extremetech.com/article [...] 644,00.asp
http://www.eetimes.com/news/semi/s [...] =193200399
http://www.channelinsider.com/arti [...] 008_1.aspx
http://www.realworldtech.com/page. [...] 0206035626
http://www.theregister.co.uk/2007/ [...] _powernow/
http://www.tgdaily.com/2007/02/11/amd_barcelona/

AMD official statements and public presentations:
Syndrome-oc interview with Giuseppe Amato & Philip G. Eisler
HEXUS interview with Patrick Patla, Director of AMD Server Workstation Division
Interview with Randy Allen, AMD's corporate vice president for servers and workstations
AMD Developer Day, London Dec/06/2006 presentation
Game Developers Conference 2007, Justin Boggs AMD

Roadmap(speculative):

Server:
Opteron 8272SE 2.6GHz 120W TDP, socket F, 3 cHT links, 3600MT/s, DDR2-667, Q2 2008
Opteron 8270SE 2.5GHz 95W TDP, socket F, 3 cHT links, 3400MT/s, DDR2-667, Q3 2007
Opteron 8268SE 2.4GHz 89W TDP, socket F, 3 cHT links, 3400MT/s, DDR2-667, Q3 2007
Opteron 8266 2.3GHz 89W TDP, socket F, 3 cHT links, 3200MT/s, DDR2-667, Q3 2007
Opteron 8264 2.2GHz 89W TDP, socket F, 3 cHT links, 3200MT/s, DDR2-667, Q3 2007
Opteron 8262 2.1GHz 89W TDP, socket F, 3 cHT links, 3000MT/s, DDR2-667, Q3 2007
Opteron 8260HE 2.0GHz 68W TDP, socket F, 3 cHT links, 3000MT/s, DDR2-667, Q4 2007
Opteron 8258HE 1.9GHz 68W TDP, socket F, 3 cHT links, 2800MT/s, DDR2-667, Q4 2007
Opteron 2272SE 2.6GHz 120W TDP, socket F, 2 cHT links, 3600MT/s, DDR2-667, Q2 2008
Opteron 2270SE 2.5GHz 95W TDP, socket F, 2 cHT links, 3400MT/s, DDR2-667, Q3 2007
Opteron 2268SE 2.4GHz 89W TDP, socket F, 2 cHT links, 3400MT/s, DDR2-667, Q3 2007
Opteron 2266 2.3GHz 89W TDP, socket F, 2 cHT links, 3200MT/s, DDR2-667, Q3 2007
Opteron 2264 2.2GHz 89W TDP, socket F, 2 cHT links, 3200MT/s, DDR2-667, Q3 2007
Opteron 2262 2.1GHz 89W TDP, socket F, 2 cHT links, 3000MT/s, DDR2-667, Q3 2007
Opteron 2260HE 2.0GHz 68W TDP, socket F, 2 cHT links, 3000MT/s, DDR2-667, Q4 2007
Opteron 2258HE 1.9GHz 68W TDP, socket F, 2 cHT links, 2800MT/s, DDR2-667, Q4 2007
Opteron 1370SE 2.5GHz 95W TDP, socket AM2+, 1 cHT link, 3400MT/s, DDR2-1067, 2008
Opteron 1368SE 2.4GHz 89W TDP, socket AM2+, 1 cHT link, 3400MT/s, DDR2-1067, 2008
Opteron 1366 2.3GHz 89W TDP, socket AM2+, 1 cHT link, 3200MT/s, DDR2-1067, 2008
Opteron 1364 2.2GHz 89W TDP, socket AM2+, 1 cHT link, 3200MT/s, DDR2-1067, 2008
Opteron 1362 2.1GHz 89W TDP, socket AM2+, 1 cHT link, 3000MT/s, DDR2-1067, 2008

Desktop:
AgenaFX 2.6GHz, unknown TDP, socket 1207+(Quad FX), 2 cHT links, 3600MT/s, DDR2-1067, Q3 2007
AgenaFX 2.4GHz, unknown TDP, socket 1207+(Quad FX), 2 cHT links, 3600MT/s, DDR2-1067, Q3 2007
AgenaFX 2.4GHz, unknown TDP, socket 1207+(Quad FX), 2 cHT links, 3200MT/s, DDR2-1067, Q3 2007
AgenaFX 2.2GHz, unknown TDP, socket 1207+(Quad FX), 2 cHT links, 3200MT/s, DDR2-1067, Q3 2007
AgenaFX 2.4GHz, unknown TDP, socket AM2+, 2 cHT links, 3200MT/s, DDR2-1067, Q3 2007
AgenaFX 2.2GHz, unknown TDP, socket AM2+, 2 cHT links, 3200MT/s, DDR2-1067, Q3 2007
Agena 2.4GHz, 89W TDP, socket AM2+, 1 cHT link, 3600MT/s, DDR2-1067, Q4 2007
Agena 2.2GHz, 89W TDP, socket AM2+, 1 cHT link, 3200MT/s, DDR2-1067, Q4 2007
Kuma(dualcore) 2.8GHz, 89W TDP, socket AM2+, 1 cHT link, 4200MT/s, DDR2-1067, Q4 2007
Kuma(dualcore) 2.6GHz, 65W TDP, socket AM2+, 1 cHT link, 3800MT/s, DDR2-1067, Q4 2007
Kuma(dualcore) 2.4GHz, 65W TDP, socket AM2+, 1 cHT link, 3600MT/s, DDR2-1067, Q4 2007
Kuma(dualcore) LP 2.3GHz, 45W TDP, socket AM2+, 1 cHT link, 3400MT/s, DDR2-1067, Q1 2008
Kuma(dualcore) LP 2.1GHz, 45W TDP, socket AM2+, 1 cHT link, 3000MT/s, DDR2-1067, Q1 2008
Kuma(dualcore) LP 1.9GHz, 45W TDP, socket AM2+, 1 cHT link, 2800MT/s, DDR2-1067, Q1 2008


Roadmap sources:
http://www.dailytech.com/Final+AMD [...] ware.co.uk
http://www.dailytech.com/More+Deta [...] le7147.htm (2007-05-03)
http://www.cpilive.net/v3/inside.aspx?scr=n&NID=1320 (2007-04-15)
http://www.hkepc.com/bbs/itnews.ph [...] &endtime=0 (2007-02-23)
http://trackingamd.blogspot.com/20 [...] ealed.html
http://www.hkepc.com/bbs/itnews.php?tid=709944
http://www.dailytech.com/AMD+Quadc [...] le5992.htm


P.S. Any additional data or information will be highly appreciated.

Message edited by Jake_Barnes on 10-31-2007 at 02:29:31 AM

Profile: nimble knuckle

Comprehensive list, good job.

Profile: Forum Resident

yeah, thanks.

Profile: member

Nice compilation gOJDO :!:
I don't pretend to understand what all that will mean in terms of processing power.

Your opinion, if you please: do you think it will be the monster they claim?

Also curious: expandable L3... at the fab or could a user plug in a microSD card next to the cpu socket 8O ?

Profile: enthusiast

the info is great! thanx alot!

Profile: Faithful Poster

Quote :

Nice compilation gOJDO :!:
I don't pretend to understand what all that will mean in terms of processing power.

Your opinion, if you please: do you think it will be the monster they claim?


Yes, it will be a monster, for sure. It is a quad-core, after all. How it will compete against the current Core2 Quad depends on its frequency. IMO, clock for clock, both will perform similarly; one will be marginally faster than the other. K8L will be an FPU monster and will spank Core2 in FP calculations, but it will lose in ALU calculations. Both will have roughly the same SSE performance.

Quote :

Also curious: expandable L3... at the fab or could a user plug in a microSD card next to the cpu socket 8O ?

The design of the L3 cache logic supports more cache. In 2008, the 45nm K8L shrink will have 6MB of L3 cache.

Profile: member

I am already adding these to my upgrade options for 2010 when they will be <$200 :)

Profile: addict

Mmm... very thorough list there... and from the looks of things, maybe K10 will be quite something when it is released. Let's just hope AMD doesn't botch the release though...

wr
Profile: enthusiast

Quote :


Dedicated L1 cache
- 256bit 64kB (32kB instruction/32kB data)
- 2 x 128bit loads/cycle
- lowest latency
Dedicated L2 cache
- 128bit 512kB
- 128bit bus to northbridge
- reduced latency
- eliminates conflicts common in shared caches - better for virtualization
Shared L3 cache
- 128bit 2MB
- Victim-cache architecture maximizes efficiency of cache hierarchy
- Fills from L3 leave likely shared lines in the L3
- Sharing-aware replacement policy
- Expandable
Independent DRAM controllers
- Concurrency
- More DRAM banks reduces page conflicts
- Longer burst length improves command efficiency
- Dual channel unbuffered 1066 support(applies to socket AM2+ and s1207+ QFX only)
- Channel Interleaving



I'm questioning the accuracy of this list because of a few odd details:

1) There was a source which misstated the amount of L1 cache as less than in the K7 and K8 cores. An AMD spokesperson came out and said the L1 cache would most likely be the same size as before. It's been 64K data + 64K instruction since the first Athlons.

2) Referring to L1 as 256-bit and L2/L3 as 128-bit doesn't make technical sense. Exactly what points do the data busses connect? I presume that the L1-L2 bus used to be 64-bit in K7 and soon it'll be 256-bit as in the P4/C2D, a healthy improvement. I also know that since the K8, the cache-IMC bus has been 128-bit, but we all know DRAM sends this data much slower than core clock rate.

3) L3 victim cache, as far as I know, is just another way to say exclusive cache. Both the external and integrated K7 L2 SRAM were also victim caches. There are advantages and disadvantages to this setup. Of course, considering the 2 MB of L2 in quad-core, an inclusive L3 at 2 MB would be of dubious utility.

4) The dual-controller IMC is a thoughtful design that I'd like to see some confirmation on. It seems hard to design without introducing latency relative to a synchronous dual-channel controller.
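
To make point 3 concrete, here is a toy model of an exclusive (victim) arrangement: the lower cache is filled only by lines evicted from the upper one, and a hit in the lower cache moves the line back up. This is a sketch of the general idea with LRU replacement and made-up names and sizes, not a description of how the K10 L3 actually behaves.

```python
from collections import OrderedDict


class ExclusivePair:
    """Toy two-level hierarchy: an LRU upper cache backed by a victim cache."""

    def __init__(self, upper_lines: int, victim_lines: int):
        self.upper = OrderedDict()   # line address -> None, oldest first
        self.victim = OrderedDict()
        self.upper_lines = upper_lines
        self.victim_lines = victim_lines

    def access(self, line: int) -> str:
        """Touch a line; return where it was found: 'upper', 'victim' or 'memory'."""
        if line in self.upper:
            self.upper.move_to_end(line)
            return "upper"
        if line in self.victim:
            del self.victim[line]    # exclusive: the line leaves the victim cache
            source = "victim"
        else:
            source = "memory"
        self._fill_upper(line)
        return source

    def _fill_upper(self, line: int) -> None:
        if len(self.upper) >= self.upper_lines:
            evicted, _ = self.upper.popitem(last=False)
            # The victim cache is filled only by evictions from the upper level.
            if len(self.victim) >= self.victim_lines:
                self.victim.popitem(last=False)
            self.victim[evicted] = None
        self.upper[line] = None


pair = ExclusivePair(upper_lines=2, victim_lines=2)
print([pair.access(a) for a in (0, 1, 2, 3)])  # cold misses
print([pair.access(a) for a in (0, 1, 2, 3)])  # second pass over the same 4 lines
```

In this toy model a working set of 4 lines fits even though each level holds only 2, because no line is ever stored twice. With an inclusive lower level of the same size, the second pass would go back to memory.
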

Profile: nimble knuckle

I can't wait to see single-threaded app performance with all that L3 dedicated to one core. If it all fit inside the cache, it would be game over.

That being said...

...it's a quad-core part, I can't wait to see the multithreaded performance, :drool:

Profile: Forum Resident

Quote :

I'm questioning the accuracy of this list because of a few odd details:

1) There was a source which misstated the amount of L1 cache as less than in the K7 and K8 cores. An AMD spokesperson came out and said the L1 cache would most likely be the same size as before. It's been 64K data + 64K instruction since the first Athlons.

2) Referring to L1 as 256-bit and L2/L3 as 128-bit doesn't make technical sense. Exactly what points do the data busses connect? I presume that the L1-L2 bus used to be 64-bit in K7 and soon it'll be 256-bit as in the P4/C2D, a healthy improvement. I also know that since the K8, the cache-IMC bus has been 128-bit, but we all know DRAM sends this data much slower than core clock rate.

3) L3 victim cache, as far as I know, is just another way to say exclusive cache. Both the external and integrated K7 L2 SRAM were also victim caches. There are advantages and disadvantages to this setup. Of course, considering the 2 MB of L2 in quad-core, an inclusive L3 at 2 MB would be of dubious utility.

4) The dual-controller IMC is a thoughtful design that I'd like to see some confirmation on. It seems hard to design without introducing latency relative to a synchronous dual-channel controller.




1) It's still 64K/64K

2) The L1 is bidirectional so it can read and write 128bit blocks.

3) The L3 can dump cache anytime so it will be very useful. If the load is divided evenly (or close to evenly) across cores, then each core can use and flush 512K without going to main memory.

4) The dual controller will help with simultaneous accesses. If core 0 needs data and core 3 needs data, they can both access RAM at the same time.

m25
Profile: Faithful Poster

Well, all I can say is "HOLY SH!T!!!" That picture is the end of the world... I feel bad... I want this thing when it comes out... but I know I can't have it that soon :cry:
Siiiiigh.

AMD - The Lesser Evil
Profile: Forum Resident

All I can say is it is nice to see some optimism about AMD instead of the "thank god it is safe to slag off AMD and praise Intel again" attitude.

Still, let's not get too excited. It ain't released yet.

Profile: stranger

Quote :

Yes, it will be a monster, for sure. It is a quad-core, after all. How it will compete against the current Core2 Quad depends on its frequency. IMO, clock for clock, both will perform similarly; one will be marginally faster than the other. K8L will be an FPU monster and will spank Core2 in FP calculations, but it will lose in ALU calculations. Both will have roughly the same SSE performance.


Yes, I would agree with this. I was initially only estimating a 10% improvement in Integer IPC. The talk before was that this was more of an upgrade to K8 than a new core. However, after seeing all of the changes, it clearly is a new core as different from K8 as K8 was from K7. I'm now thinking that it will get close to C2D in Integer IPC but I'm guessing there will still be times when 4 instruction issue and Macro Ops Fusion prevails. Since K8 is faster in FP now the new hardware will only make it that much faster. I agree that SSE should be pretty close although C2D could have a small advantage for some types of carefully tuned data and SSE instructions. Generally, higher clock should prevail.

The L3 is not simply exclusive as the L2 is. The L3 includes a sharing prediction buffer and is capable of guessing whether data should be shared. I believe it is initially liberal with instructions, treating them as inclusive, and conservative with data, treating it as exclusive.

Also, the Crossbar isn't exactly the same as a Northbridge since it not only handles communication with the IMC but the HT links and data transfers between (among) cores as well.

There are only a few updates that I can see for the initial list beyond the correction that the L1 is the same size as for K8.

HyperTransport 3.0
The unganged mode has two channels, not two virtual channels. Basically, a single HT controller is capable of automatically sensing that each half is connected to a separate destination and can use them as two separate channels.

Also:
1. Power Management Enhancements - Can dynamically adjust frequency and bit width to save power.
2. AC mode - Higher latency mode to allow longer communications distances. It is capable of auto-sensing this.
3. Hot Pluggable.

My compliments on the detail and accuracy of the list.

Profile: newbie

I thought that you were an Intel fanboy, but I guess I was wrong. Great info, it must have taken a great deal of time and effort to compile it. I'm sure a lot of people will appreciate it (if they are anything like me, eager to hear news about new technology). I hope AMD outperforms Intel, and then Intel again does the same to AMD, and we, the users, keep profiting from their competition :)
Greetings!

Profile: Honorary Poster

gOJDO,

Good job.

I like what you have done with the place :).

Quick, concise, accurate: all words that could be used.

I second the Sticky!!!

Of course, as Jack added, this would then burden you with updates as info comes in.

With the updates to the following:

Branch Prediction
Sideband Stack Optimization
OOOE
New SSE instruction support
and last but not least L3 Cache with Share support

This thing looks like it has a very real shot at taking back the Crown. They need to release as soon as possible for two reasons:

1) Intel will be close on their heels.
2) I want to play with one :)

Profile: stranger