AMD quad-core Barcelona laid bare

eregular

Distinguished
Dec 8, 2006
266
0
18,780
Other than four cores, the most obvious difference is the new widened SSE instructions. On the pre-Barcelona parts, SSE was done in 64 bit chunks, so if you wanted to do a 128b operation, you needed two passes, possibly more. With the widening of SSE, it should immediately double throughput on SSE instructions. Obviously media operations will benefit, but HPC and FP heavy ops will get a solid kick in the pants too.
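
As a rough illustration of the "two passes" point, here is a minimal sketch of the idealized arithmetic. It assumes a single execution unit and ignores decode, scheduling and memory effects entirely; the function names are made up for this example and do not come from the article or from AMD's documentation.

```python
import math

def passes_per_op(op_bits: int, datapath_bits: int) -> int:
    """Passes a datapath of the given width needs to complete one operation."""
    return math.ceil(op_bits / datapath_bits)

def peak_ops_per_cycle(op_bits: int, datapath_bits: int, units: int = 1) -> float:
    """Idealized peak throughput: one pass per cycle per unit (an assumption)."""
    return units / passes_per_op(op_bits, datapath_bits)

# 128-bit SSE operation on a 64-bit datapath vs. a 128-bit datapath
narrow = peak_ops_per_cycle(128, 64)
wide = peak_ops_per_cycle(128, 128)

print(passes_per_op(128, 64), passes_per_op(128, 128))
print(wide / narrow)
```

Under these assumptions the wider datapath halves the pass count, which is where the "double throughput" figure comes from. Real code will only approach that when it is actually limited by SSE execution.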

http://www.theinquirer.net/default.aspx?article=35011

Nice, this is a pretty good article... now if they could just hurry their asses up and let us see some f@#$'in benchmarks!!
 

Dresdenboy

Distinguished
Feb 27, 2007
7
0
18,510
Actually, there is a link in plain sight to a 37-page AMD developer PDF, which he got his info from.
Thanks for the info :) It was informative. Now we know that the K10 FP pipeline is 1 stage longer than the FP pipeline of K8. (see page 24)
Its effect should be negligible. But the 64-bit/cycle store bandwidth (or 128 bits per 2 cycles) might prove more influential, as discussed in the original aceshardware thread. It would affect even simple SSE copy loops (think of Sandra's cache bandwidth measurements).
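
To put a number on why store bandwidth matters for a copy loop, here is a back-of-envelope sketch. It assumes the loop is limited purely by store bandwidth (loads, loop overhead and cache misses are ignored), and the function name and the 32 KiB buffer size are arbitrary choices for illustration, not figures from the AMD PDF.

```python
def min_store_cycles(copy_bytes: int, store_bits_per_cycle: int) -> int:
    """Lower bound on cycles to write copy_bytes if stores are the only limit."""
    total_bits = copy_bytes * 8
    # Ceiling division: a partial final store still costs a whole cycle.
    return -(-total_bits // store_bits_per_cycle)

buf = 32 * 1024  # hypothetical 32 KiB buffer

cycles_64 = min_store_cycles(buf, 64)    # 64 bits stored per cycle
cycles_128 = min_store_cycles(buf, 128)  # 128 bits stored per cycle

print(cycles_64, cycles_128)  # 4096 2048
```

So if stores really are capped at 64 bits per cycle, a simple copy loop cannot go faster than half the rate that a full 128-bit store per cycle would allow, no matter how wide the loads are.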
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
Actually, there is a link in plain sight to a 37-page AMD developer PDF, which he got his info from.
Thanks for the info :) It was informative. Now we know that the K10 FP pipeline is 1 stage longer than the FP pipeline of K8. (see page 24)
Its effect should be negligible. But the 64-bit/cycle store bandwidth (or 128 bits per 2 cycles) might prove more influential, as discussed in the original aceshardware thread. It would affect even simple SSE copy loops (think of Sandra's cache bandwidth measurements).


I think you read that wrong. It says the data is transferred in 128-bit blocks and decoded into two 64-bit chunks, which can both be written at the same time because K10 has two store ports (page 25).

The same page says it will be better to use SSE copy loops because of the additional bandwidth.



 

gOJDO

Distinguished
Mar 16, 2006
2,309
1
19,780
PAGE 24:

SSE128 adds an additional register read pipe stage
- Only impacts floating point pipeline
- Adds a cycle to FP load latency

The FP pipeline is one stage longer: 18 stages compared to 17 on K8. Which part don't you understand?