Actually, there is a link in plain sight to a 37-page AMD developer PDF, which is where he got his info from.
Thanks for the info.
It was informative. Now we know that the K10 FP pipeline is one stage longer than the K8 FP pipeline (see page 24).
Its effect should be negligible. But the 64-bit/cycle store bandwidth (or 128 bits per 2 cycles) might prove more influential, as discussed in the original aceshardware thread. It would affect even simple SSE copy loops (think of Sandra's cache bandwidth measurements).
I think you read that wrong. It says the data is transferred in 128-bit blocks and decoded into two 64-bit chunks, which can both be written at the same time because K10 has two store ports (page 25).
The same page says it will be better to use SSE copy loops because of the additional bandwidth.
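For anyone who wants to measure this, here is a rough sketch of the kind of simple SSE copy loop being discussed. It is only an illustration, not code from the AMD PDF: the function name sse_copy is made up, and it assumes both buffers are 16-byte aligned and the size is a multiple of 16. Whether each 128-bit store retires in one cycle or two on K10 is exactly the point in dispute, and the code itself says nothing about that; you would have to time it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_load_si128, _mm_store_si128 */
#include <stddef.h>

/* Hypothetical helper for illustration only.
   Assumes dst and src are 16-byte aligned and n is a multiple of 16. */
static void sse_copy(void *dst, const void *src, size_t n)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;

    for (size_t i = 0; i < n / 16; ++i) {
        __m128i v = _mm_load_si128(s + i);  /* one aligned 128-bit load  */
        _mm_store_si128(d + i, v);          /* one aligned 128-bit store */
    }
}
```

Timing this over a buffer that fits in L1 and dividing bytes copied by elapsed cycles should give a rough store-bandwidth figure to compare against the two readings of page 25.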