Additional registers in x86-64

G

Guest

Guest
Archived from groups: comp.sys.ibm.pc.hardware.chips (More info?)

A few weeks ago, AMD published the SPECint2000 score for the FX-53:
http://www.spec.org/cpu2000/results/res2004q3/cpu2000-20040628-03181.html

SPECint2000_peak = 1700
SPECint2000_base = 1601

I see that they used Intel's compiler on Windows XP Professional. Please
correct me if I am wrong. Windows XP is a 32-bit OS, thus the benchmarks
did not use the 8 additional general purpose registers defined in the
x86-64 instruction set, right?

I imagine that, even with 8 more registers available, gcc cannot
outperform Intel's compiler and Microsoft libraries on integer code?

I also noticed Sun's recent SPECfp2000 submission for the Opteron 150:
http://www.spec.org/cpu2000/results/res2004q3/cpu2000-20040712-03241.html

SPECfp2000_peak = 1787
SPECfp2000_base = 1637

Sun did use a 64-bit OS, and it seems they compiled most benchmarks as
64-bit applications. I imagine the compiler (most often PathScale)
produced SIMD code to use the XMM registers?

In short, I am wondering how much improvement the 8 additional GPRs and
8 additional media registers bring...

--
Regards, Grumble
 

On Fri, 13 Aug 2004 11:03:06 +0200, Grumble <a@b.c> wrote:
>
>A few weeks ago, AMD published the SPECint2000 score for the FX-53:
>http://www.spec.org/cpu2000/results/res2004q3/cpu2000-20040628-03181.html
>
>SPECint2000_peak = 1700
>SPECint2000_base = 1601
>
>I see that they used Intel's compiler on Windows XP Professional. Please
>correct me if I am wrong. Windows XP is a 32-bit OS, thus the benchmarks
>did not use the 8 additional general purpose registers defined in the
>x86-64 instruction set, right?

That is correct.

>I imagine that, even with 8 more registers available, gcc cannot
>outperform Intel's compiler and Microsoft libraries on integer code?

Correct again. The optimizations in GCC are not as good as those in
Intel's compiler, though the difference is generally not huge. Take a
look at the results AMD published for their 'A4800' systems. These
are a bunch of Opteron 144 (1.8GHz) processors running under a variety
of different OSes and using different compilers. The best result
they achieved was 1095 using Win2K3 (32-bit OS) + Intel's (32-bit)
compiler. For comparison, with SuSE 8 for AMD64 (64-bit OS) + GCC 3.3
(64-bit) they managed 1045, and with SuSE 8 for x86 (32-bit OS) + GCC
3.3 for x86 (32-bit compiler) they turned in a score of 960.

So, in the end AMD showed an 8.9% improvement by going from 32-bit to
64-bit code, but they saw a 14% improvement going from Linux + GCC
(32-bit) to Windows + Intel C (also 32-bit).
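
For anyone who wants to check the arithmetic, here is a quick sketch in
Python. It uses only the three scores quoted above; the dictionary keys
and the helper name are my own labels, not anything from the SPEC
reports.

```python
# Scores quoted above for the Opteron 144 (1.8GHz) systems.
# The keys are hypothetical labels chosen for this sketch.
scores = {
    "win2k3_icc_32": 1095,  # Win2K3 (32-bit OS) + Intel compiler (32-bit)
    "suse_gcc_64": 1045,    # SuSE 8 for AMD64 + GCC 3.3 (64-bit)
    "suse_gcc_32": 960,     # SuSE 8 for x86 + GCC 3.3 (32-bit)
}


def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Same compiler family and OS vendor, 32-bit vs 64-bit code.
gain_64bit = percent_gain(scores["suse_gcc_64"], scores["suse_gcc_32"])
# Both 32-bit, Linux + GCC vs Windows + Intel C.
gain_icc = percent_gain(scores["win2k3_icc_32"], scores["suse_gcc_32"])

print(f"32-bit GCC -> 64-bit GCC:     {gain_64bit:.1f}%")
print(f"32-bit GCC -> 32-bit Intel C: {gain_icc:.1f}%")
```

Note that the second comparison changes the OS and the compiler at the
same time, so it cannot tell you how much of the gain comes from each.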

>I also noticed Sun's recent SPECfp2000 submission for the Opteron 150:
>http://www.spec.org/cpu2000/results/res2004q3/cpu2000-20040712-03241.html
>
>SPECfp2000_peak = 1787
>SPECfp2000_base = 1637
>
>Sun did use a 64-bit OS, and it seems they compiled most benchmarks as
>64-bit applications. I imagine the compiler (most often PathScale)
>produced SIMD code to use the XMM registers?

Presumably yes, it would use SIMD code, the XMM registers and the
extra 8 integer registers (even with FP code you still need some
integer registers).

>In short, I am wondering how much improvement the 8 additional GPRs and
>8 additional media registers bring...

Usually more than enough to make up for the performance loss you would
expect with 64-bit code. Normally, if all else is equal, 64-bit code
is about 5-10% slower than 32-bit code until you blow your memory
limits, at which point 32-bit code just completely breaks down.
That's why most bi-arch systems, e.g. Sun's Solaris, still run lots of
32-bit applications where they can.

With AMD64 the extra registers have managed to improve the performance
enough that they not only negate this performance loss, but turn it
into a 5-10% performance gain on average. Not bad at all for a fairly
small cost in die space and virtually no changes to the instruction
set. FWIW the reason why AMD only went to 16 registers (still a
pretty low number as compared to most modern processors) is that this
is the most that they could squeeze into the x86 instruction set
without making fairly major changes. They did a pretty damn good job
of this; obviously they actually put some thought into how to extend
x86 to 64 bits as naturally as possible.
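
To make the "most they could squeeze in" point concrete: the x86 ModRM
byte has only 3 bits per register field (8 registers), and the AMD64
REX prefix supplies one extra bit for each field, which gives 4 bits
and therefore 16 registers. A rough sketch in Python; the helper name
is made up for illustration, and it only handles the simple
register-to-register case.

```python
# 64-bit general purpose registers in encoding order (0..15).
GPR64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_reg_operands(rex, modrm):
    """Return (reg, rm) register names for a REX prefix + ModRM byte.

    Hypothetical helper for illustration only: it assumes 64-bit
    operands and handles just mod == 0b11 (register-direct), ignoring
    SIB bytes, displacements and everything else a real decoder needs.
    """
    if rex >> 4 != 0b0100:
        raise ValueError("not a REX prefix (expected 0x40..0x4F)")
    if modrm >> 6 != 0b11:
        raise ValueError("only register-direct ModRM is handled here")
    rex_r = (rex >> 2) & 1  # extends the ModRM.reg field
    rex_b = rex & 1         # extends the ModRM.rm field
    reg = (rex_r << 3) | ((modrm >> 3) & 0b111)
    rm = (rex_b << 3) | (modrm & 0b111)
    return GPR64[reg], GPR64[rm]


# Bytes 48 89 C3 encode "mov rbx, rax" (opcode 89 is MOV r/m64, r64).
print(decode_reg_operands(0x48, 0xC3))
# Same opcode and ModRM byte with REX.R and REX.B set: "mov r11, r8".
print(decode_reg_operands(0x4D, 0xC3))
```

The only difference between the two instructions is the prefix byte,
which is why the old encodings did not have to change.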

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
 

Tony Hill wrote:
> Correct again. The optimizations in GCC are not as good as those in
> Intel's compiler, though the difference is generally not huge. Take a
> look at the results AMD published for their 'A4800' systems. These
> are a bunch of Opteron 144 (1.8GHz) processors running under a variety
> of different OSes and using different compilers. The fastest results
> they achieved was 1095 using Win2K3 (32-bit OS) + Intel's (32-bit)
> compiler. For comparison, SuSE 8 for AMD64 (64-bit OS) + GCC 3.3
> (64-bit) they managed 1045, and with SuSE 8 for x86 (32-bit OS) + GCC
> 3.3 for x86 (32-bit compiler) they turned in a score of 960.
>
> So, in the end AMD showed an 8.9% improvement by going from 32-bit to
> 64-bit code, but they saw a 14% improvement going from Linux + GCC
> (32-bit) to Windows + Intel C (also 32-bit).

Cool, but I wonder why AMD submitted the scores with the Intel 32-bit
compiler and a 32-bit OS, rather than a 64-bit OS with the 64-bit Pathscale
or PGI compilers? These two companies seem to have built their compilers
entirely around AMD64, which I'm certain Intel's compilers aren't.

> With AMD64 the extra registers have managed to improve the performance
> enough that they not only negate this performance loss, but turn it
> into a 5-10% performance gain on average. Not bad at all for a fairly
> small cost in die space and virtually no changes to the instruction
> set. FWIW the reason why AMD only went to 16 registers (still a
> pretty low number as compared to most modern processors) is that this
> is the most that they could squeeze into the x86 instruction set
> without making fairly major changes (they did a pretty damn good job
> of this, obviously they actually put some thought into how to extend
> x86 to 64-bits as naturally as possible).

How do we know that the extra performance isn't due to the built-in
memory controller and branch prediction?

Yousuf Khan
 


On Sun, 15 Aug 2004 07:46:17 GMT, "Yousuf Khan" <bbbl67@ezrs.com>
wrote:
>Tony Hill wrote:
>> So, in the end AMD showed an 8.9% improvement by going from 32-bit to
>> 64-bit code, but they saw a 14% improvement going from Linux + GCC
>> (32-bit) to Windows + Intel C (also 32-bit).
>
>Cool, but I wonder why AMD submitted the scores with the Intel 32-bit
>compiler and a 32-bit OS, rather than a 64-bit OS with the 64-bit Pathscale
>or PGI compilers? These two companies seem to have designed themselves
>completely for AMD64, which I'm completely certain the Intel compilers
>aren't.

Both of these compilers are still fairly new and they still are not as
fast as Intel's x86 compilers for integer code. Sun submitted some
SPEC CINT results using the Pathscale compiler, and they only managed
a score of 1437/1584 (base/peak) with an Opteron 250 while AMD managed
a score of 1566/1655 with an Opteron 150 using Intel's compiler.
What's more, Sun still had to resort to using GCC for one of their
tests as it was 20% faster on that test than PathCC.

On the floating point side of things though, it's a different story.
Sun's Opteron system turns in a VERY respectable 1637/1787
(base/peak) score using a combination of the GCC, PGI and Pathscale
compilers. This puts it just about on par with IBM's Power4 chip,
not bad for a processor that sells for about 1/10th the cost.

>> With AMD64 the extra registers have managed to improve the performance
>> enough that they not only negate this performance loss, but turn it
>> into a 5-10% performance gain on average. Not bad at all for a fairly
>> small cost in die space and virtually no changes to the instruction
>> set. FWIW the reason why AMD only went to 16 registers (still a
>> pretty low number as compared to most modern processors) is that this
>> is the most that they could squeeze into the x86 instruction set
>> without making fairly major changes (they did a pretty damn good job
>> of this, obviously they actually put some thought into how to extend
>> x86 to 64-bits as naturally as possible).
>
>How do we know that the extra performance isn't due to built-in memory
>controller and branch prediction?

Err.. it's not like AMD turns those features off in 32-bit mode on
their Athlon64 and Opteron chips!

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca