Very technical K10 information for those interested

April 21, 2007 5:13:13 AM

I was looking at some documentation on the GCC C compiler for UNIX and happened to find an interesting tidbit here. The file is part of the processor ability/cost declarations in the very latest development version 4.3.0 of the compiler.

[code:1:71aab0b531]
/* Subroutines used for code generation on IA-32.
Copyright (C) 1988, 1992, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
2002, 2003, 2004, 2005, 2006, 2007 Free Software Foundation, Inc.

This file is part of GCC.

<snip>

struct processor_costs amdfam10_cost = {
COSTS_N_INSNS (1), /* cost of an add instruction */
COSTS_N_INSNS (2), /* cost of a lea instruction */
COSTS_N_INSNS (1), /* variable shift costs */
COSTS_N_INSNS (1), /* constant shift costs */
{COSTS_N_INSNS (3), /* cost of starting multiply for QI */
COSTS_N_INSNS (4), /* HI */
COSTS_N_INSNS (3), /* SI */
COSTS_N_INSNS (4), /* DI */
COSTS_N_INSNS (5)}, /* other */
0, /* cost of multiply per each bit set */
{COSTS_N_INSNS (19), /* cost of a divide/mod for QI */
COSTS_N_INSNS (35), /* HI */
COSTS_N_INSNS (51), /* SI */
COSTS_N_INSNS (83), /* DI */
COSTS_N_INSNS (83)}, /* other */
COSTS_N_INSNS (1), /* cost of movsx */
COSTS_N_INSNS (1), /* cost of movzx */
8, /* "large" insn */
9, /* MOVE_RATIO */
4, /* cost for loading QImode using movzbl */
{3, 4, 3}, /* cost of loading integer registers
in QImode, HImode and SImode.
Relative to reg-reg move (2). */
{3, 4, 3}, /* cost of storing integer registers */
4, /* cost of reg,reg fld/fst */
{4, 4, 12}, /* cost of loading fp registers
in SFmode, DFmode and XFmode */
{6, 6, 8}, /* cost of storing fp registers
in SFmode, DFmode and XFmode */
2, /* cost of moving MMX register */
{3, 3}, /* cost of loading MMX registers
in SImode and DImode */
{4, 4}, /* cost of storing MMX registers
in SImode and DImode */
2, /* cost of moving SSE register */
{4, 4, 3}, /* cost of loading SSE registers
in SImode, DImode and TImode */
{4, 4, 5}, /* cost of storing SSE registers
in SImode, DImode and TImode */
3, /* MMX or SSE register to integer */
/* On K8
MOVD reg64, xmmreg Double FSTORE 4
MOVD reg32, xmmreg Double FSTORE 4
On AMDFAM10
MOVD reg64, xmmreg Double FADD 3
1/1 1/1
MOVD reg32, xmmreg Double FADD 3
1/1 1/1 */
64, /* size of prefetch block */
/* New AMD processors never drop prefetches; if they cannot be performed
immediately, they are queued. We set number of simultaneous prefetches
to a large constant to reflect this (it probably is not a good idea not
to limit number of prefetches at all, as their execution also takes some
time). */
100, /* number of parallel prefetches */
5, /* Branch cost */
COSTS_N_INSNS (4), /* cost of FADD and FSUB insns. */
COSTS_N_INSNS (4), /* cost of FMUL instruction. */
COSTS_N_INSNS (19), /* cost of FDIV instruction. */
COSTS_N_INSNS (2), /* cost of FABS instruction. */
COSTS_N_INSNS (2), /* cost of FCHS instruction. */
COSTS_N_INSNS (35), /* cost of FSQRT instruction. */

/* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
very small blocks it is better to use loop. For large blocks, libcall can
do nontemporary accesses and beat inline considerably. */
{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
{libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
{{libcall, {{8, loop}, {24, unrolled_loop},
{2048, rep_prefix_4_byte}, {-1, libcall}}},
{libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}
};
[/code:1:71aab0b531]
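
The last entries of that table (the `{libcall, {{6, loop}, ...}}` rows) appear to describe how block copies and fills are expanded depending on block size. A minimal sketch of how such a row could be consulted follows. It assumes each inner pair means "use this algorithm for blocks up to this many bytes" and that `-1` means "no upper bound". The type names, the array names `row_a` and `row_b`, and the helper `pick_strategy` are hypothetical and are not GCC's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Strategy names copied from the quoted table; the enum itself is illustrative. */
enum stringop_alg {
    libcall, loop, unrolled_loop, rep_prefix_4_byte, rep_prefix_8_byte
};

/* One size bucket: use `alg` for blocks of at most `max` bytes.
   Assumption: max == -1 means "no upper bound". */
struct size_bucket {
    long max;
    enum stringop_alg alg;
};

/* Buckets from {libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}} */
static const struct size_bucket row_a[] = {
    {6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}
};

/* Buckets from {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}} */
static const struct size_bucket row_b[] = {
    {16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}
};

/* Hypothetical helper: return the algorithm of the first bucket that
   can hold a block of `size` bytes. */
static enum stringop_alg pick_strategy(const struct size_bucket *row,
                                       size_t n, long size)
{
    assert(row != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (row[i].max == -1 || size <= row[i].max)
            return row[i].alg;
    }
    return libcall; /* no bucket matched: fall back to the library call */
}
```

Under these assumptions, `pick_strategy(row_a, 3, 10)` selects `unrolled_loop`, and `pick_strategy(row_b, 3, 10000)` falls through to `libcall`, which matches the comment about large blocks being better served by the library routine.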

The "amdfam10" processor is explicitly defined as "Barcelona" later on in the file:

[code:1:71aab0b531]
{"amdfam10", PROCESSOR_AMDFAM10, PTA_MMX | PTA_PREFETCH_SSE | PTA_3DNOW
| PTA_64BIT | PTA_3DNOW_A | PTA_SSE
| PTA_SSE2 | PTA_SSE3 | PTA_POPCNT
| PTA_ABM | PTA_SSE4A | PTA_CX16},
{"barcelona", PROCESSOR_AMDFAM10, PTA_MMX | PTA_PREFETCH_SSE | PTA_3DNOW
| PTA_64BIT | PTA_3DNOW_A | PTA_SSE
| PTA_SSE2 | PTA_SSE3 | PTA_POPCNT
| PTA_ABM | PTA_SSE4A | PTA_CX16},
[/code:1:71aab0b531]

The "SSE4A" instructions in the K10 are also defined:

[code:1:71aab0b531]
/* AMDFAM10 - SSE4A New Instructions. */
IX86_BUILTIN_MOVNTSD,
IX86_BUILTIN_MOVNTSS,
IX86_BUILTIN_EXTRQI,
IX86_BUILTIN_EXTRQ,
IX86_BUILTIN_INSERTQI,
IX86_BUILTIN_INSERTQ,

IX86_BUILTIN_VEC_INIT_V2SI,
IX86_BUILTIN_VEC_INIT_V4HI,
IX86_BUILTIN_VEC_INIT_V8QI,
IX86_BUILTIN_VEC_EXT_V2DF,
IX86_BUILTIN_VEC_EXT_V2DI,
IX86_BUILTIN_VEC_EXT_V4SF,
IX86_BUILTIN_VEC_EXT_V4SI,
IX86_BUILTIN_VEC_EXT_V8HI,
IX86_BUILTIN_VEC_EXT_V2SI,
IX86_BUILTIN_VEC_EXT_V4HI,
IX86_BUILTIN_VEC_SET_V8HI,
IX86_BUILTIN_VEC_SET_V4HI,

IX86_BUILTIN_MAX
[/code:1:71aab0b531]
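
For anyone wondering what EXTRQ and INSERTQ are for: my understanding is that they are bit-field extract and insert operations on the low 64 bits of an SSE register. The sketch below is a plain-C model of that idea, not the hardware instructions or GCC's builtins. The function names `extract_bits` and `insert_bits` are hypothetical, and the exact corner-case behavior of the real instructions is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a bit-field extract: take `len` bits of `src` starting at
   bit `idx` and return them zero-extended. Illustrative only. */
static uint64_t extract_bits(uint64_t src, unsigned len, unsigned idx)
{
    assert(idx < 64 && len >= 1 && len <= 64);
    uint64_t shifted = src >> idx;
    if (len == 64)
        return shifted; /* avoid an undefined 64-bit shift below */
    return shifted & ((UINT64_C(1) << len) - 1);
}

/* Model of a bit-field insert: copy the low `len` bits of `src` into
   `dst` starting at bit `idx`, leaving the other bits of `dst` alone. */
static uint64_t insert_bits(uint64_t dst, uint64_t src,
                            unsigned len, unsigned idx)
{
    assert(len >= 1 && len < 64 && idx < 64 && idx + len <= 64);
    uint64_t mask = ((UINT64_C(1) << len) - 1) << idx;
    return (dst & ~mask) | ((src << idx) & mask);
}
```

For example, `extract_bits(0xABCD, 8, 4)` pulls out the byte `0xBC`, and `insert_bits(0, 0xAB, 8, 8)` produces `0xAB00`.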

A little bit more information about the K10's low-level architecture can be seen in the Athlon's config file, athlon.md. Again, AMDFAM10 is the K10.

This information probably won't be of much interest to those who aren't super-geeky and write UNIX compilers. But the files live in a Concurrent Versions System (CVS) repository, which means that the previous versions of the files and who modified them are recorded. Here's what that pulls up:

athlon.md: All of the "amdfam10" stuff was added Feb 5th by user "hjagasia."

i386.c: "dwarak" added the Barcelona as a variant of the amdfam10 arch March 28th. User "hjagasia" added the "amdfam10" stuff, also on Feb 5th.

This same user also modified several other files, such as ammintrin.h, pmmintrin.h and tmmintrin.h to add the K10 instructions on the same date.

So this leads me to believe that there were K10s working well before Feb 5th, as that's when the code was submitted to GNU for inclusion in the compiler. I would suppose that the compiler would have to be tested on the CPU in question, so K10s must have been made and working for some time before then.
April 21, 2007 5:37:42 AM

Well done, and good sleuthing. Thanks for "translating" that into usable information, too.
April 21, 2007 6:10:53 AM

I'm very glad to see that SSE4a is at least confirmed now.
April 21, 2007 8:29:11 AM

Quote:
So this leads me to believe that there were K10s working well before Feb 5th, as that's when the code was submitted to GNU for inclusion in the compiler. I would suppose that the compiler would have to be tested on the CPU in question, so K10s must have been made and working for some time before then.


I'm interested to know why you think this?
This looks more like code optimisation to me, so that the best instruction is used when cleaning the stack, etc.
The new instructions are, well, just the new instructions. They are documented anyway, I thought?
April 21, 2007 8:40:19 AM

Nice find MU_engineer.
Quote:
COSTS_N_INSNS (1), /* cost of an add instruction */
COSTS_N_INSNS (2), /* cost of a lea instruction */

Does this mean that each addition needs 1 cycle to execute and each load effective address needs 2? :?
April 21, 2007 8:49:10 AM

Quote:
Nice find MU_engineer.
COSTS_N_INSNS (1), /* cost of an add instruction */
COSTS_N_INSNS (2), /* cost of a lea instruction */

Does this mean that each addition needs 1 cycle to execute and each load effective address needs 2? :?

Well, the problem with this is that it depends on what you are adding. Register to register is 1; reg, mem will be more.
This might just be their weighted balance, though.

I think the idea of this table is to weight which instructions to use.

Take and, or, add, sub, shr, shl: the compiler can choose whichever one completes the operation in fewer cycles. There are a lot more instructions than that, but that's what I think this is for.
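
To make the "weighting" idea concrete, here is a rough sketch of picking the cheaper of several equivalent instruction sequences from the quoted costs. It assumes `COSTS_N_INSNS(N)` expands to `N * 4`, which is how I recall GCC's rtl.h defining it, so treat that as an assumption. The `candidate` struct, the `cheapest` helper and the candidate lists are hypothetical and only illustrate the comparison.

```c
#include <assert.h>

/* Assumption: GCC scales costs so that one simple instruction is 4 units. */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Values copied from the amdfam10_cost table quoted in the first post. */
enum {
    COST_ADD     = COSTS_N_INSNS(1), /* add instruction */
    COST_LEA     = COSTS_N_INSNS(2), /* lea instruction */
    COST_SHIFT   = COSTS_N_INSNS(1), /* constant shift */
    COST_MULT_SI = COSTS_N_INSNS(3)  /* starting a 32-bit multiply */
};

/* One way of computing a result, with its estimated cost. */
struct candidate {
    const char *insns;
    int cost;
};

/* Hypothetical helper: return the candidate with the lowest cost
   (the first one wins on a tie). */
static const struct candidate *cheapest(const struct candidate *c, int n)
{
    assert(c != 0 && n > 0);
    const struct candidate *best = &c[0];
    for (int i = 1; i < n; i++) {
        if (c[i].cost < best->cost)
            best = &c[i];
    }
    return best;
}

/* Two ways to compute x * 5. */
static const struct candidate times_five[] = {
    {"imul eax, eax, 5",     COST_MULT_SI},
    {"lea eax, [eax+eax*4]", COST_LEA}
};

/* Two ways to compute x * 8. */
static const struct candidate times_eight[] = {
    {"imul eax, eax, 8", COST_MULT_SI},
    {"shl eax, 3",       COST_SHIFT}
};
```

With these numbers, `cheapest(times_five, 2)` picks the `lea` form and `cheapest(times_eight, 2)` picks the shift, because both are cheaper than starting a multiply.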

EDIT:

A quick example here.

var1 = 30
subtract 8 from var 1
answer 22;

now in assembler.

[code:1:e1e551c7b1]
mov ecx, 30    ; ecx = 30
sub ecx, 8     ; ecx = 30 - 8 = 22
[/code:1:e1e551c7b1]

[code:1:e1e551c7b1]
mov ecx, 30          ; ecx = 30
add ecx, 0FFFFFFF8h  ; add -8 in two's complement, ecx = 22
[/code:1:e1e551c7b1]

Which one is quicker? :wink:
As far as I remember, a sub instruction is 2 clocks.
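
The two sequences give the same answer because FFFFFFF8 is the 32-bit two's complement of 8. A small C sketch of that equivalence follows; the function names `via_sub`, `via_add` and `check_equivalence` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* a - b computed directly, as `sub` would, with 32-bit wraparound. */
static uint32_t via_sub(uint32_t a, uint32_t b)
{
    return a - b;
}

/* a - b computed as a + (two's complement of b), which is what
   adding 0xFFFFFFF8 does when b is 8. */
static uint32_t via_add(uint32_t a, uint32_t b)
{
    uint32_t negated = ~b + 1u; /* for b = 8 this is 0xFFFFFFF8 */
    return a + negated;
}

/* Both routes must agree for any pair of 32-bit values. */
static void check_equivalence(uint32_t a, uint32_t b)
{
    assert(via_sub(a, b) == via_add(a, b));
}
```

Calling `via_sub(30, 8)` and `via_add(30, 8)` both give 22, so which form is faster is purely a question of instruction cost on the target chip.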
April 21, 2007 1:11:19 PM

Quote:
How do they get to use SSE4A when those instructions are SSE3?
I see some new vectors, but the others you listed are from SSE3. Also, that seems like a really short list of instructions for the processor. Are there more, and those are just the changes, or is that it?

thanks
beer


Good question. Looking at the file and some of the other files like ammintrin.h, it appears that those are the only new instructions for SSE4A. AMD was supposed to be introducing more SSE4 instructions at some later date, so these might very well be the only ones introduced in the Barcelona/Agena. Also, the user "hjagasia" replied on the mailing list and the e-mail address is an amd.com one, which is not surprising.
April 21, 2007 1:17:16 PM

I suppose it would be possible to write the compiler code without ever compiling anything with it on the target chip, as long as you know what the instructions will do and what penalties additions, subtractions, etc. have. But since this is an optimization routine, I'd still have to think that it would have had to run a few times to confirm that everything works as it is supposed to. If the chips were not around for testing and no specific optimizations were given for it, then one would simply compile code using generic optimizations. The GCC developers didn't have Core 2 chips until after they shipped (Intel has its own compiler, icc), so generic optimizations are what C2D users are relying on.