Trying to understand how AMD FX series work

Pulssqt

Reputable
Oct 21, 2014
171
0
4,690
Hello..

I'll just type what I think I know, then you correct me if I'm wrong.

So basically, it's an 8-core processor.
It has 4 modules, each containing 2 cores that share some resources, including the FPU and L2 cache.
One module has a 256-bit FPU that can also act as 2x128-bit FPUs if needed, so each core can do its own job.

Each module has decoders, fetch hardware, execution units (EUs), and an I/O pipeline. The two Piledriver cores in a module share the decoders and the fetch stage.

Now, the thing I don't understand:
When can and can't they use 4 or 8 cores? If they share the FPU, does that mean they can always use all 8 cores normally, since each core gets a 128-bit FPU?
Or, when one core is working, is the other one 'waiting'? But why is it 2x128-bit then? Is it because the 8 cores can sometimes do the work, and sometimes it has to be done with 4?
 
Solution

Shain Taylor

Honorable
Mar 21, 2013
411
0
10,960
If a program uses the FPU, the CPU only has 4 FPUs, so the program effectively sees 4 cores. When a program does not use the FPUs and is coded to specifically use the cores, it will use however many of the 8 cores it desires.
 

Pulssqt


OHH like that!!! Thank you.

So, the program decides whether it 'wants' to use cores or FPUs, right?
If cores, it will use 8; if FPUs, it will use 4.

And what is the 2x128-bit FPU for, then? Without it, would it not be able to use all 8 cores at the same time?
 


All x86 CPUs from AMD and Intel use a compatible logical processor interface. The physical arrangement of the resources inside of the CPU package may affect performance, but they do not affect capability.

Understanding how this affects the FPU requires knowledge of how the FPU used in Intel's and AMD's microprocessors works.

In the early 1990s the FPU was a coprocessor that performed scalar floating point arithmetic. Intel integrated the FPU into the CPU starting with the 80486. AMD of course followed suit. This operational stack is called x87 and uses its own set of 8x80-bit CPU registers in addition to the original 8x32-bit general purpose registers.

In the mid to late 1990s Intel began adding additional instruction sets to extend the arithmetic capabilities of their microarchitecture. The first of these was called MMX (it's not an acronym). MMX shares the same CPU registers as the x87 floating point operations but enables vector integer math. Vector operations are not possible on the general purpose registers. Vector operations enable data-level parallelism by performing the same operation on multiple sets of data. For example, a general purpose CPU register on a Pentium MMX can be treated as an 8-bit value, a 16-bit value, or a 32-bit value. A 64-bit MMX register (sharing the same physical location as an 80-bit x87 register) can be treated as a pair of 32-bit values, four 16-bit values, or eight 8-bit values. This enables acceleration of certain mathematical formulas that are highly data parallel, such as matrix multiplication. The CPU can perform up to eight arithmetic operations at once.
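The lane reinterpretation described above can be sketched in plain Python using `struct` (a conceptual illustration only; real MMX code would use intrinsics or assembly):

```python
import struct

# The same 8 bytes of register data, reinterpreted at different lane widths
# (conceptually what MMX does with one 64-bit register).
raw = struct.pack("<8B", 1, 2, 3, 4, 5, 6, 7, 8)

as_bytes  = struct.unpack("<8B", raw)   # eight 8-bit lanes
as_words  = struct.unpack("<4H", raw)   # four 16-bit lanes
as_dwords = struct.unpack("<2I", raw)   # two 32-bit lanes

print(as_bytes)   # (1, 2, 3, 4, 5, 6, 7, 8)
print(as_words)   # (513, 1027, 1541, 2055)
print(as_dwords)  # (67305985, 134678021)

# A lane-wise add, like MMX's PADDB: one "instruction", eight additions.
a = tuple(range(8))
b = (10,) * 8
paddb = tuple((x + y) & 0xFF for x, y in zip(a, b))
print(paddb)  # (10, 11, 12, 13, 14, 15, 16, 17)
```

The point is that the hardware does all eight lane additions in a single instruction, which is where the speedup for data-parallel workloads comes from.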

Next up is the SSE stack. SSE (Streaming SIMD Extensions) was designed to integrate and replace x87 and MMX. SSE debuted with 8x128-bit registers and successive revisions (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2) have added new instructions. Right now the SSE stack is responsible for vector integer arithmetic, vector logicals, scalar floating point arithmetic, and vector floating point arithmetic. The SSE stack was extended to 16x128-bit registers for use with the 64-bit long mode.

The newest revision, AVX (Advanced Vector Extensions), extends the SSE stack to 256-bit registers. AVX adds 256-bit vector floating point operations, and AVX2 adds 256-bit vector integer operations. AVX-512 further extends the registers to 512 bits, but right now AVX-512 is only used on Intel's Xeon Phi coprocessors.

AMD's implementation of the floating point hardware on their FX series microprocessors is split. The unit may operate on separate SSE instructions from both of the frontends to which it is coupled at the same time, or it may operate on AVX instructions from one of the frontends to which it is coupled. If both frontends issue AVX instructions, one of them will have to wait in the reservation station temporarily.
 
Solution

Pulssqt


Thank you! I'm just gonna continue to read about that so I can understand everything you wrote.
 


Feel free to send any more questions that you may have my way.
 

Pulssqt

Thank you. Will do.

One question for now.

I'll try to make it simple.
So it's up to the program 'to decide' whether it wants to use the cores (8) or the FPUs (4)? Not that the CPU decides what it wants to use, but different programs use different resources?

Or am I completely wrong.

 


Programs are run through what are called logical processors. Each logical processor acts as a state tracker for one of the microprocessor's architectural front ends. AMD's FX series microprocessors expose two logical processors per module (one per core) while Intel's i7 series microprocessors expose two logical processors per core (Hyperthreading). Instructions from the front end(s) of each core are decoded and issued to the back end where the actual execution happens. Although each logical processor may not be architecturally independent, and there may be certain performance considerations involved with using them properly, each logical processor is functionally independent. Instructions from a thread running on a logical processor modify the state of that logical processor (and when saved to memory, the thread) and the system's memory only.
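You can see the logical processors the OS exposes from just about any language. In Python, for instance (the count printed depends entirely on the machine):

```python
import os

# Number of logical processors the OS exposes. On an FX-8350 this reports 8
# (two per module), just as an i7 with Hyperthreading reports two per core.
logical = os.cpu_count()
print(f"logical processors: {logical}")
```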

The program itself has little control over what logical processors it uses. Scheduling threads onto the computer's logical processors is entirely the responsibility of the kernel thread scheduler. By default, most operating systems will allow a process to create as many kernel threads (these are the threads that the kernel schedules onto logical processors, user threads are a different matter) as the process wishes and will allow the process to provide hints on how to schedule those threads. For example, in an operating system that supports both multiprocessing and multithreading each process starts with one kernel thread that inherits the process's default priority (usually normal) and default affinity (usually all logical processors). The process may then ask the operating system to create additional kernel threads as needed, as well as alter the priority and affinity on a per-thread basis.
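The default affinity mentioned above can be inspected directly on Linux (a sketch; `os.sched_getaffinity` is Linux-only, hence the guard):

```python
import os

# The kernel, not the program, decides where threads run; a process can only
# supply hints. On Linux the affinity mask defaults to all logical processors.
if hasattr(os, "sched_getaffinity"):
    mask = sorted(os.sched_getaffinity(0))   # 0 = the calling process
    print(f"schedulable on logical processors: {mask}")
else:
    mask = None  # platform without the Linux scheduling API

# Priority works the same way, as a hint: os.nice() lowers the process's
# priority, and raising it again usually requires privileges.
```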

So, if a process wishes to use eight logical processors, it must spawn a total of at least eight kernel threads. It can do this either by spawning additional threads in a multithreaded environment (this is becoming increasingly popular), by spawning additional child processes in a multiprocess environment, or by spawning a mix of child processes each with more than one thread for a mix of both multiprocessing and multithreading. When and how the kernel threads corresponding to each process get scheduled on the logical processors is up to the operating system's thread scheduler.
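As a minimal sketch of the multithreaded route: each worker in a CPython thread pool is backed by one kernel thread, so a pool of eight gives the scheduler eight threads it *could* place on eight logical processors (whether they actually run in parallel is up to the OS and, in Python's case, the workload):

```python
from concurrent.futures import ThreadPoolExecutor

def work(n):
    return n * n          # stand-in for a real per-thread workload

# Eight schedulable kernel threads, one per pool worker.
with ThreadPoolExecutor(max_workers=8) as pool:
    squares = list(pool.map(work, range(8)))

print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
```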

As for the use of the SSE/AVX stack, they behave just like the general purpose stack. Each kernel thread has a location in memory which stores the state of the registers as they appeared when the thread was descheduled. When the thread is descheduled, the values in each of the registers are saved to memory. When the thread is scheduled again, the values are loaded from memory back into the registers. From the perspective of the thread, there is no real time gap between the time that it was descheduled from the logical processor and the time that it was rescheduled on the logical processor. It is blissfully unaware that there is anything else running on the computer. What the program does with the registers to which it has access is entirely up to that program. The only requirement is that the operating system's scheduler must be aware of the stack's existence so that it knows to save the registers.
For example, the x86_64 ABI was standardized around SSE2, which means that all 64-bit operating systems must save the 16x64-bit general purpose registers and the 16x128-bit SIMD registers to memory each time a thread is unloaded and reloaded. The AVX extensions, however, are not a part of the base x86_64 ABI, which means that an operating system unaware of the AVX extensions, such as Windows Vista or Windows 7 without SP1, may only save the 128-bit lower half of the 256-bit SIMD registers present on microprocessors that support AVX or AVX2. This makes it unsafe for a program to use the AVX extensions in either Windows Vista or Windows 7 (pre SP1).
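The save/restore dance can be sketched with a toy scheduler (everything here is made up for illustration: a pretend "CPU" with just two registers, `rax` and `xmm0`):

```python
# Toy context switch: registers are saved to a per-thread state block in
# memory on deschedule and loaded back on schedule.
class Thread:
    def __init__(self):
        self.saved = {"rax": 0, "xmm0": 0.0}    # state block in memory

cpu = {"rax": 0, "xmm0": 0.0}                   # the physical registers

def deschedule(thread):
    thread.saved = dict(cpu)                    # registers -> memory

def schedule(thread):
    cpu.update(thread.saved)                    # memory -> registers

t1, t2 = Thread(), Thread()

schedule(t1)
cpu["rax"], cpu["xmm0"] = 42, 3.14              # t1 computes something
deschedule(t1)

schedule(t2)
cpu["rax"] = 999                                # t2 clobbers the registers
deschedule(t2)

schedule(t1)                                    # t1 resumes, none the wiser
print(cpu)  # {'rax': 42, 'xmm0': 3.14}
```

If the scheduler didn't know about a register (the AVX upper halves in the pre-SP1 Windows case above), that register simply wouldn't be in the saved state block, and t2's clobbering would leak into t1.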

In order for everything to run nice and smooth the microarchitecture and operating system must be designed in such a way that they respect the defined behaviour of the logical processor and the ABI regardless of how the core is laid out. From the perspective of instruction execution, it doesn't matter if two logical processors share backend execution resources such as the FPU. The total instruction throughput of a shared FPU may be less than the total instruction throughput of discrete FPUs but the end result will be the same; the discrete FPU may just get there faster. Heck, Intel's Hyperthreading works by sharing the entire back end, not just the FPU. All that matters is that the instructions from each logical processor modify the state of that logical processor and the system memory (shared access to system memory is another matter entirely).