MMX/SSE/SSE2/3DNow optimization

FatBurger

Illustrious
Ok, there has been a lot of debate about this recently, so I decided to start a thread to discuss it. Here's the info I hope to see:

1. What are these individual optimizations? What exactly are they used to do, who invented them, etc.

2. Implementation? How do they help, how easy are they to implement?


Some of the myths I wanted to dispel are:

1. Any program can benefit from any optimization.
2. Two programs being "SSE2 optimized" will benefit from it the same amount, and/or in the same way.
3. Programs either are optimized or not. There is no middle ground, and programs optimized later will function just as well as the first ones.

The above 3 are untrue (at least to a certain extent).



Anyhow, all you programmers can have your spot in the limelight now :)

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange>
 

Kelledin

Distinguished
1) The best thing to start with is an explanation of MMX and how it is useful.

First of all, the basic 386 integer unit is designed to work with one integer at a time, in sizes of 8, 16, or 32 bits. It can do basic arithmetic (add, subtract, multiply, divide), boolean arithmetic (AND, OR, XOR, NOT, etc.), and rotating and shifting bits. It typically performs operations one at a time, no matter what the operand size is.

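MMX adds its own set of eight 64-bit registers (MM0-MM7) which are designed to operate a bit differently. (Under the hood they are aliased onto the x87 floating-point registers, so MMX and x87 code can't be freely mixed, but to the programmer they look like a separate register set.) Although they are still primarily integer registers and still do mostly the same operations, each register is capable of holding several operands at once.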

2) When a 386 integer unit adds two 8-bit integers, it uses a single 32-bit register for each integer--never mind that 24 of those 32 bits are going to waste in each register. If you have three more pairs of 8-bit integers to add, you have to add one pair at a time; you can't fit four in one register, fit four in another register, and separately add all four pairs with one instruction.

If you have four pairs of 8-bit integers, you can fit them in two different MMX registers and perform all four additions in a single instruction. Since MMX registers are 64 bits wide, you can even fit four more pairs into those two registers and still take care of all eight pairs in a single operation. This effectively performs a single quick operation on multiple pairs of data--hence the acronym SIMD (for Single Instruction, Multiple Data).
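To make the "eight pairs in one operation" idea concrete, here is a rough sketch in plain C that emulates a packed 8-bit add inside an ordinary 64-bit integer. It is not real MMX code; the helper names (pack8, unpack8, packed_add8) are made up for illustration, and the masking trick is only there to keep carries from leaking between lanes, which the hardware does for you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack eight bytes into one 64-bit word, lane 0 in the lowest byte.
   (Hypothetical helper, stands in for loading an MMX register.) */
uint64_t pack8(const uint8_t lanes[8]) {
    assert(lanes != NULL);
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)
        w |= (uint64_t)lanes[i] << (8 * i);
    return w;
}

/* Split a 64-bit word back into its eight byte lanes. */
void unpack8(uint64_t w, uint8_t lanes[8]) {
    assert(lanes != NULL);
    for (int i = 0; i < 8; i++)
        lanes[i] = (uint8_t)(w >> (8 * i));
}

/* Add all eight 8-bit lanes at once; each lane wraps modulo 256
   and no carry crosses into the neighbouring lane. */
uint64_t packed_add8(uint64_t a, uint64_t b) {
    const uint64_t HIGH = 0x8080808080808080ULL; /* top bit of every lane */
    /* Add the low 7 bits of each lane; the sum fits in 8 bits per lane. */
    uint64_t low = (a & ~HIGH) + (b & ~HIGH);
    /* Fold in the top bits with XOR so nothing carries out of the lane. */
    return low ^ ((a ^ b) & HIGH);
}
```

One call to packed_add8 does the work of eight separate byte additions, which is the whole point of SIMD; a real MMX add just does it without the masking gymnastics.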

This proves most useful when you have one or more lists of values, and you want to perform identical operations on all elements of those lists in sequence. You could perform operations two-at-a-time, four-at-a-time, or eight-at-a-time using MMX (depending on the size of the list elements).
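As a sketch of the "identical operation over a whole list" case, the function below adds two byte arrays element by element. Note the swap: it uses SSE2 intrinsics (16 bytes per step) rather than MMX, simply because current compilers handle them more comfortably; the loop shape would be the same with eight bytes per step on MMX. The function name add_bytes is an assumption for illustration, and the code falls back to a plain loop when the compiler does not define __SSE2__.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* dst[i] = a[i] + b[i] (wrapping modulo 256) for n bytes. */
void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    /* Main loop: sixteen additions per instruction. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
#endif
    /* Tail (or the whole list without SSE2): one element at a time. */
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```

The scalar tail loop matters: list lengths are rarely an exact multiple of the register width, so the leftovers still have to be handled the old way.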

These optimizations are fairly easy to implement if you have an optimizing compiler that knows how to do them. Failing that, they can be done in assembly language, but they are much more difficult, especially if you don't have an assembler that knows about MMX.

Besides the difficulty of coding these optimizations, developers have to consider how backwards-compatible to make their code. If a developer optimizes for MMX exclusively, the compiled code will not work on non-MMX processors. If he optimizes for a 386 integer unit, the resulting code will run on a 386, but it will not perform at its maximum potential on MMX processors. If he optimizes for both architectures, the compiled code must have run-time conditional branching based on the processor's capabilities; this is slightly slower than optimizing exclusively for the faster architecture and makes the resulting executable more bloated in terms of run-time memory consumption and disk space.
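The "optimize for both" option usually boils down to picking a code path once at startup. Below is a minimal sketch of that pattern, under a few assumptions: the function names are hypothetical, the "wide" routine is only a stand-in written in portable C (real code would put MMX/SSE instructions there), and the CPU check relies on the GCC/Clang builtin __builtin_cpu_supports, reporting "no MMX" on any other compiler or platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*add_bytes_fn)(uint8_t *dst, const uint8_t *a,
                             const uint8_t *b, size_t n);

/* Baseline path: one byte per step, safe on any CPU. */
void add_bytes_plain(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}

/* Stand-in for the fast path: eight bytes per step in a 64-bit word.
   A real build would use MMX/SSE instructions here instead. */
void add_bytes_wide(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    const uint64_t HIGH = 0x8080808080808080ULL;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t x, y, sum;
        memcpy(&x, a + i, 8);
        memcpy(&y, b + i, 8);
        /* Per-lane add without carries crossing byte boundaries. */
        sum = ((x & ~HIGH) + (y & ~HIGH)) ^ ((x ^ y) & HIGH);
        memcpy(dst + i, &sum, 8);
    }
    add_bytes_plain(dst + i, a + i, b + i, n - i);  /* leftover bytes */
}

/* Ask the CPU what it supports (GCC/Clang on x86 only; otherwise 0). */
int cpu_has_mmx(void) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("mmx") != 0;
#else
    return 0;
#endif
}

/* Choose an implementation once; callers then go through the pointer. */
add_bytes_fn select_add_bytes(void) {
    add_bytes_fn fn = cpu_has_mmx() ? add_bytes_wide : add_bytes_plain;
    assert(fn != NULL);
    return fn;
}
```

This shows both costs mentioned above: the executable carries two copies of the routine, and every call pays for an indirect jump through the function pointer.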

Now for FPU discussion...

First of all, Intel's original x87 FPU architecture (IMHO) <b>sucks ass</b>. It was a poor design decision to begin with, but at the time of its inception, it didn't matter much; systems with x87 math coprocessors wouldn't be economically feasible for most people for years to come.

The x87 FPU consists of eight general data registers in a stack-based arrangement (as well as a few other miscellaneous control registers). x87 arithmetic instructions typically need the top of the stack, ST(0), as one of their operands; to work on values sitting further down, you have to shuffle them to the top first (with FXCH exchanges, or by storing to memory and reloading once the eight slots run out). This is dreadfully inconvenient for developers, it costs performance, and it makes it very, very difficult for the core to be pipelined for better performance. Intel/AMD design architects have managed it, but the fact is that the x87 FPU in general does not perform as well as the FPUs of RISC platforms (e.g. Alpha, PPC).
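For anyone who has never programmed a stack-based FPU, here is a toy model in C of how a*b + c*d gets evaluated on one. It is only a sketch: the struct and function names are invented, they loosely mimic the x87 FLD, FXCH, FMULP and FADDP instructions, and the model ignores everything else the real unit does (tag words, 80-bit precision, exceptions).

```c
#include <assert.h>

#define X87_SLOTS 8

/* Toy register stack: ST(0) is the most recently pushed live slot. */
typedef struct {
    double slot[X87_SLOTS];
    int depth;                /* number of live entries */
} toy_x87;

/* Pointer to ST(i), counting down from the top of the stack. */
double *st_ref(toy_x87 *f, int i) {
    assert(i >= 0 && i < f->depth);
    return &f->slot[f->depth - 1 - i];
}

/* FLD: push a value; it becomes the new ST(0). */
void fld(toy_x87 *f, double v) {
    assert(f->depth < X87_SLOTS);   /* only eight slots exist */
    f->slot[f->depth++] = v;
}

/* FXCH ST(i): swap the top of the stack with a deeper slot. */
void fxch(toy_x87 *f, int i) {
    double t = *st_ref(f, 0);
    *st_ref(f, 0) = *st_ref(f, i);
    *st_ref(f, i) = t;
}

/* FMULP: ST(1) = ST(1) * ST(0), then pop. */
void fmulp(toy_x87 *f) {
    *st_ref(f, 1) *= *st_ref(f, 0);
    f->depth--;
}

/* FADDP: ST(1) = ST(1) + ST(0), then pop. */
void faddp(toy_x87 *f) {
    *st_ref(f, 1) += *st_ref(f, 0);
    f->depth--;
}

/* Evaluate a*b + c*d using only top-of-stack operations. */
double eval_ab_plus_cd(double a, double b, double c, double d) {
    toy_x87 f = { {0}, 0 };
    fld(&f, a); fld(&f, b); fmulp(&f);   /* ST(0) = a*b            */
    fld(&f, c); fld(&f, d); fmulp(&f);   /* ST(0) = c*d, ST(1)=a*b */
    faddp(&f);                           /* ST(0) = a*b + c*d      */
    return *st_ref(&f, 0);
}
```

Every step has to be arranged so the operands end up on top at the right moment; with a flat register file you would just name the two registers you wanted.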

Moving on...

MMX is nice, but it only covers integer data streams; it is not designed to handle floating-point numbers (numbers with a fractional part). AMD cloned the MMX instruction set from Intel, added a few extra MMX instructions of their own, and offered a set of 3DNow! instructions to handle SIMD tasks with floating-point numbers. Like MMX, 3DNow! did not add a new register file; it reuses the MMX registers (which are themselves aliased onto the existing x87 floating-point registers), packing two 32-bit floats into each one. This did <i>not</i> mean 3DNow! could be of use without rewriting code, it just meant that die space was conserved.

Intel responded to 3DNow! with SSE, a set of SIMD extensions for floating-point which actually utilized a whole new set of conveniently-arranged registers (eight 128-bit registers, each holding four single-precision floats), rather than use the old stack-based floating-point registers (developers heave a sigh of relief). Now we have SSE2, which is basically SSE with some new (and rather useful) instructions added, notably double-precision floating-point and integer operations on the same 128-bit registers.
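Here is roughly what floating-point SIMD with SSE looks like from C, as a sketch rather than anything definitive. The function name scale_and_add is hypothetical; the intrinsics come from the compiler's xmmintrin.h header, and the code drops back to an ordinary loop if the compiler does not define __SSE__.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics */
#endif

/* dst[i] = a[i] * scale + b[i] for n single-precision floats. */
void scale_and_add(float *dst, const float *a, const float *b,
                   float scale, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE__)
    const __m128 vs = _mm_set1_ps(scale);      /* scale in all four lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);       /* four floats from a */
        __m128 vb = _mm_loadu_ps(b + i);       /* four floats from b */
        _mm_storeu_ps(dst + i, _mm_add_ps(_mm_mul_ps(va, vs), vb));
    }
#endif
    /* Leftover elements (or everything, without SSE). */
    for (; i < n; i++)
        dst[i] = a[i] * scale + b[i];
}
```

Notice there is no stack juggling anywhere: each intrinsic names its operands directly, which is exactly the relief mentioned above.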

Kelledin
<A HREF="http://www.linuxfromscratch.org/" target="_new">LFS</A>: "You don't eat or sleep or mow the lawn; you just hack your distro all day long."
 

FatBurger

Illustrious
> First of all, Intel's original x87 FPU architecture (IMHO) sucks ass.

Please, leave your opinions at the door.

 

Conqueror

Distinguished
Off the topic!

FATBURGER, how did you get your name to be written in green? Did you use the markup codes or html codes in your username when you signed up?

<font color=blue><i>Mankind must put an end to War,
or War will put an end to mankind!</i></font color=blue>
 

FatBurger

Illustrious
Please keep this thread on topic; if you have other questions, feel free to PM me.

 

Guest

>In reply to:
>> First of all, Intel's original x87 FPU architecture
>> (IMHO) sucks ass.
>Please, leave your opinions at the door.

Saying an Intel architecture sucks is not so much an opinion, as a statement of fact. Intel's implementation of those horrid architectures, on the other hand, is (IMHO) nothing short of miraculous.

I've always wondered what would happen if Intel engineers ever got to implement a good architecture. The closest we have is the XScale implementation of the ARM architecture. It blows away the speed of the ARM competition, and makes the Intel-architected i960 processors look sick.
 

Guest

Nice cut and paste from MIT's microprocessor guide...

-Spuddy

<font color=red>Being Evil Is Good. Cause I Can Be A Prick And Get Away With It.</font color=red> :lol:
 

lhgpoobaa

Illustrious
does the 386 play unreal tournament well?
NO.
Thus the original x87 FPU architecture sucks!

tee hee hee

Why do I feel like the lone sane voice in the mental asylum?
 

AmdMELTDOWN

Distinguished
I don't think there was a UT at the time, but if you look at the graphics demo scene, they did a lot with 386s and 486s!
Those coders were brilliant. Nowadays coders are freak'n lazy and write bloated code all day long, just look at Linux for example ;-)

"<b>AMD/VIA!</b>...you are <i>still</i> the weakest link, good bye!"
 

Matisaro

Splendid
This thread looks like it will be very informative, like the Rambus memory speed thread; kudos to fatty for posting it. Please keep it on topic, everyone, and don't let it become a flame war. (Even if this post isn't on topic, I just wanted to help burger keep it on task.)

::puts topic on post::

What are some of the new SSE2 instructions/extensions?

"The Cash Left In My Pocket,The BEST Benchmark"
No Overclock+stock hsf=GOOD!