1) The best thing to start with is an explanation of MMX and how it is useful.
First of all, the basic 386 integer unit is designed to work with one integer at a time, in sizes of 8, 16, or 32 bits. It can do basic arithmetic (add, subtract, multiply, divide), boolean arithmetic (AND, OR, XOR, NOT, etc.), and rotating and shifting bits. It typically performs operations one at a time, no matter what the operand size is.
MMX uses a set of eight registers (MM0-MM7) which are designed to operate a bit differently. (Physically they alias the existing x87 floating-point registers, but they are addressed directly rather than as a stack.) Although they are still primarily integer registers and still do the same operations, each register is 64 bits wide and capable of holding several operands at once.
2) When a 386 integer unit adds two 8-bit integers, it uses a single 32-bit register for each integer--never mind that 24 of those 32 bits are going to waste in each register. If you have three more pairs of 8-bit integers to add, you have to add one pair at a time; you can't fit four in one register, fit four in another register, and separately add all four pairs with one instruction.
If you have four pairs of 8-bit integers, you can fit them in two different MMX registers and perform all four additions in a single instruction. In fact, since each MMX register is 64 bits wide, you can fit four more pairs into those same two registers and still take care of all eight pairs in a single operation. This performs a single operation on multiple pairs of data at once--hence the acronym SIMD (Single Instruction, Multiple Data).
This proves most useful when you have one or more lists of values, and you want to perform identical operations on all elements of those lists in sequence. You could perform operations two-at-a-time, four-at-a-time, or eight-at-a-time using MMX (depending on the size of the list elements).
These optimizations are fairly easy to implement if you have an optimizing compiler that knows how to do them. Failing that, they can be done in assembly language, but they are much more difficult, especially if you don't have an assembler that knows about MMX.
Besides the difficulty of coding these optimizations, developers have to consider how backwards-compatible to make their code. If a developer optimizes for MMX exclusively, the compiled code will not work on non-MMX processors. If he optimizes for a 386 integer unit, the resulting code will run on a 386, but it will not perform at its maximum potential on MMX processors. If he optimizes for both architectures, the compiled code must have run-time conditional branching based on the processor's capabilities; this is slightly slower than optimizing exclusively for the faster architecture and makes the resulting executable more bloated in terms of run-time memory consumption and disk space.
Now for FPU discussion...
First of all, Intel's original x87 FPU architecture (IMHO) <b>sucks ass</b>. It was a poor design decision to begin with, but at the time of its inception, it didn't matter much; systems with x87 math coprocessors wouldn't be economically feasible for most people for years to come.
The x87 FPU consists of eight general data registers in a stack-based arrangement (as well as a few other miscellaneous control registers). x87 instructions can typically only work with the two topmost registers of that stack; in order to reach a register below those two, all the registers positioned "on top" of that register typically must have their contents stored somewhere in memory, then put back on the stack later. This is dreadfully inconvenient for developers, it incurs a severe performance penalty, and it makes it very, very difficult for the core to be pipelined for better performance. Intel/AMD design architects have managed it, but the fact is that the x87 FPU in general does not perform as well as the FPUs of RISC platforms (e.g. Alpha, PPC).
Moving on...
MMX is nice, but it only covers integer data streams; it is not designed to handle floating-point numbers (numbers with a fractional part). AMD cloned the MMX instruction set from Intel, added a few extra MMX instructions of their own, and offered a set of 3Dnow! instructions to handle SIMD tasks with floating-point numbers. Like MMX, the 3Dnow! architecture merely made use of the register space already existing in the x87 architecture (the MMX/3Dnow! registers alias the x87 stack). This did <i>not</i> mean 3Dnow! could be of use without rewriting code; it just meant that die space was conserved.
Intel responded to 3Dnow! with SSE, a set of SIMD extensions for floating-point which actually utilizes a whole new set of eight flat 128-bit registers (XMM0-XMM7), rather than the old stack-based floating-point registers (developers heave a sigh of relief). Now we have SSE2, which is basically SSE extended with double-precision floating-point and 128-bit integer instructions (rather useful additions).
Kelledin
<A HREF="http://www.linuxfromscratch.org/" target="_new">LFS</A>: "You don't eat or sleep or mow the lawn; you just hack your distro all day long."