Sign in with
Sign up | Sign in

MMX Technology: SSE And 3DNow!

Upgrading And Repairing PCs 21st Edition: Processor Features
By

MMX technology was originally named for multimedia extensions, or matrix math extensions, depending on whom you ask. Intel officially states that it is actually not an abbreviation and stands for nothing other than the letters MMX (not being an abbreviation was apparently required so that the letters could be trademarked); however, the internal origins are probably one of the preceding. MMX technology was introduced in the later fifth-generation Pentium processors as a kind of add-on that improves video compression/decompression, image manipulation, encryption, and I/O processing—all of which are used in a variety of today’s software.

MMX consists of two main processor architectural improvements. The first is basic: All MMX chips have a larger internal L1 cache than their non-MMX counterparts. This improves the performance of any and all software running on the chip, regardless of whether it actually uses the MMX-specific instructions.

The other part of MMX is that it extends the processor instruction set with 57 new commands or instructions, as well as a new instruction capability called single instruction, multiple data (SIMD).

Modern multimedia and communication applications often use repetitive loops that, while occupying 10% or less of the overall application code, can account for up to 90% of the execution time. SIMD enables one instruction to perform the same function on multiple pieces of data, similar to a teacher telling an entire class to “sit down,” rather than addressing each student one at a time. SIMD enables the chip to reduce processor-intensive loops common with video, audio, graphics, and animation.

Intel also added 57 new instructions specifically designed to manipulate and process video, audio, and graphical data more efficiently. These instructions are oriented to thehighly parallel and often repetitive sequences frequently found in multimedia operations. Highly parallel refers to the fact that the same processing is done on many data points, such as when modifying a graphic image. The main drawbacks to MMX were that it worked only on integer values and used the floating-point unit for processing, so time was lost when a shift to floating-point operations was necessary. These drawbacks were corrected in the additions to MMX from Intel and AMD.

Intel licensed the MMX capabilities to competitors such as AMD and Cyrix (later absorbed by VIA), who were then able to upgrade their own Intel-compatible processors with MMX technology.

SSE

In February 1999, Intel introduced the Pentium III processor and included in that processor an update to MMX called Streaming SIMD Extensions (SSE). These were also called Katmai New Instructions (KNI) up until their debut because they were originally included on the Katmai processor, which was the code name for the Pentium III. The Celeron 533A and faster Celeron processors based on the Pentium III core also support SSE instructions. The earlier Pentium II and Celeron 533 and lower (based on the Pentium II core) do not support SSE.

The Streaming SIMD Extensions consist of 70 new instructions, including SIMD floating point, additional SIMD integer, and cacheability control instructions. Some of the technologies that benefit from the Streaming SIMD Extensions include advanced imaging, 3D video, streaming audio and video (DVD playback), and speech-recognition applications.

The SSEx instructions are particularly useful with MPEG-2 decoding, which is the standard scheme used on DVD video discs. Therefore, SSE-equipped processors should be more capable of performing MPEG-2 decoding in software at full speed without requiring an additional hardware MPEG-2 decoder card. SSE-equipped processors are also much better and faster than previous processors when it comes to speech recognition.

One of the main benefits of SSE over plain MMX is that it supports single-precision floating-point SIMD operations, which have posed a bottleneck in the 3D graphics processing. Just as with plain MMX, SIMD enables multiple operations to be performed per processor instruction. Specifically, SSE supports up to four floating-point operations per cycle; that is, a single instruction can operate on four pieces of data simultaneously. SSE floating-point instructions can be mixed with MMX instructions with no performance penalties. SSE also supports data prefetching, which is a mechanism for reading data into the cache before it is actually called for.

SSE includes 70 new instructions for graphics and sound processing over what MMX provided. SSE is similar to MMX; in fact, besides being called KNI, SSE was called MMX-2 by some before it was released. In addition to adding more MMX-style instructions, the SSE instructions allow for floating-point calculations and now use a separate unit within the processor instead of sharing the standard floating-point unit as MMX did.

SSE2 was introduced in November 2000, along with the Pentium 4 processor, and adds 144 additional SIMD instructions. SSE2 also includes all the previous MMX and SSE instructions.

SSE3 was introduced in February 2004, along with the Pentium 4 Prescott processor, and adds 13 new SIMD instructions to improve complex math, graphics, video encoding, and thread synchronization. SSE3 also includes all the previous MMX, SSE, and SSE2 instructions.

SSSE3 (Supplemental SSE3) was introduced in June 2006 in the Xeon 5100 series server processors, and in July 2006 in the Core 2 processors. SSSE3 adds 32 new SIMD instructions to SSE3.

SSE4 (also called HD Boost by Intel) was introduced in January 2008 in versions of the Intel Core 2 processors (SSE4.1) and was later updated in November 2008 in the Core i7 processors (SSE4.2). SSE4 consists of 54 total instructions, with a subset of 47 instructions comprising SSE4.1, and the full 54 instructions in SSE4.2.

Advanced vector extensions (AVX) was introduced in January 2011 in the second-general Core i-series “Sandy Bridge” processors and is also supported by AMD’s new “Bulldozer” processor family. AVX is a new 256-bit instruction set extension to SSE, comprising 12 new instructions. AVX helps floating-point intensive applications such as image and A/V processing, scientific simulations, financial analytics, and 3D modeling and analysis to perform better. AVX is supported on Windows 7 SP1, Windows Server 2008 R2 SP1, and Linux kernel version 2.6.30 and higher. For AVX support on virtual machines running on Windows Server R2, see http://support.microsoft.com/kb/2517374 for a hotfix.

For more information about AVX, see http://software.intel.com/en-us/avx/. Although AMD has adopted Intel SSE3 and earlier instructions in the past, instead of adopting SSE4, AMD has created a different set of only four instructions it calls SSE4a. Although AMD had planned to develop its own instruction set called SSE5 and release it as part of its new “Bulldozer” processor architecture, it decided to shelve SSE5 and create new instruction sets that use coding compatible with AVX. The new instruction sets include

  • XOP—Integer vector instructions
  • FMA4—Floating point instructions
  • CVT16—Half-precision floating point conversion

3DNow!

3DNow! technology was originally introduced as AMD’s alternative to the SSE instructions in the Intel processors. It included three generations: 3D Now!, Enhanced 3D Now!, and Professional 3D Now! (which added full support for SSE). AMD announced in August 2010 that it was dropping support for 3D Now!-specific instructions in upcoming processors.

Ask a Category Expert

Create a new thread in the Reviews comments forum about this subject

Example: Notebook, Android, SSD hard drive

Display all 18 comments.
This thread is closed for comments
Top Comments
  • 13 Hide
    ta152h , October 31, 2013 12:30 AM
    Ugggh, got to page two before being disgusted this time. This author is back to writing fiction.

    The Pentium (5th generation, in case the author didn't know, thus the "Pent"), DID execute x86 instructions. It was the Pentium Pro that didn't. That was the sixth generation.

    CISC and RISC are not arbitary terms, and RISC is better when you have a lot of memory, that's why Intel and AMD use it for x86. They can't execute x86 instructions effectively, so they break it down to RISC type operations, and then execute it. They pay the penalty of adding additional stages in the pipeline which slows down the processor (greater branch mispredict penalty), adds size, and uses power. If they are equal, why would anyone take this penalty?

    Being superscalar has nothing to do with being RISC or CISC. Admittedly, the terms aren't carved in stone, and the term can be misleading, as it's not necessarily the number of instructions that defines RISC. Even so, there are clear differences. RISC has fixed length instructions. CISC generally does not. RISC has much simpler memory addressing modes. The main difference is, RISC does not have microcoding to execute instructions - everything is done in hardware. Obviously, this strongly implies much simpler, easier to execute instructions, which make it superior today. However, code density is less for RISC, and that was very important in the 70s and early 80s when memory was not so large. Even now, better density means better performance, since you'll hit the faster caches more often.

    This article is also wrong about 3D Now! It was not introduced as an alternative to SSE, SSE was introduced as an alternative to 3D Now!, which predated SSE. In reality, 3D Now! was released because the largest difference between the K6 and Intel processors was floating point. Games, or other software that could use 3D Now!, rather than relying entirely on x87 instructions, could show marked performance improvement for the K6-2. It was relatively small to implement, and in the correct workloads could show dramatic improvements. But, of course, almost no one used it.

    The remarks about the dual bus are inaccurate. The reason was that motherboard bus speeds were not able to keep up with microprocessors speeds (starting with the 486DX2). Intel suffered the much slower bus speed to the L2 cache on the Pentium and Pentium MMX, but moved the L2 cache on the same processor package (but not on the same die) with the Pentium Pro. The purpose of having the separate buses was that one could access the L2 cache at a much higher speed; it wasn't limited to the 66 MHz bus speed of the motherboard. The Pentium Pro was never intended to be mainstream, and was too expensive, so Intel moved the L2 cache onto the Slot 1 cartridge, and ran it at half bus speed, which in any case was still much faster than the memory bus.

    That was the main reason they went to the two buses.

    That was as far as I bothered to read this. It's a pity people can't actually do fact checking when they write books, and make up weird stories that only have a passing resemblance to reality.

    And then act like someone winning this misinformation is lucky. Good grief, what a perverse world ...

Other Comments
  • 2 Hide
    k1114 , October 30, 2013 10:11 PM
    Keep it coming.
  • 1 Hide
    renzhe , October 30, 2013 10:39 PM
    9412 pins; imagine that.
  • 13 Hide
    ta152h , October 31, 2013 12:30 AM
    Ugggh, got to page two before being disgusted this time. This author is back to writing fiction.

    The Pentium (5th generation, in case the author didn't know, thus the "Pent"), DID execute x86 instructions. It was the Pentium Pro that didn't. That was the sixth generation.

    CISC and RISC are not arbitary terms, and RISC is better when you have a lot of memory, that's why Intel and AMD use it for x86. They can't execute x86 instructions effectively, so they break it down to RISC type operations, and then execute it. They pay the penalty of adding additional stages in the pipeline which slows down the processor (greater branch mispredict penalty), adds size, and uses power. If they are equal, why would anyone take this penalty?

    Being superscalar has nothing to do with being RISC or CISC. Admittedly, the terms aren't carved in stone, and the term can be misleading, as it's not necessarily the number of instructions that defines RISC. Even so, there are clear differences. RISC has fixed length instructions. CISC generally does not. RISC has much simpler memory addressing modes. The main difference is, RISC does not have microcoding to execute instructions - everything is done in hardware. Obviously, this strongly implies much simpler, easier to execute instructions, which make it superior today. However, code density is less for RISC, and that was very important in the 70s and early 80s when memory was not so large. Even now, better density means better performance, since you'll hit the faster caches more often.

    This article is also wrong about 3D Now! It was not introduced as an alternative to SSE, SSE was introduced as an alternative to 3D Now!, which predated SSE. In reality, 3D Now! was released because the largest difference between the K6 and Intel processors was floating point. Games, or other software that could use 3D Now!, rather than relying entirely on x87 instructions, could show marked performance improvement for the K6-2. It was relatively small to implement, and in the correct workloads could show dramatic improvements. But, of course, almost no one used it.

    The remarks about the dual bus are inaccurate. The reason was that motherboard bus speeds were not able to keep up with microprocessors speeds (starting with the 486DX2). Intel suffered the much slower bus speed to the L2 cache on the Pentium and Pentium MMX, but moved the L2 cache on the same processor package (but not on the same die) with the Pentium Pro. The purpose of having the separate buses was that one could access the L2 cache at a much higher speed; it wasn't limited to the 66 MHz bus speed of the motherboard. The Pentium Pro was never intended to be mainstream, and was too expensive, so Intel moved the L2 cache onto the Slot 1 cartridge, and ran it at half bus speed, which in any case was still much faster than the memory bus.

    That was the main reason they went to the two buses.

    That was as far as I bothered to read this. It's a pity people can't actually do fact checking when they write books, and make up weird stories that only have a passing resemblance to reality.

    And then act like someone winning this misinformation is lucky. Good grief, what a perverse world ...

  • 0 Hide
    Reynod , October 31, 2013 3:23 AM
    ta152h sir you are correct.

  • 0 Hide
    spookyman , October 31, 2013 5:47 AM
    Yes you are correct on the bus issue. VESA local bus was designed to overcome the limitations of the ISA bus.

    As for the reason Intel went with a slot design for the Pentium 2 was to prevent AMD from using it. You can patent and trademark a slot design.

    As for the Pentium Pro, it had issues from handling 16bit x86 instruction sets. The solution was to program around it. The was an inherent computational flaw with the Pentium Pro too.
  • 0 Hide
    Kraszmyl , October 31, 2013 5:54 AM
    I don't think there is a single page that isn't piled with inaccurate or incomplete information.......this is perhaps the worst thing I've ever read on tomshardware and I don't see how you let it get published.
  • 0 Hide
    therogerwilco , October 31, 2013 6:22 AM
    Kinda nice for generic info, was hoping for more explanation of some of the finer points of cpu architecture
  • 7 Hide
    Reynod , October 31, 2013 6:56 AM
    Perhaps the most important thing to note from this is just how clever some of our users are ... so get into the forums and help out the n00bs with their problems guys !!

    :) 
  • 1 Hide
    Sprongy , October 31, 2013 8:20 AM
    Not to be anal but aren't all Core i3 processors, dual cores (2). Some have Hyper-Threading to make it like 4 cores. The last chart above should read Core i3 - 2 cores. Just saying...
  • 0 Hide
    ingtar33 , October 31, 2013 8:37 AM
    Quote:
    Not to be anal but aren't all Core i3 processors, dual cores (2). Some have Hyper-Threading to make it like 4 cores. The last chart above should read Core i3 - 2 cores. Just saying...


    not on mobile. some mobile i3s are single core, same with the mobile i5s... those are all dual core... with hyperthreading.

    there are even dual core i5s in haswell on the desktop. (they are the ones with a (t) after the number)
  • 2 Hide
    rolli59 , October 31, 2013 9:15 AM
    Well although it is full of minor misinformation it is a good insight for a reader that does not know much about it.
  • -1 Hide
    ezorb , October 31, 2013 9:17 AM
    please stop putting this crap on the site, its better to post nothing on a slow Friday
  • 1 Hide
    Nintendo Maniac 64 , October 31, 2013 11:25 AM
    Llano is not based on Bulldozer but rather is based on a slightly improved K10 (typically dubbed "K10.5").
  • 0 Hide
    ronch79 , October 31, 2013 5:50 PM
    Do AMD processors also feature reprogrammable microcode? I'm using an FX-8350 and before it I was using a Phenom II X4 925 (unlocked X3 720).
  • 0 Hide
    turboflame , November 1, 2013 9:45 AM
    Yeah, this wasn't particularly well researched. Quite a few minor mistakes, not to mention it reads like an Intel advertisement, with AMD's contribution to modern PCs being either downplayed or omitted entirely.
  • 0 Hide
    Geef , November 1, 2013 4:16 PM
    After seeing that story they had up a couple days ago about HUBS where the person actually talked about what SWITCHES do, not hubs.
    Since then I make sure I come into Tomshardware articles expecting stuff to be incorrect. It makes me sad, I used to come here for new tech info but now I'm not so sure...
  • 0 Hide
    catfishtx , November 4, 2013 11:44 AM
    I worked for Intel during the time period that they released the Pentium MMX processors. They told us that MMX stood for Multi Media eXtensions.
  • 0 Hide
    falcosoft , November 17, 2013 11:54 AM
    "Note: Most applications that formerly used floating-point math now use MMX/SSE instructions instead. These instructions are faster and more accurate than x87 floating-point math."

    Quite the contrary, x87 CAN BE more accurate than SSE but not the way around. X87 knows and uses 80 bit floating point data internally while SEE (and AVX) can only use 64 bit floating point data. This sentence will be true if 128 bit precision is implemented in the future.