3-way Instruction Decoder - It's Not Quite The Same, Baby!
To execute software, which is the very job of a processor, each CPU begins by decoding the program's machine instructions, 'translating' them into operations, or OPs, that the microprocessor can handle internally. AMD calls those OPs 'MOps', for Macro Operations; Intel calls them micro-OPs, or µOPs for short. The two are only loosely comparable: an AMD MOp can actually contain two operations, whereas an Intel µOP contains just one. Modern processors can decode common and frequently used instructions directly into these OPs extremely fast, and execute them very quickly as well (typically in one clock). Less common or very complex instructions need to be decoded in a slower process, which involves looking up the OPs in a ROM within the CPU, and the number of resulting OPs is often more than two. The part of the Athlon decoder that deals with the directly decodable instructions is called 'direct path'; the part for the complex instructions is called 'vector path'. The P6 architecture (Pentium Pro, Pentium II and Pentium III) is similar but less flexible, using only one path for both types of decode. Why is all this done? For Athlon, it is speed! For the P6, it's design simplicity.
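To make the two decode paths a bit more tangible, here is a toy sketch in Python. It only illustrates the idea of a fast direct lookup versus a slower ROM lookup that yields a longer OP sequence. The instruction names, the OP names and the table contents are all invented for this example and do not reflect real x86 or Athlon encodings.

```python
# Toy model of the two decode paths. Every table entry is hypothetical and
# exists only to illustrate the idea; these are not real Athlon encodings.

# Common instructions: decoded directly into one or two internal operations.
DIRECT_PATH = {
    "add_reg": ["op_add"],
    "mov_reg": ["op_mov"],
    "add_mem": ["op_load", "op_add"],
}

# Rare or complex instructions: the OP sequence is looked up in a ROM table
# and is usually longer than two operations.
MICROCODE_ROM = {
    "string_copy": ["op_load", "op_store", "op_inc_src",
                    "op_inc_dst", "op_dec_count"],
}


def decode(instruction):
    """Return (path, ops) for one instruction name from the toy tables."""
    if instruction in DIRECT_PATH:
        return "direct", list(DIRECT_PATH[instruction])
    if instruction in MICROCODE_ROM:
        return "vector", list(MICROCODE_ROM[instruction])
    raise ValueError(f"unknown instruction: {instruction}")


if __name__ == "__main__":
    for name in ("add_reg", "add_mem", "string_copy"):
        path, ops = decode(name)
        print(f"{name:12s} -> {path:6s} path, {len(ops)} OP(s)")
```

In this sketch the only difference between the paths is which table gets consulted; in real hardware the vector path is the slow one because of the ROM access.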
Let's compare the Athlon's decoders to the well-known P6 decoders. Intel's Pentium III has three parallel decoding units: one complex decoder and two simple decoders. Without going into the boring, super-technical detail, Intel has strict rules for using these decoders. Three instructions can be decoded at the same time only if no more than one of them is a complex instruction and the rest are simple instructions. Intel defines complex as an instruction that can be represented by no more than 4 µOPs. Simple is defined as an instruction that can be translated into a single µOP. Athlon can also decode only three instructions at the same time, but it comes with three fully capable decoders. This means that Athlon will decode virtually any combination of instructions with any of its decoders. It has no special rules like the P6 architecture. Let's say that this is performance advantage No. 1.
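The effect of those decoder rules can be sketched with a small Python model. This is a deliberate simplification under stated assumptions, not a description of the real hardware: instructions are decoded strictly in program order, each instruction is represented only by its OP count (1 to 4), the P6's complex decoder sits in the first slot, and fetch, alignment and microcoded instructions are ignored entirely. The function names are made up for this example.

```python
# Simplified decode-throughput model (assumptions: in-order decode, each
# instruction described only by its OP count, no fetch or alignment limits).

def p6_decode_cycles(uop_counts):
    """Cycles needed with one complex decoder (slot 0) and two simple ones."""
    if any(not 1 <= count <= 4 for count in uop_counts):
        raise ValueError("this toy model only handles 1 to 4 OPs per instruction")
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        i += 1  # slot 0: the complex decoder accepts whatever comes next
        for _ in range(2):  # slots 1 and 2: single-OP instructions only
            if i < n and uop_counts[i] == 1:
                i += 1
            else:
                break  # a complex instruction must wait for slot 0 next cycle
    return cycles


def athlon_decode_cycles(uop_counts):
    """Cycles needed with three fully capable decoders (any mix of three)."""
    return -(-len(uop_counts) // 3)  # ceiling division


if __name__ == "__main__":
    friendly = [2, 1, 1, 2, 1, 1]    # complex, simple, simple: ideal for P6
    unfriendly = [1, 2, 2, 1, 2, 2]  # complex instructions back to back
    for name, stream in (("friendly", friendly), ("unfriendly", unfriendly)):
        print(name, "P6:", p6_decode_cycles(stream),
              "Athlon:", athlon_decode_cycles(stream))
```

With an instruction stream that happens to arrive in complex, simple, simple order, both models need the same number of cycles. As soon as complex instructions follow each other, the P6 model stalls while the Athlon model keeps decoding three per clock.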
Link to AMD slide from Microprocessor Forum that explains this.