AMD Moves onto the Overtaking Lane

Why AMD's K7 Will Be Intel's Toughest Competitor Ever

The CPU Bus
As is already pretty well known, K7, and thus Slot A, does not use Intel's P6 GTL+ bus protocol but Digital's Alpha bus protocol 'EV6'. EV6 has a lot of architectural advantages over GTL+ to begin with, such as the 'point-to-point topology' for multiprocessing, and in the case of the K7 it even runs at 200 MHz. This means that K7 looks set to be the first CPU that can really take advantage of high-bandwidth memory types like direct RDRAM and DDR SDRAM. Intel's GTL+ running at 100 MHz has a peak bandwidth of only 800 MB/s, and at 133 MHz it will still reach only 1066 MB/s, so you have to wonder why Intel's next chipset for Katmai will have direct RDRAM support. Direct RDRAM, as well as DDR SDRAM running at 100 MHz, offers a peak bandwidth of 1.6 GB/s, and this bandwidth is only matched by K7's 200 MHz EV6 bus. I guess that AMD will have to thank Intel for pushing direct RDRAM, because K7 seems to be the first CPU that will really need it.
Once again in short: K7's EV6 offers excellent multiprocessor support and the highest bus bandwidth, and is overall superior to GTL+.
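The bandwidth figures quoted above are simple arithmetic: peak bandwidth is the transfer rate multiplied by the width of the data path. The sketch below assumes an 8-byte (64-bit) data path for both GTL+ and the K7's EV6 bus, which is what the 800 MB/s at 100 MHz figure implies; the function name is purely illustrative.

```python
def peak_bandwidth_mb_s(transfer_rate_mhz: float, bus_width_bytes: int = 8) -> float:
    """Peak bus bandwidth in MB/s.

    transfer_rate_mhz: million transfers per second (the 'MHz' quoted for the bus).
    bus_width_bytes:   assumed data path width; 8 bytes = 64 bits (an assumption
                       consistent with 800 MB/s at 100 MHz).
    """
    return transfer_rate_mhz * bus_width_bytes


gtl_100 = peak_bandwidth_mb_s(100)        # GTL+ at 100 MHz
gtl_133 = peak_bandwidth_mb_s(400 / 3)    # GTL+ at 133.33 MHz
ev6_200 = peak_bandwidth_mb_s(200)        # K7's EV6 at 200 MHz

for name, value in [("GTL+ 100", gtl_100), ("GTL+ 133", gtl_133), ("EV6 200", ev6_200)]:
    print(f"{name}: {value:.0f} MB/s")
```

Under that assumption only the 200 MHz EV6 figure reaches the 1.6 GB/s that the new memory types can deliver.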
L1 Cache
K7 will have no less than 128 kB of L1 cache, split into 64 kB data and 64 kB instruction cache. Pentium II is currently equipped with a quarter of that, and it's rumored that Katmai may have at least 2x32 kB and thus half the L1 cache size of K7. The large L1 cache is one of the requirements for very high CPU clock speeds, and K7 was specially designed to reach those very high clock speeds.
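For those who like to see the L1 comparison spelled out, here is a small sketch. The sizes are the ones given above (the Katmai figure is only the rumored one); the constant and function names are my own.

```python
K7_L1_KB = 64 + 64                   # 64 kB data + 64 kB instruction
PENTIUM_II_L1_KB = K7_L1_KB // 4     # "a quarter of that"
KATMAI_RUMORED_L1_KB = 2 * 32        # rumored 2x32 kB, not confirmed


def l1_ratio(other_kb: int, k7_kb: int = K7_L1_KB) -> float:
    """Return another CPU's L1 size as a fraction of K7's L1 size."""
    return other_kb / k7_kb


print(f"Pentium II: {PENTIUM_II_L1_KB} kB, {l1_ratio(PENTIUM_II_L1_KB):.2f} of K7")
print(f"Katmai (rumored): {KATMAI_RUMORED_L1_KB} kB, {l1_ratio(KATMAI_RUMORED_L1_KB):.2f} of K7")
```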
L2 Cache
K7 will come with a backside L2 cache, as known from Intel's P6 architecture. AMD will be pretty flexible with this L2 cache. The K7 CPU has an internal tag RAM large enough for 512 kB of L2 cache, but AMD is also planning K7 versions with no less than 2 MB and up to 8 MB, using an additional external tag RAM as Intel does in the case of the P6 CPUs. The L2 cache speed will range from 1/3 of the CPU speed to full CPU speed, and it's planned to use 'normal' as well as double data rate (DDR) SRAMs for this L2 cache. The flexible L2 cache design will enable AMD to do the same as Intel does: there will be mainstream, workstation and server versions of K7, determined by the L2 cache size and speed. The K7 will have an address space of 64 GB, like Intel's Deschutes core, and Slot A will be limited to 4 GB of addressable space, as is the case with Slot 1. The cacheable limit of K7 will also be the full address space of 64 GB.
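To put the numbers above into perspective, the following sketch converts the address-space sizes into address bits and works out the two ends of the L2 clock range. The 500 MHz core clock is just an example value, and the helper names are hypothetical.

```python
GB = 2**30  # bytes per gigabyte, binary convention


def address_bits(addressable_bytes: int) -> int:
    """Number of address bits needed to reach every byte of the given space."""
    return (addressable_bytes - 1).bit_length()


def l2_clock_mhz(core_mhz: float, divider: int) -> float:
    """L2 cache clock for a given core clock and divider (1 = full speed)."""
    return core_mhz / divider


print("K7 address space, 64 GB:", address_bits(64 * GB), "bits")
print("Slot A limit, 4 GB:     ", address_bits(4 * GB), "bits")

core = 500.0  # example core clock in MHz, an assumption for illustration
print(f"L2 at full speed: {l2_clock_mhz(core, 1):.1f} MHz")
print(f"L2 at 1/3 speed:  {l2_clock_mhz(core, 3):.1f} MHz")
```

In other words, 64 GB corresponds to 36 address bits and 4 GB to 32 address bits.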
Clock Speeds
Dirk Meyer, the chief engineer of AMD's K7, is an ex-Alpha guy, so it shouldn't surprise any of us that K7 was designed with very high clock speeds in mind. K7 is already running at 500 MHz, and by the time K7 launches in 1H99 we should expect clock speeds way beyond that. K7 has very deep buffers to enable those high clock speeds, keeping up to 72 x86 instructions in flight.
The Floating Point Unit
Hasn't Intel been teaching us for years how important the FPU is? Well, it's looking pretty obvious that K7 will smoke Intel's P6 FPU. K7 offers no less than 3 (three!) out-of-order, fully parallel FPU pipelines. The good old disadvantage of the non-Intel CPUs in terms of FPU performance will be history with K7. The upcoming seventh-generation AMD processor will run CAD or rendering software faster than the Intel CPUs. That is almost a revolution.
The K7 Integer Micro-Architecture
I guess that a full discussion of AMD's new features in K7 would lead too far for most of you, but let me still name a few. Three parallel x86 instruction decoders, which translate the x86 instructions into fixed-length 'Macro-Ops', feed the K7's 72-entry instruction control unit. Each of those 'Macro-Ops' can consist of one or two operations. There are two different decoding pipelines that do this job: the 'direct path', which decodes common instructions very quickly, and the 'vector path', which looks up complex x86 instructions in the 'Macro Code ROM' or 'MROM'. The instruction control unit issues the Macro-Ops to either the integer scheduler or the FPU/multimedia unit. The integer scheduler can hold up to 15 Macro-Op entries, representing up to 30 operations at a time. Its job is to distribute up to three independent operations to the three parallel integer execution units, each of them accompanied by an address generation unit. The address generation units are responsible for making load/store operations as efficient as possible by optimizing the utilization of the L1 data cache and the L2 cache as well as main memory reads/writes.
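The capacity figures of the integer scheduler can be illustrated with a toy model. This is only a sketch built from the numbers above (15 entries, one or two operations per Macro-Op, three integer units). It is not AMD's actual scheduling logic: it ignores dependencies and out-of-order selection and simply issues the oldest operations first. All class and function names are made up for the illustration.

```python
from collections import deque
from dataclasses import dataclass

SCHEDULER_ENTRIES = 15     # Macro-Op entries in the integer scheduler
MAX_OPS_PER_MACRO_OP = 2   # each Macro-Op holds one or two operations
INTEGER_UNITS = 3          # parallel integer execution units


@dataclass
class MacroOp:
    name: str
    ops: int  # number of operations still waiting to be issued

    def __post_init__(self) -> None:
        if not 1 <= self.ops <= MAX_OPS_PER_MACRO_OP:
            raise ValueError("a Macro-Op consists of one or two operations")


class ToyIntegerScheduler:
    """Simplified in-order model; the real unit issues independent ops out of order."""

    def __init__(self) -> None:
        self.entries = deque()

    def accept(self, macro_op: MacroOp) -> bool:
        """Queue a Macro-Op; return False if all 15 entries are occupied."""
        if len(self.entries) >= SCHEDULER_ENTRIES:
            return False
        self.entries.append(macro_op)
        return True

    def pending_ops(self) -> int:
        return sum(m.ops for m in self.entries)

    def dispatch_cycle(self) -> int:
        """Issue up to three operations in one cycle; return how many were issued."""
        issued = 0
        while self.entries and issued < INTEGER_UNITS:
            head = self.entries[0]
            take = min(head.ops, INTEGER_UNITS - issued)
            head.ops -= take
            issued += take
            if head.ops == 0:
                self.entries.popleft()
        return issued


def cycles_to_drain(scheduler: ToyIntegerScheduler) -> int:
    """Count the cycles needed to issue everything currently queued."""
    cycles = 0
    while scheduler.pending_ops() > 0:
        scheduler.dispatch_cycle()
        cycles += 1
    return cycles
```

Filling this model with 15 two-operation Macro-Ops gives the 30 pending operations mentioned above, and at three operations per cycle it takes ten cycles to drain them in this simplified picture.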