Pentium 4 FPU (long and technical)

The following is informed speculation on the Pentium 4's FPU performance, based on publicly available developer documentation provided by Intel.

I've just read Tom's latest benchmarks and comments on the Intel Pentium 4's FPU - the results are disappointing but, in some ways, not entirely surprising. Let me explain:-

I work as a developer for the music software company FXpansion Audio; in our corner of the industry, the FPU is the _most_ important factor determining software performance, and our code tends to contain a lot of hand-optimised x87 FPU assembler.

When writing this code, we of course target the P6 core; most of the time, the Athlon's characteristics are similar enough that, with its better overall performance, it can equal the P6 core clock for clock even on P6-targeted code.

However, the P4 is a very different kettle of fish, and here's why. Floating-point code mostly uses the FADD/FSUB and FMUL instructions; others, like FDIV, are used as little as possible, as they're slow anyway. It's FADD and FMUL performance that's the key to fast floating point.

Now, these instructions are quite complex, and it takes several clock cycles before the result of a calculation pops out the other end of the FPU pipeline ready to be used. For example,
a = a + (b * c)
The processor can't do the "a = a +" part - an FADD - until it knows the outcome of (b * c) - an FMUL. The time it takes for this outcome to become available is called latency, and is measured in clock cycles. However, while this is happening, the processor _can_ do _other_ calculations which don't need that result (like "e = (f * g)") - that's what pipelining is "all about".

For a PPro/2/3 (P6-core), the FMUL latency is 5, and the FADD latency is 3. So, a typical piece of P6-code might be

temp = b * c // FMUL, latency 5
(4 clocks of other, independent work)
a = a + temp // FADD, latency 3
(2 clocks of other, independent work)
store (a)
= 9 clocks
A good programmer will be able to get the most out of the processor by getting it to do other useful things while waiting for the result of (b*c) and (a = a + temp).

Now, let's take a look at the P4. Its longer pipeline, made up of more, simpler stages, allows much higher clock speeds than a P6-core. But, because each stage does less, the instructions need more stages to get through the pipe, hence longer latency. In the P4 FPU, FMUL has a latency of 7, and FADD has a latency of 5, versus 5 and 3 on the PIII.

So here's the same code, executing on the P4:-
temp = b * c // FMUL, latency 7
(4 clocks of other, independent work)
***** 2 clocks stalled - waiting for the FMUL result
a = a + temp // FADD, latency 5
(2 clocks of other, independent work)
***** 2 clocks stalled - waiting for the FADD result
store (a)
= 13 clocks
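
To make the clock-counting concrete, here's a toy model in C (my own illustration, not anything from Intel's docs) of why the dependent chain costs the FMUL latency plus the FADD latency plus one clock for the store:

```c
/* Toy latency model of the dependent chain above: the FADD can't
   start until the FMUL's result appears, and the store can't happen
   until the FADD's result appears, so the minimum time is the two
   latencies plus one clock for the store. */
int chain_clocks(int fmul_latency, int fadd_latency) {
    return fmul_latency + fadd_latency + 1;
}
```

Plugging in 5 and 3 for the P6 gives 9 clocks; 7 and 5 for the P4 gives 13.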

So, the first thing to note is that the code takes longer to execute. Normally, this isn't as big a problem as you'd think, since you can "fill in the blanks" with other useful work - the =overall throughput=, in terms of operations per second, is just as high, because the chip can still execute one instruction every clock (a gross oversimplification, but it illustrates the point) - but look at those rows marked *****.

The point is, in this piece of P6 code, the programmer has arranged 4 clocks' worth of work to do while the FMUL result is being calculated, and 2 clocks' worth for the FADD. But the P4 needs longer than this, so it sits _idle_, or stalled, for two clocks at the end of the FMUL, and two clocks at the end of the FADD.

Conclusion: on tightly optimised P6-core code, targeted at Pentium Pro/2/3 chips - as most high performance FPU code is - the FPU core of P4 will run up to 50% slower than the equivalent P3, clock for clock!

The less tightly targeted the code is (and hence, the more leeway it gives the processor's own scheduler), the less severe the problem. But for workloads like MPEG4 encoding, where every instruction counts, the P4 simply stinks at running P3 code.

The good news for Intel is that =most= code can be rewritten in such a way as to minimize this performance hit. With a bit of care, the performance difference versus a P3, clock for clock, can be reduced to under 10% in many cases, which will be enough to save the P4's bacon once its other advantages are taken into account.
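
For the curious, the usual trick - sketched here in plain C, since our real inner loops are x87 assembler, and these names are just for illustration - is to interleave two independent dependency chains, so the processor always has non-dependent work with which to fill the longer P4 latencies:

```c
/* Sketch of the general fix: two independent accumulator chains are
   interleaved, so while one FMUL+FADD chain is waiting out its
   latency, the processor has the other chain's work to do. The two
   accumulators are only combined once, at the very end. */
void mac_interleaved(float *out, const float *b, const float *c, int n) {
    float acc0 = 0.0f, acc1 = 0.0f;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += b[i]     * c[i];     /* chain 0 */
        acc1 += b[i + 1] * c[i + 1]; /* chain 1 - independent of chain 0 */
    }
    if (i < n)
        acc0 += b[i] * c[i];         /* odd element left over */
    *out = acc0 + acc1;              /* combine once, at the end */
}
```

The compiler (or the hand-coder) is free to schedule chain 1's FMUL into the clocks where chain 0 would otherwise stall.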

The question, however, is - will anyone bother? Code specially targeted to the Athlon, as opposed to the usual P6-code which just happens to run OK, can be up to 50% faster than =either= Intel processor, clock for clock. If Athlons continue to sell well and P4 doesn't, well, you can see which way it will go.

I, for one, hope Intel wins this one. For two reasons: first, the P4's memory subsystem kicks huge quantities of butt, and the overall design has the headroom to hit huge clock speeds. And second, because an uncompetitive Intel churning out P4s slowed to a crawl by lack of code support would be as bad news for us performance freaks as an uncompetitive AMD was in the days before the Athlon. Fast P4s = faster/cheaper Athlons = happy users!

  1. I was actually able to follow that (as much as the "highly technical" warning scared me:)
    I think I now even understand the SPARC assembler (which they make us study at University) better, oddly enough, and why I can and should put some instructions after others:)

    Technical side aside, I strongly believe in the last thing you said -- the more competitive BOTH CPU companies are, the better off we consumers are:)

    Silly question: the improvements in new processors - and here I'm talking big picture - seem to come from a) simple Hz increases and b) actual changes in architecture.

    Let's assume, theoretically, that we don't have heating problems, and please help me understand why any given processor couldn't be sped up simply by upping the Hz. For example, a 1GHz 286 would obviously be slower than a PIII at 1GHz, since the PIII can do MORE during each clock cycle, is better designed, etc... but shouldn't a 286's speed also be increasable, practically indefinitely, by just upping the clock itself (again, ignoring the heat issues)?
  2. AngusH, thanks for explaining this whole thing.

    I think everybody can see now that if you want to buy a CPU today, you should buy a T-Bird.
    And if you want to buy a CPU next year, you'll have to look at Tom's site to see which company makes the best CPU at that time.
  3. zwaarst - damn right. But let's hope enough (ignorant?) people buy P4s to encourage Intel to make faster, better-designed chips, and to encourage developers such as myself to support the processor.

    Kodiak - a number of reasons. I'll cover a few of the more obvious ones.

    First - process technology. This is as big an issue as heat - even with supercooling, all processors have their limits, but a .18micron implementation will hit higher clocks than a .25micron. But, let's write that off and talk about an imaginary .18micron 286 design. That'd go fast, right? Well, no.

    The second issue is cache. Back in the 286 days, the speed limits of DRAM and of motherboard circuitry weren't such a problem - the RAM could run synchronously with the processor clock, or at least at half the speed, and chips with high transistor counts were expensive or impossible to make. So, the 286 has no L1 or L2 cache - it simply didn't need it! And its system bus clock was identical to the CPU clock, at 12MHz or so (indeed, all processors until the 486DX2 ran like this). The problem is, above 33MHz, designing motherboards to support that gets difficult (the first 50MHz 486DX boards were notoriously unreliable, which is why Intel made the 486/66 run on a 33MHz bus, hence the DX2), and even with modern techniques 7 or 8 years on, they've still hit a limit of 100-133MHz for the main motherboard circuitry (the FSB, and the lines used by Rambus memory, are a special case).

    So, the system bus of the 286 couldn't work at high speeds, and even if it did, the processor would spend the whole time waiting for the RAM.

    Finally, there's the architecture - because of the way the 286 is put together, signals simply can't travel through the silicon fabric fast enough to allow it to run at more than 25MHz or so. To get any performance out of speed increases beyond 33MHz (and not just wait idle for tens of clocks for electrons to creep slowly through the silicon), you need pipelining - hence the 486, Pentium and all subsequent designs are pipelined, with each generation having a deeper pipe. To get much past 1GHz (with current technology), you need a very long pipeline - hence the Pentium 4.

  4. Hehe, funny that you say that you hope people are willing to buy P4's to support Intel.

    Back in the K6 and K6-2 days, I was trying to drum up support for AMD because I felt like 2nd best tries harder. I knew those processors weren't as good as Intel's, but I thought that AMD deserved a chance. And fortunately, enough people thought like me that AMD's fortunes have reversed. So I can understand and respect your point, although I find it funny that Intel would ever need grassroots support :)
  5. I have to agree with AngusH that good competition is good for the customers. But if the P4 becomes a success with the majority of people (ignorant people), I'm afraid AMD will also start making CPUs that suck, just to get the MHz up.
    And I think that's not what we want.
    Let's hope both companies become the same size and fight each other with high-performance CPUs and low prices (the ideal situation).
    Here in the Netherlands I'm still going to pay 20% tax, but the CPUs will become a little bit cheaper.

    Sorry for my bad English (never been good at it).
  6. I think the success of the P4 is already under way; "ignorant people" often (always?) mix up MHz with actual performance. I know, because I was like that before, and it took time to make me understand that a DX4-100 was not better than a P-75. If the P4's clock speed is generally higher than the Athlon's, many people are going to buy the Pentium 4 without hesitation.
  7. Even though I haven't had time yet to read the P4 papers - I have read all the older P I, II, III docs - everything you (AngusH) wrote seems correct. My only problem is that I have been debugging tons of code, and I have almost never met code that was that optimized. Even considering the bad old Intel MMX doc telling us the old golden rule that only 10% of the code needs to be optimized, because it runs 90% of the time. Nowadays when I look at code, it mostly doesn't contain any hand-coded asm, rather old bummer-style code like mov eax,0 mov edx,0 or fdivp st1 fdivp st1 fdivp st1. (Ohhh... even I haven't coded asm in a while.) And I haven't even mentioned the MS Visual crap "hand-optimized" library functions, which really made me sick. (M$ coders seem to be plain stupid, or sadistic at best.)
    And then we have several algorithms that can't really be parallelized. Anyway, if the software architects get the job done, then the P4 will really shine, but based upon the code I see, this is highly unlikely...
  8. I for one really don't like the P4 yet. But here's the deal.

    The P4 can best be described as a high-latency CPU, meaning it takes longer for an instruction to enter and leave the CPU. For most simple code, though, this is not a problem, as the data flow is very linear.

    The example above, a = a + (b * c), could be described as a kind of feedback code. This kind of situation is very unfriendly to the P4, and will continue to be unfriendly to any CPU that increases its pipeline stages. Luckily for most users, this kind of code is not typical; it does not exist in games like Quake 3 or your average application software. Code like this is usually only found in the most complex scientific operations, and for those things the P4 will suck - and chances are the P5 will suck even more.

    What the P4 is telling us is that the short period when consumer CPUs were the same chips that could power workstations is coming to an end. Consumer chips are going to be high-MHz parts, while workstation chips go the low-latency/low-MHz route. Funny how Itanium seems to fit into this quite well.

    AMD might have to follow the same route, I'm sorry to say.

    Summed up: the P4 will run most simple code very effectively once the code is optimized for it. Examples of simple code would be Quake 3, net/file servers, video, internet - any code that works on large amounts of data with a small amount of code and doesn't involve the feedback kind of problem. The kind of code we cannot expect the P4 to run effectively is anything where the feedback problem comes into effect - examples would be engineering software, physics simulations, stress simulation, or anything along those lines.
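
    A concrete example of that feedback pattern, sketched in C (names are just for illustration): a one-pole recursive (IIR) filter, the kind of thing audio and DSP code does all day long.

```c
/* One-pole recursive (IIR) filter - the classic "feedback" loop:
   y[i] = a*x[i] + b*y[i-1]. Each iteration's FMUL+FADD chain needs
   the previous iteration's result, so there is no independent work
   with which to hide the FPU latency. */
void onepole(float *y, const float *x, int n, float a, float b) {
    float prev = 0.0f;
    for (int i = 0; i < n; i++) {
        prev = a * x[i] + b * prev; /* depends on the previous prev */
        y[i] = prev;
    }
}
```

    With a = b = 1 this just computes running sums, but the dependency structure is the same as any real filter: strictly serial.
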
  9. >please help me understand why any given processor couldn't be speeded up simply by upping the Hz?

    You can speed up the processor as you say. But since the processor is connected to other things, they would have to speed up in proportion. That has been impractical at every stage where a new x86 processor has been introduced. In particular, memory speed has not kept pace. To cope with that, x86 designs have incorporated many things. Today's processors essentially operate from a small amount of memory on the chip itself - a cache - which is kept loaded (hopefully) with the stuff the processor needs to do its work.
    A given "chip process" is capable of producing transistors that will switch on and off at some maximum rate, beyond which the transistors no longer fully switch on or off. More than that, there must be some period of time when the transistor is stably in the on or off state; while it's in that stable state, info can pass from one stage within the processor to the next. To "do" one CPU instruction, in reality many basic things must be done. Passing from one stage to another reliably requires one stage to be unchanging for some period of time while the next stage is accepting the data. It is possible to run the data through all the stages in one step (one clock), but every stage requires a propagation delay, so the clock period must be long enough to get through all of them. This is undesirable if you want to achieve the maximum clock rate. If you instead pass data from one stage to the next on each clock, you achieve the maximum clock rate - you get the highest clock rate by doing as little as possible in one step! Then you "pipeline" the stages, so that you can start the next instruction in just after the previous one has gone on to the next stage, rather than waiting for the whole thing to get through every stage.
    The memory speed is nominally 133MHz currently, but that is the clock of the memory; actually getting data out of it requires, on average, more than one clock.
    Let's skip over that. We might have a 133MHz 286. The 286 used 16-bit memory (I think), while today's memory is 64 bits wide. With a bit of interfacing, the 286 could effectively operate at 533MHz with no cache. Impressive? I think the 286 used 3 clocks for a short instruction, and 7 clocks for an instruction that accessed memory. Current CPUs can do 2 instructions per clock even when accessing memory (since they really only access the on-chip cache). They also operate on 32-bit data, while the 286 used 16 bits. Discounting the clock rate, today's CPUs are about 12 to 28 times faster. So do programs run 2800 times faster? Hmmm... it's hard to say, because very few people spend any time running programs. We spend almost all our time running the GUI (Windows).
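
    Working those assumed numbers through (just my arithmetic, not a benchmark):

```c
/* Back-of-envelope check of the "12 to 28 times" figure, using the
   numbers above as assumptions: a 286 takes 3 (register op) to 7
   (memory op) clocks per instruction on 16-bit data; a current CPU
   does 2 instructions per clock on 32-bit data. Per clock, that's
   2 instructions * clocks-per-286-instruction * (32/16) data width. */
int speedup_per_clock(int clocks_per_inst_on_286) {
    int insts_per_clock_now = 2;
    int width_ratio = 32 / 16;
    return insts_per_clock_now * clocks_per_inst_on_286 * width_ratio;
}
```

    That gives 2 * 3 * 2 = 12 at the low end and 2 * 7 * 2 = 28 at the high end.
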

    A word about heat. I was once under the impression that overclockers used heavy-duty heat removal to protect their chips while they abuse them, but this is not so. Today's CPUs can take a lot of heat - my impression is that they can get hotter than a 286 could without damage. The problem is that CPUs operate differently, and more slowly, as they heat up. Hot transistors switch slower than cold ones; there is some physical law which specifies this, I think. Reduce the temperature by 100 degrees and the transistor switches state much faster.
    I stopped the fan on my CPU while it was at its nominal speed of 800MHz, and it kept running prime95 without a glitch until I chickened out at 155F. At 970MHz, prime95 and Win98 locked up at 113F.
  10. Yes, they can take more heat than they did in the 286 days. The day may come when chips are designed to run at very hot temperatures and may not function at low temps anymore, thus requiring a warm-up time :) Just speculation, of course, but if process technology doesn't keep up with keeping them cool, this is the way we're heading.
  11. I don't think Tom has fully concluded that the P4 sucks, as he said he will post more benchmarks later. The P4 has huge potential, but it will take time for software to be optimized for SSE2, so Tom said it is not advisable to get a P4 NOW (considering the price) until SSE2 gets more support.
  12. I think the P4 does suck. You need a new motherboard, a special power supply, and a special case for it! Easy upgrade! And you must recompile your software, because it is not P4-optimised. If you have to change _EVERYTHING_ in your computer (and maybe all your habits too) each time you upgrade, you'd better buy a Mac! I heard it works better than a PC (-:
    -= Sorry about my bad English.
  13. OK, first off I'd like to say how impressed I am by the maturity and usefulness of all your posts!

    Anyway, on the subject of the sheep going out and buying the P4 mainly for its name and its MHz figure: I think there's more to it than just the consumers being uneducated. It's also the sellers - the guys trying to sell that P4 on commission over at the computer chain store. Realistically, most of those salespeople don't know jack. They know all the words, but they don't know what they mean or any of the technical stuff behind the words and phrases. I mean, how hard is it to convince a small child that there's a guy at the North Pole who gives out presents on December 24th? Not hard - and why? Because they don't know any better. It's the same with consumers, and when you have commission-driven salespeople, it's not surprising that the P4 will sell just fine.

    About the 286 being able to run at any given speed, heat not being a factor: since the chip has no cache (L1 or L2), it can't do the buffering required for a CPU multiplier of more than 1. For example, take a P3 500 with a 100MHz bus - the CPU multiplier is 5. Why doesn't the 100MHz bus on the motherboard hinder the overall performance of the computer? Because the fast cache can buffer lots of info. There seems to be a point where a CPU starts to be held back by high multipliers; on the Athlons it seems to be anything above 12.5. Once you hit that point, speeding up the chip doesn't increase system speed in a linear fashion - so even if you could get a 286 to 1GHz or so, it might only run as fast as a Pentium Classic or a 486. If you don't believe me, consider this: would it be harder to overclock a CPU by increasing the bus speed or the multiplier? Raising the bus speed means everything else has to run faster too, so raising the multiplier would be much easier. So why don't motherboards come with multipliers that go up to 20 or so? Because even if you could get the chip to that speed, it would be slowed down by its relatively small amount of cache.

    So, unfortunately, a 286 won't ever be able to play Quake, even though it'd be beyond cool to see one do that!
  14. AFAIK the Athlon is particularly designed to withstand high temperatures - as much as 90 deg C whereas the P3 should be kept below 60 degrees. In general it is best to keep temperatures down as this improves stability and allows higher clock speeds. Most very high frequency solid state equipment uses supercooling to function and the same rules apply to chips. The fastest standard T'Bird I ever heard about ran at 1.5GHz but needed liquid nitrogen to keep the core temp below -100. This is why thermoelectric (Peltier) coolers are also very popular among overclockers as they provide an extreme cooling solution with the advantage of a small size and no moving parts.

    I think we will see these technologies being used in desktop PCs in the next few years as clock speeds and CPU power requirements increase further. Just consider how modern CPUs require half a pound of copper and high-speed fans just to keep cool, when it wasn't that many years ago that chips didn't even have heatsinks. I will be very interested to see how hot these things are when they reach speeds of 2 or 3GHz, even with a die shrink.
  15. On the question of whether a 1GHz 286 would be almost as fast as a newer chip:
    A lot of the replies touch on why a 286 wouldn't be as fast as a current processor, most of them dealing with the fact that the rest of the system really wouldn't be able to keep up. Architectural differences in newer processors help prevent the slowness of the system from bottlenecking the processor, so even in an ideal world of no heat, it wouldn't quite be as fast as a modern chip. However, it would run Windows 3.0 blazingly fast.

    On the more technical side of things:
    Heat - where it comes from / what it does:
    In electronics, heat is generated, in a sense, by the friction of electrons moving through the circuits. Like all friction, the more and faster you move, the more heat is generated. This means that a 5V processor cruising along at some hellacious speed would generate a lot of heat, and this is why newer processors are moving to 2.5 volts and less. Smaller traces (.13 micron vs .18 micron) within the core also mean there is less to bump up against, thus less heat production per trace - but you tend to get more traces in the same space. Silicon is also a semiconductor: its resistance increases as heat increases, thus requiring more power to keep switching between high (say 5V) and low (0V) signals. This in turn generates more heat, so the problem gets worse. Damaging a processor by heat happens when you get to the point where you are melting the silicon, and possibly altering traces.

    Overclocking - why you can normally do it without damage:
    Heat is what stops your overclocking success. Some processors have thermal diodes which disable the chip if the heat gets too high, so they actually have protection circuits to save the chip. As you may have seen before, many processors are built on a single silicon die design and then marked for speed - so it is possible that your 800MHz chip was actually built to be a 1GHz chip; it just didn't pass all the tests at 1GHz (but did at 800MHz), and got marked slower. The final part is that the resistance of the chip affects what the signal looks like. In the digital world, ideally you have a square waveform in the electrical signal (you only get low and high levels, nothing in between). As resistance increases (due to heat, usually), that ideal signal becomes a bit rounded, and may get to the point where it never quite reaches the high level (and possibly the low level). Now, instead of nice clean high and low signals, you have in-between levels, and the next piece of the processor, which expects a high or a low, gets an 'I don't know' - and thus stops working. This locks up the processor without actually physically rearranging its guts in the process.

    Pipelining - more efficient processing:
    Pipelining is a way of breaking what appears to be a simple instruction into several even simpler steps, to lower the apparent time needed to run an instruction. An 'ADD' is a simple instruction: take two numbers and put them together. However, the processor needs to fetch the instruction, read both numbers, add them together, return the result, then fetch the next instruction. In this sense, the processor needs to do five things per instruction, and thus per clock cycle. Now, what if you broke this simple instruction up into its five pieces, each of which can be run faster than the whole? Now it takes 5 clock cycles, each of which can be 1/5th the length, to do the same instruction. This doesn't seem like much of a saving until you realize that you can stack instructions into the pipe, so that while your first 'ADD' is reading its numbers, your second 'ADD' can be fetched from memory, and so on into the pipe. What this means is that when you start, your first 'ADD' takes a full five clocks to complete, but your second 'ADD' completes one clock cycle later, in what appears to be 1/5th the time. Thus, in a sense, you can crank the processor up to 5 times the speed. Deeper pipelines can get better results, as long as you can find shorter, simpler pieces to break each instruction into. However, if you use a branch or jump command (you changed what the next instruction should have been), the processor finds out only after that instruction is done - so your pipeline is filled with a bunch of instructions that should never have been run. This requires a pipeline flush, and you lose the benefit of the pipelined design. Luckily, most programs operate fairly linearly, so the rewards are worth the occasional flush.
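
    The timing described above can be written out as a tiny model (illustrative only, assuming one instruction issued per clock and no flushes):

```c
/* Toy pipeline timing: with a depth-d pipeline, the first instruction
   takes d clocks, and each subsequent independent instruction
   completes one clock after the one before it. */
int pipelined_clocks(int depth, int n) {
    return depth + (n - 1);
}

/* Without pipelining, each instruction occupies the machine for the
   full d clocks before the next can even start. */
int unpipelined_clocks(int depth, int n) {
    return depth * n;
}
```

    For a long enough run of instructions, the pipelined machine approaches one instruction per (much shorter) clock - which is the whole point.
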

    Logic Units - the processor is more complex:
    Now, the article talked about different-length instructions - say an ADD takes one cycle, while a MUL(tiply) takes 5. This works because an ADD command runs on one Logic Unit (LU), while the more complex MUL runs on another LU within the same processor. The simple form of this is the standard test of integer speed vs. FPU (floating point) speed. In reality, most modern processors have multiple ALUs (Arithmetic Logic Units - integer) and FPUs (Floating Point Units - decimals), so the concept of instructions per clock cycle becomes a major challenge to describe. These parallel LUs are also used for out-of-order execution, branch prediction, and other complex speed-enhancing (hopefully) techniques, and they differ a little from processor to processor. So for MPEG encoding, a good programmer will try to make use of the fact that the MUL takes 5 cycles, and use the other LUs to do work while waiting for the result. Thus, you optimize your code to feed instructions into the processor efficiently, so that it never has to wait for something to finish.

    Why Change the design:
    Hopefully you have thought enough about this to wonder why Intel would be deepening its pipeline. Obviously (I hope this is obvious to you), code not optimized for the deeper pipeline will not run as fast, as the processor will be waiting around for its guts to finish something else before reading the next line of code. In truth, with higher clock speeds generating more heat and requiring more complex timing (synchronizing all those different LUs and doing branch prediction can be a mess), simplifying the 'pieces' of instructions, and thus building deeper pipelines, is a good solution. What Intel is building is a more 'parallel' design of a single processor - in a sense, making an SMP (symmetric multi-processing) machine out of your single-processor machine, without picking up some of the other headaches of dealing with SMP. Back in the days of the 286 you had one simple little processor - now you have this whoppingly complex processor that takes up approximately the same amount of space and runs ridiculously faster. By increasing the pipeline depth, you are tying your maximum clock speed to the shortest piece of an instruction that you can break your large instructions into. Even in RISC, those simple instructions can have multiple pieces. The pipeline is the 'divide and conquer' theory of computing. Besides, if you look at history, every major jump in processing power has also included the addition of new instructions - MMX, SIMD, 3DNow!. These require new compilers and new optimizations in software (thus new programs) to really show off what they can do. The only difference with the Pentium 4 release is that it may well be the first to run older code slower - but in the end, it should run the next generation faster.

    Final comment:
    Intel's P4 isn't the only chip in this boat of changing architecture and trading off processing speed vs. clock speed. With the advent of any new processing core - be it from Intel / AMD / Alpha / Motorola / etc. - some engineer is trading the speed of one type of code for another. The hope is that as newer software pops out, your design will win in the end. Intel is betting on it with the P4 and IA-64. AMD will be betting on the change with x86-64. Motorola made the switch going to the PowerPC line (Mac users ended up with emulators for the 68000 line, and had to buy new software to see the speed). Intel is just getting really good at shoving their more controversial new designs out into the public, where they can take some flak for them. Hopefully the controversy / competition will, in the end, just mean bigger, better toys for the rest of us.
  16. ... or you could just get a PPC or Power CPU that does the floating multiply-add in 1 clock and can be pipelined. ;)
  17. Same reason a Pentium 66 is faster than a 486 DX4-100.

    What do you think? :wink:
  18. Angus, is it my imagination, or did you copy/paste your post from somewhere else? Other than your blurb at the top, I did read that somewhere else. GG

    I will find out if our NDA covers apps developed by other companies. I will post a list of applications currently being coded for SSE2 if it doesn't violate the NDA.

    With a list of SSE2 apps, the "ignorant people" will realize that most companies are about to release SSE2-optimized versions. Let's bench SSE2-optimized code on AMD. HAHAHAHAHA - maybe next year.