Before I start discussing the execution pipeline of Athlon, I'd like to take the time to explain the idea of a pipeline in simple words. A good comparison for a processor pipeline is a car manufacturing plant, of which you find so many in Detroit or around the corner from my home town, in Sindelfingen, where the main Mercedes plant is located. A processor without a pipeline is like a car plant where only one person or team builds a car at a time. This person or team starts building the car and won't touch any other car before this one is finished. Building the car takes many different steps, but it's always the same person or team that carries them out. The effect is that each car takes a long while to build, but if a mistake is made somewhere in the building process, only one car is affected.
Pipelining is what you find in any modern car manufacturing plant. From the moment the first piece of metal is pressed and welded until the car drives away from the final check, there are a lot of different stages involved, each handled by a different person or team. The big advantage is that as soon as the first stage of a car is finished, the car moves on to the next stage, freeing up stage one to start on a new car. This way, the rate at which cars are finished is a lot higher, and new cars can be started at the same high rate. The only nasty problem arises if something goes wrong at some stage of the manufacturing process: the production line stalls, and no more cars are produced until the problem is fixed.
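To put rough numbers on the car plant comparison, here is a minimal Python sketch. It assumes an idealized line in which every stage takes exactly one time step and nothing ever stalls; the function names and the figure of 10 stages are purely illustrative and not taken from any real plant or processor.

```python
def steps_unpipelined(cars: int, stages: int) -> int:
    """One team does every stage itself; the next car starts only
    after the current one is completely finished."""
    return cars * stages


def steps_pipelined(cars: int, stages: int) -> int:
    """The first car needs `stages` steps to travel down the line.
    After that, one finished car leaves the line per step
    (idealized: no stalls, all stages equally long)."""
    if cars == 0:
        return 0
    return stages + (cars - 1)


if __name__ == "__main__":
    stages = 10  # hypothetical number of assembly stages
    for cars in (1, 10, 1000):
        slow = steps_unpipelined(cars, stages)
        fast = steps_pipelined(cars, stages)
        print(f"{cars} cars: {slow} steps without pipeline, "
              f"{fast} steps with pipeline, speedup {slow / fast:.2f}x")
```

A single car gains nothing from the pipeline, since it still has to pass through every stage. The benefit only shows up once the line is full, where the speedup approaches the number of stages.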
This analogy should give you an idea of how processor pipelines work and why the length of a pipeline matters. A short pipeline, meaning a pipeline with only a few stages, has to do a lot of work in each stage. Each stage therefore takes quite a long while, so the pipeline cannot be fed at very high speed. It also won't buy you much to put several short pipelines in parallel, because the execution times of short pipelines can differ a lot, creating a mess when the results have to be put back in order. Thus you want a longer pipeline for high clock rates and for good parallelism. However, if a branch prediction turns out to be wrong in a long pipeline, all following OPs that depend on this wrong prediction have to be flushed out of the pipeline, and the pipeline has to be filled again, which wastes a lot of time. To make it as simple as possible: You want a long pipeline for high speeds, but don't let it get too long or you get a horrible penalty for wrongly predicted branches.
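The trade-off between clock speed and misprediction penalty can be illustrated with a toy model in Python. Everything in it is an assumption made for illustration: the total logic delay is split evenly across the stages, each stage adds a fixed latch overhead, and a wrongly predicted branch costs about as many cycles as the pipeline has stages. None of the numbers are Athlon or Pentium III figures.

```python
def time_per_instruction(depth,
                         logic_delay_ns=8.0,     # hypothetical total work per instruction
                         latch_overhead_ns=1.0,  # hypothetical fixed cost per stage
                         branch_fraction=0.2,    # hypothetical share of branches
                         mispredict_rate=0.1):   # hypothetical share of wrong predictions
    """Average time per instruction in ns for a pipeline with `depth` stages."""
    # More stages means less work per stage, so the clock can run faster.
    cycle_ns = logic_delay_ns / depth + latch_overhead_ns
    # Assumption: a mispredicted branch costs roughly one full pipeline refill.
    flush_cycles = depth
    cycles_per_instruction = 1.0 + branch_fraction * mispredict_rate * flush_cycles
    return cycle_ns * cycles_per_instruction


def best_depth(max_depth=40, **model_params):
    """Depth between 1 and max_depth with the lowest time per instruction."""
    return min(range(1, max_depth + 1),
               key=lambda depth: time_per_instruction(depth, **model_params))


if __name__ == "__main__":
    for depth in (1, 5, 10, 20, 40):
        print(f"{depth:2d} stages: {time_per_instruction(depth):.2f} ns per instruction")
    print("best depth in this toy model:", best_depth())
    print("best depth with perfect prediction:", best_depth(mispredict_rate=0.0))
```

In this model, a pipeline with perfect branch prediction always benefits from more stages, while any nonzero misprediction rate produces a sweet spot somewhere in the middle. Where that sweet spot lies depends entirely on the assumed parameters, so the model says nothing about the ideal depth of a real processor.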
The integer pipeline of Athlon is 10 stages long, which is considered an almost ideal length for clock speeds of 500 to 1000 MHz. Pentium III's integer pipeline is 12 to 17 stages long and is thus more sensitive to wrongly predicted branches. As you can see from the picture above, the floating point pipeline of Athlon is 15 stages long, compared with an estimated 25 or more stages in Pentium III.