This question is kind of a software question but I think the answer ultimately comes down to a CPU architecture question, so it seems like this is a good place to ask it.
I have written an arbitrary-precision multiplication function and have been using it on the Pentium IV for several years. It uses SSE2 instructions for some parts, and on the Pentium IV it is a little more than twice as fast as the best similar function I can write using the general-purpose registers. I recently bought a quad-core Core2 system, and I have noticed that the difference between the SSE2-based function and the GPR-based function on that machine is much less than a factor of two. In other words, the SSE2 and GPR functions run at much closer to the same speed on the Core2, with the SSE2 function being only about 30% faster, not twice as fast. I'm not comparing the Core2 to the P-IV, but rather the incremental improvement of the SSE2 function relative to the GPR function.
The question is why this might be. Did Intel make the GPR instructions on the Core2 better/faster, or did they do something that resulted in a slow-down of the SSE2 instructions? Again, I'm not trying to compare the Core2 to the P-IV, but rather the SSE2/GPR speed on the Core2 relative to that ratio on the P-IV.
It's hard to venture an answer to your question without much more info, for example:
1) How exactly are you measuring the speed of the two versions? At the entry/exit to an assembly-language procedure?
2) Does your routine have sole use of the CPU, or is there an OS that might be interrupting the routine?
3) Is the routine hand-coded assembly language, or a higher-level language that might be generating some unexpected code?
4) Is your routine computation-limited, or might memory use have an effect?
Thanks for the comments. A little more info if it will help:
The SSE2 function mainly relies on PMULUDQ, the packed unsigned doubleword multiply instruction, which yields two 64-bit products per 128-bit register. All data is 16-byte aligned, of course. There are a bunch of SSE swaps and adds to get the partial products aligned for summation.
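For readers who have not used the instruction, here is a minimal sketch of what a single PMULUDQ does, written with the `_mm_mul_epu32` intrinsic rather than hand-coded assembly. The helper name `mul_even_lanes` is made up for illustration and is not the poster's routine; unaligned loads are used so the sketch works on any buffer, and a scalar fallback is included as an assumption for builds without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: multiplies the 32-bit values in lanes 0 and 2
   of a and b, giving two full 64-bit products:
   out[0] = a[0] * b[0], out[1] = a[2] * b[2].
   Lanes 1 and 3 are ignored, which is why real code needs extra
   shuffles to cover every partial product. */
void mul_even_lanes(uint64_t out[2], const uint32_t a[4], const uint32_t b[4])
{
#ifdef HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i p  = _mm_mul_epu32(va, vb);   /* compiles to PMULUDQ */
    _mm_storeu_si128((__m128i *)out, p);
#else
    /* Scalar fallback with the same result, for non-SSE2 builds. */
    out[0] = (uint64_t)a[0] * b[0];
    out[1] = (uint64_t)a[2] * b[2];
#endif
}
```

So each PMULUDQ delivers two 32x32->64 products, versus one per scalar multiply, which is where the hoped-for factor of two comes from before shuffle and add overhead is counted.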
The GPR function is essentially what you'd expect -- two nested loops with an IMUL and summations.
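As a rough picture of that "two nested loops" structure, here is a schoolbook multiplication sketch in C. It is not the poster's assembly: the function name `mp_mul_schoolbook`, the 32-bit little-endian limb layout, and the argument order are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: r[0 .. na+nb-1] = a[0 .. na-1] * b[0 .. nb-1].
   Limbs are 32 bits, least significant first. r must not overlap a or b. */
void mp_mul_schoolbook(uint32_t *r, const uint32_t *a, size_t na,
                       const uint32_t *b, size_t nb)
{
    assert(na > 0 && nb > 0);

    for (size_t k = 0; k < na + nb; k++)
        r[k] = 0;

    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            /* 32x32 -> 64-bit product, plus the limb already in place,
               plus the running carry; the sum always fits in 64 bits. */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + nb] = (uint32_t)carry;   /* top limb of this row */
    }
}
```

The inner loop is one multiply and two additions per limb pair, so its speed is tied closely to the latency and throughput of the scalar multiply on a given CPU.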
These are all hand-coded assembly, with very tiny wrappers in C. The testing is done at the C entry point, but there is negligible overhead from that, especially when multiplying really long numbers.
Windows XP is running the show on both the P-IV and Core2. (I of course do not own my computer; Windows does and just lets me use it when it feels like it.)
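Since the timing is taken at the C entry point under an OS that can preempt the routine, one common way to reduce that noise is to time several trials and keep the fastest. The sketch below assumes the standard `clock()` function is precise enough once the call is repeated many times; `best_seconds_per_call` and `dummy_work` are hypothetical names, with `dummy_work` standing in for the real multiply.

```c
#include <time.h>

/* Hypothetical stand-in for the routine under test. */
void dummy_work(void *arg)
{
    volatile unsigned *p = (volatile unsigned *)arg;
    for (unsigned i = 0; i < 1000; i++)
        *p += i;
}

/* Calls fn(arg) `reps` times per trial, for `trials` trials, and
   returns the fastest trial's time per call in seconds, as reported
   by clock(). Returns -1.0 if trials is zero or negative. */
double best_seconds_per_call(void (*fn)(void *), void *arg,
                             int trials, int reps)
{
    double best = -1.0;
    for (int t = 0; t < trials; t++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++)
            fn(arg);
        double s = (double)(clock() - start) / CLOCKS_PER_SEC / reps;
        if (best < 0.0 || s < best)
            best = s;   /* the minimum is least disturbed by interrupts */
    }
    return best;
}
```

Running both the SSE2 and GPR versions through the same harness, with the same operand sizes, keeps the comparison between them fair on each machine.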
I've read the document Mondoman cited, and it seems that the Core2 is more efficient overall than the P-IV, having wider memory access, deeper pipelines, and also wider internal SSE2 pathways. But I would have expected the improvements to maintain the performance ratio of SSE2 to GPR instructions; it is very surprising that these two functions, which differed by more than a factor of two on the P-IV, are now so similar.
Does anyone know of any good tools to profile the assembly code on the two architectures to see how they're working?
I am not a software engineer, and your comments show that you have a much broader understanding of the topic at hand than I do. However, I think I'm following along with what you're saying, and I think it all boils down to the classic shortcomings of the P4 architecture.
My take on your observations is that the Pentium 4 just had relatively weak GPR performance. That was the reason why early P4s were being outclassed by the Pentium III at initial launch; there was no code written with SSE2 at the time to show how strong the Pentium 4 could be, so it executed code using the GPR and floating-point calculations typical of programs of that time.
Your observations also support what is seen in programs with heavily optimized SSE2 code, such as some video encoding applications. In these programs, the Pentium 4 is quite strong. In fact, a Pentium D 930 is as quick as a Core 2 Duo E4300 in a few of these applications, which is quite impressive, considering that in applications that do not utilize SSE2, such as games, it takes the top-of-the-line Pentium Extreme Edition 965 to match the performance of the E4300.
In summary, I'm saying that there was plenty of room for improvement in the GPR of the Pentium 4, and Intel did a damn fine job of remedying this in the Core 2 Duo line of processors. The SSE of the P4 was already quite good, so the Core 2 can only boast a relatively modest improvement.