I have performed my own tests with some hand-written assembler code.
You may believe me or you may not, your choice.
I have the same test code written in 32-bit and 64-bit assembler. I ran both versions in Windows x64 on a Core 2 Duo E6300 @ 2240 MHz and on an AMD Opteron 165 @ 2601 MHz. The clock speeds of both machines are actually irrelevant, because I don't measure time but CPU clocks per element, and I have no intention of comparing speeds.
The code I used has no advantage in 64-bit mode, on purpose: no benefit from more GPRs, no benefit from more memory, no benefit from 64-bit integer arithmetic, and no benefit from SSE code, because it was already using SSE for floating-point calculations in 32-bit mode. The test is also single-threaded on purpose.
Bear in mind that the code favors Intel architecture a bit, although it was written in the Netburst era, so it doesn't actually target the Core 2 Duo micro-architecture; it targets something more like Prescott and Pentium D.
The results are as follows. Both the Core 2 Duo and the Opteron take a slight penalty for 64-bit code, which is expected:
Opteron 32-bit 10.25 clocks per element
Opteron 64-bit 11.73 clocks per element
Core 2 Duo 32-bit 7.49 clocks per element
Core 2 Duo 64-bit 7.92 clocks per element
Apart from the obvious fact that the Core 2 Duo does more in fewer clock cycles, which is due to its 1-clock SSE engine (note that I didn't say the C2D did it faster; you have to take the duration of the clock cycle into account!), you should notice that the Opteron takes a bigger performance hit in 64-bit mode. This is probably because, as I already said, I wrote the code with Intel CPUs in mind. But even if you attribute 10% out of that 14.43% to my coding favoring Intel, you still end up with a slowdown.
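To make the arithmetic explicit, here is a minimal Python sketch that recomputes the 64-bit penalty from the clocks-per-element figures above and converts them to time per element using the quoted clock speeds. The function names and dictionary layout are hypothetical, chosen for illustration; only the measured values come from the post.

```python
# Clocks per element from the results above: (32-bit, 64-bit).
RESULTS = {
    "Opteron": (10.25, 11.73),
    "Core 2 Duo": (7.49, 7.92),
}
# Clock speeds in MHz as quoted for the two test machines.
CLOCK_MHZ = {"Opteron": 2601, "Core 2 Duo": 2240}


def penalty_percent(clocks32, clocks64):
    """Relative slowdown of the 64-bit build, in percent."""
    return (clocks64 / clocks32 - 1.0) * 100.0


def ns_per_element(clocks, mhz):
    """Convert clocks per element to nanoseconds per element."""
    return clocks * 1000.0 / mhz


for cpu, (c32, c64) in RESULTS.items():
    mhz = CLOCK_MHZ[cpu]
    print(f"{cpu}: 64-bit penalty {penalty_percent(c32, c64):.2f}%, "
          f"{ns_per_element(c32, mhz):.2f} ns (32-bit) vs "
          f"{ns_per_element(c64, mhz):.2f} ns (64-bit) per element")
```

The penalty works out to roughly 14.4% for the Opteron and roughly 5.7% for the Core 2 Duo. The nanosecond figures only illustrate why clocks per element and wall-clock speed are different things; they are not a speed comparison between the two machines.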
If you think about it, it is quite logical.
If the code doesn't benefit from 64-bit mode explicitly through:
1. bigger address space (needs more than 2GB of RAM)
2. more GPRs (high register pressure, very rare and almost completely solvable with hardware register renaming)
3. 64-bit integer arithmetic (workarounds again possible)
4. SSE or SSE2 floating point instead of legacy FPU (was attainable with other compilers in 32-bit mode too, even with MSVC 2003)
then it will only experience a 5-10% slowdown, for two reasons:
1. Pointer arithmetic has to be done in 64-bit precision, so those instructions need a REX prefix (the 0x48 byte), and constants or addresses that don't fit in 32 bits have to be encoded with the full 64 bits. That means lengthening the instructions and thus impairing decoding bandwidth.
2. Accessing the additional GPRs also requires a prefix, with the same outcome as in #1. Not to mention that using those GPRs requires them to be saved on the stack under the new register calling convention.
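The prefix cost in the two points above can be seen by encoding one instruction by hand. The following is a small Python sketch, not a real assembler: it only handles register-to-register ADD (opcode 0x01 /r), and the helper name `encode_add` and the `REG` table are hypothetical names introduced for this illustration.

```python
# Register numbers as used by the x86-64 encoding (rax=0 ... r15=15).
# The 32-bit names (eax, r8d, ...) share the same numbers.
REG = {name: i for i, name in enumerate(
    ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
     "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"])}


def encode_add(dst, src, wide):
    """Encode register-to-register ADD dst, src (opcode 0x01 /r).

    dst and src are register numbers 0..15; wide=True selects 64-bit
    operand size, wide=False selects 32-bit.
    """
    # REX prefix layout: 0100WRXB. W = 64-bit operand size,
    # R extends the ModRM.reg field (src), B extends ModRM.rm (dst).
    rex = 0x40 | (int(wide) << 3) | ((src >> 3) << 2) | (dst >> 3)
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)  # mod=11: register direct
    prefix = [rex] if rex != 0x40 else []        # legacy 32-bit form needs no REX
    return bytes(prefix + [0x01, modrm])


for label, dst, src, wide in [
    ("add eax, ebx", "rax", "rbx", False),
    ("add rax, rbx", "rax", "rbx", True),
    ("add r8d, r9d", "r8", "r9", False),
    ("add r8, r9", "r8", "r9", True),
]:
    code = encode_add(REG[dst], REG[src], wide)
    hex_bytes = " ".join(f"{b:02x}" for b in code)
    print(f"{label:14s} {hex_bytes}  ({len(code)} bytes)")
```

Only the plain 32-bit form fits in two bytes. The 64-bit form picks up the 0x48 prefix, and touching r8-r15 costs a prefix byte even at 32-bit operand size.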
Since 64 bits were tacked onto existing 32-bit CPUs (whoever claims that AMD was designed to be 64-bit from the start has been smoking something weird), it is reasonable to expect this slowdown. So if you still want to base your buying decision on marketing hype, then go ahead.
Thanks for replying. Levicki's response was more of what I was going for. Tom's offers no good benchmarking results for 64-bit processing on Core 2 vs. A64, so the CPU charts don't help ninja.
I think it would be interesting to do some 64-bit gaming comparisons between A64 and Core 2, especially Counter-Strike, since the Source engine now supports both 64-bit and multi-threading. I've read the X-bit labs article before and have heard about the error.
The other test I would be interested in is processing under Vista with 64-bit apps. Theoretically the values should show a trend similar to XP x64 Edition, but who knows what will happen ^_^.
Of course it is because it is the same code written in assembler.
Did you write the code to make Opterons look bad?
On the contrary. I wrote the code to maximize performance on Netburst CPUs back when I worked on that particular problem (3D reconstruction). It isn't even tuned for Core 2 Duo, as I wrote above. Unfortunately the code is not my property; it belongs to the company I worked for, so I cannot publish it for you to see. I could tell you roughly how it works if you want.
Have you ever purchased an AMD chip?
I have not purchased one for myself, but I did for a few low-end configurations I sold, like Barton 2500+ and S939 Semprons.
If you meant to ask whether I have used chips other than Intel, then yes, I used Cyrix in the old days when Pentium was expensive. I don't see why this is important to you.
I have fine-tuned the code for Core 2 Duo (there were some false dependencies and some aliasing) and I have new results:
Opteron 32-bit 10.50 clocks per element
Opteron 64-bit 11.05 clocks per element
Core 2 Duo 32-bit 6.17 clocks per element
Core 2 Duo 64-bit 6.57 clocks per element
Note that the Opteron fares better now too, because I have reduced the usage of 64-bit registers in the code, which improved decoding bandwidth (fewer prefixes = smaller and faster code). The relative difference is now larger for the Core 2 Duo because I have managed to speed up the 32-bit version considerably.
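For anyone who wants to check the before/after comparison, here is a short Python sketch that computes the 32-bit to 64-bit gap for both sets of results, in absolute clocks and in percent. The names `BEFORE`, `AFTER`, and `gap` are hypothetical and used only for this illustration; the numbers are the ones posted in this thread.

```python
# (32-bit, 64-bit) clocks per element, before and after the tuning.
BEFORE = {"Opteron": (10.25, 11.73), "Core 2 Duo": (7.49, 7.92)}
AFTER = {"Opteron": (10.50, 11.05), "Core 2 Duo": (6.17, 6.57)}


def gap(clocks32, clocks64):
    """Return (absolute gap in clocks, relative gap in percent)."""
    diff = clocks64 - clocks32
    return diff, diff / clocks32 * 100.0


for cpu in BEFORE:
    abs_before, rel_before = gap(*BEFORE[cpu])
    abs_after, rel_after = gap(*AFTER[cpu])
    print(f"{cpu}: before {abs_before:.2f} clocks ({rel_before:.2f}%), "
          f"after {abs_after:.2f} clocks ({rel_after:.2f}%)")
```

On the Core 2 Duo the gap shrinks slightly in absolute clocks but grows in percent, because the 32-bit baseline dropped from 7.49 to 6.17 clocks per element. On the Opteron the gap shrinks both ways.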