I have performed my own tests with some hand-written assembler code.
You may believe me or you may not, your choice.
I have the same test code written in 32-bit and 64-bit assembler. I ran both versions under Windows x64 on a Core 2 Duo E6300 @ 2240 MHz and on an AMD Opteron 165 @ 2601 MHz. The clock speeds of both machines are actually irrelevant because I measure CPU clocks per element rather than time, and I have no intention of comparing speeds.
The code I used has no advantage in 64-bit mode, on purpose -- no benefit from more GPRs, no benefit from more memory, no benefit from 64-bit integer arithmetic, and no benefit from SSE code because it was already using SSE for floating-point calculations in 32-bit mode. Moreover, the test is single-threaded, also on purpose.
Bear in mind that the code favors Intel architecture a bit, although it was written in the Netburst era, so it doesn't actually target the Core 2 Duo micro-architecture but rather Prescott and Pentium D.
The results are as follows -- both the Core 2 Duo and the Opteron show a slight penalty for 64-bit code, which is expected:
[code:1:4276bbdfc7]
Opteron    32-bit  10.25 clocks per element
Opteron    64-bit  11.73 clocks per element
64-bit penalty:    14.43%

Core 2 Duo 32-bit   7.49 clocks per element
Core 2 Duo 64-bit   7.92 clocks per element
64-bit penalty:     5.74%
[/code:1:4276bbdfc7]
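For anyone who wants to check the percentages, here is a minimal Python sketch of the arithmetic. The function name `slowdown_percent` is just something made up for illustration, and the inputs are the rounded figures from the table above, so the last digit may differ slightly from the percentages quoted there.

```python
def slowdown_percent(clocks_32: float, clocks_64: float) -> float:
    """Relative slowdown of the 64-bit build versus the 32-bit build, in percent."""
    return (clocks_64 - clocks_32) / clocks_32 * 100.0


# Clocks per element, copied from the results above (rounded values).
results = {
    "Opteron": (10.25, 11.73),
    "Core 2 Duo": (7.49, 7.92),
}

for cpu, (c32, c64) in results.items():
    print(f"{cpu}: {slowdown_percent(c32, c64):.2f}% slower in 64-bit mode")
```

Because the metric is clocks per element, the different clock speeds of the two machines never enter the calculation.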
Apart from the obvious fact that the Core 2 Duo does more in fewer clock cycles, which is due to its 1-clock SSE engine (note that I didn't say the C2D did it faster, you have to take the duration of the clock cycle into account!), you should notice that the Opteron takes a bigger performance hit in 64-bit mode. This is probably because, as I already said, I wrote the code with Intel CPUs in mind. But even if you write off 10% of those 14.43% to my coding skills favoring Intel, you still end up with a slowdown.
If you think about it, it is quite logical.
If the code doesn't benefit from 64-bit mode explicitly through:
1. bigger address space (needs more than 2GB of RAM)
2. more GPRs (high register pressure, very rare and almost completely solvable with hardware register renaming)
3. 64-bit integer arithmetic (workarounds again possible)
4. SSE or SSE2 floating point instead of legacy FPU (was attainable with other compilers in 32-bit mode too, even with MSVC 2003)
then all it will experience is a 5-10% slowdown.
Why?
Because:
1. Pointer arithmetic has to be done in 64-bit precision, and every instruction with a 64-bit operand size needs a REX prefix (the 0x48 byte); loading a full 64-bit constant additionally needs an 8-byte immediate. That means lengthening the instructions and thus impairing decoding bandwidth.
2. Accessing the additional GPRs also requires a prefix, with the same outcome as in #1. Not to mention that using those GPRs requires them to be saved on the stack under the new register calling convention.
Since 64 bits were tacked onto existing 32-bit CPUs (whoever claims that AMD's CPU was designed to be 64-bit from the start has been smoking something weird) it is reasonable to expect this slowdown. So if you still want to base your buying decision on marketing hype then go ahead.
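To make the two points above concrete, here is a rough Python sketch that hand-assembles a register-to-register ADD (opcode 01 /r) and shows how the REX prefix adds a byte. The helper names `rex` and `add_reg_reg` are made up for this illustration, and it covers only this one instruction form; it is not a general encoder.

```python
def rex(w: int = 0, r: int = 0, x: int = 0, b: int = 0) -> int:
    """Build a REX prefix byte: 0100WRXB."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


def add_reg_reg(dst: int, src: int, wide: bool) -> bytes:
    """Encode ADD dst, src (opcode 01 /r) for register numbers 0..15.

    wide=True selects 64-bit operand size (REX.W); registers 8..15
    need REX.R / REX.B even when the operand size stays 32-bit.
    """
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)  # mod=11: register direct
    prefix = b""
    if wide or src > 7 or dst > 7:
        prefix = bytes([rex(w=int(wide), r=src >> 3, b=dst >> 3)])
    return prefix + bytes([0x01, modrm])


# Hardware register numbers: (e/r)ax = 0, (e/r)bx = 3, r8 = 8, r9 = 9.
AX, BX, R8, R9 = 0, 3, 8, 9

examples = [
    ("add eax, ebx", add_reg_reg(AX, BX, wide=False)),
    ("add rax, rbx", add_reg_reg(AX, BX, wide=True)),
    ("add r8d, r9d", add_reg_reg(R8, R9, wide=False)),
    ("add r8, r9", add_reg_reg(R8, R9, wide=True)),
]

for text, code in examples:
    print(f"{text:14s} {code.hex(' ')}  ({len(code)} bytes)")
```

Only the first form, the one available in 32-bit mode, fits in two bytes; the other three each carry one extra prefix byte that the decoder has to chew through.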