Guest
These quotes are from Ace's Hardware's testing of the P4:
"
Both the P4 and the Itanium load FP instructions from the L2 cache (6 cycles latency), another indication that data dependencies are few and far between. As long as enough data arrives, the FPU can continue crunching and it doesn't matter if it takes a bit longer to get a certain piece of data. Therefore, we say that FP intensive code is more dependent upon bandwidth than Integer code.
The cachemem benchmark in the first article indicated that the latency of DDR SDRAM is about 16% lower than the average latency of dual-channel Rambus (i850). The stream test pointed out that in the best circumstances, the i850 (DRDRAM) can offer almost 80% more bandwidth than the AMD-760 (DDR SDRAM) (1574 vs 889). It is this massive amount of bandwidth that enabled the Pentium 4 to perform twice as well as the Athlon in the FP-intensive Linpack benchmark (Array size > 1024 KB). "
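A quick check of the "almost 80%" figure, as a sketch only. The two stream numbers are copied from the quote (the quote does not give units), and the variable names are mine:

```python
# Stream results as quoted above; units are not stated in the quote.
i850_drdram = 1574
amd760_ddr = 889

# Relative bandwidth advantage of the i850 over the AMD-760.
advantage = i850_drdram / amd760_ddr - 1.0
print(f"i850 advantage over AMD-760: {advantage:.1%}")
```

That works out to about 77%, so "almost 80%" is a fair rounding of the quoted numbers.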
The article also mentions:
" Intel added two instruction prefixes (to P4): "HWNT," Hint Weakly Not Taken, and "HST," Hint Strongly Taken. As such, the programmer or compiler can lessen the impact of branch misprediction somewhat. But how much?
The hint instructions are useful if (1) there are some branches in which the pattern does not converge well, but some degree of bias does nevertheless exist, or (2) there are so many branches that the CPU is running out of branch predictor resources.
The first can only be easily collected with "feed back directed optimizations," in which you perform a "test run" to collect information about the branch direction, ratio, and how well the predictors can predict them. This data is then used in a recompile.
Intel's latest version of its C++ compiler (5.0) is probably the most advanced x86 compiler on Earth. Intel C++ 5.0 was able to use this method to enable branch hint instructions to boost the Pentium 4's performance in SpecInt. Since these feedback directed optimizations, or profiling, are permitted in both the base and peak CPU2000 benchmarks, we unfortunately are unable to compare the two results to determine the overall impact of this optimization.
Integer applications, which have a rather large memory footprint like many of the benchmarks in the SPEC2000int suite, may perform much better on the Pentium 4 if developers decide to compile with special Pentium 4 optimized options.
Yes, the Pentium 4 has a lot of untapped potential here.
As you can see, all three Pentium 4 systems, from 1.3 GHz to 1.5 GHz, perform nearly identically in the swim benchmark, indicating that increased clock-speed, and thus compute performance, is not the deciding factor in this particular benchmark. Further evidence supporting this claim is visible in the 1.2 GHz Athlon DDR's performance, which shows a considerable increase over its SDR SDRAM counterpart. This difference is not as pronounced in the other SPEC benchmarks shown here, but if enough do show this bias towards high-bandwidth memory interfaces, then this could boost the Pentium 4's overall scores considerably. To see more CPU2000 results, broken down into the individual sub-tests, see the Appendix.
P4 1200 AMD 794 "
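Case (1) in the quote above (a branch with no learnable pattern but a clear bias) can be illustrated with a toy simulation. This is only a sketch under my own assumptions: a single 2-bit saturating counter stands in for a dynamic predictor (it is not the P4's real predictor), the 70% bias is a made-up number, and the "profiling run" reuses the same trace it is scored on.

```python
import random

def two_bit_counter_misses(outcomes):
    """Count mispredictions of a single 2-bit saturating counter."""
    state = 2  # states 0-1 predict not taken, 2-3 predict taken
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

def static_hint_misses(outcomes, hint_taken):
    """Count mispredictions when a fixed hint decides every prediction."""
    return sum(1 for taken in outcomes if taken != hint_taken)

def simulate(bias=0.7, n=100_000, seed=1):
    """Return (dynamic, static) misprediction rates for a biased random branch."""
    rng = random.Random(seed)
    outcomes = [rng.random() < bias for _ in range(n)]
    # Stand-in for a profiling run: hint the majority direction of the trace.
    hint_taken = sum(outcomes) * 2 >= n
    dynamic = two_bit_counter_misses(outcomes) / n
    static = static_hint_misses(outcomes, hint_taken) / n
    return dynamic, static

dynamic_rate, static_rate = simulate()
print(f"2-bit counter: {dynamic_rate:.3f}  static hint: {static_rate:.3f}")
```

With a random but biased branch, the counter keeps getting knocked off the majority direction, so the fixed hint mispredicts less often in this model. On a branch with a regular pattern, the result could go the other way.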
"The third reason, in combination with some SSE2 optimizations (the second reason), might explain why some CPU-intensive benchmarks are still running faster on the Pentium 4.
Still, let us take a look at some SpecFP numbers on different configurations.
SPEC FP
P4 549
AMD 1.2 359
The results of SpecFP 2000 show that the benchmark as a whole benefits more from increased memory bandwidth than from higher clock speeds. "
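Why a bandwidth-bound benchmark like swim barely moves with clock speed can be sketched with a very simple model: run time is whichever is longer, the compute time or the time to move the data. Everything in the workload below (FLOP count, bytes moved, FLOPs per cycle, bandwidth figures) is a hypothetical number picked for illustration; only the two SPEC FP scores come from the quote.

```python
def run_time(flops, bytes_moved, clock_ghz, flops_per_cycle, bandwidth_gb_s):
    """Seconds needed if compute and memory traffic overlap perfectly."""
    compute_s = flops / (clock_ghz * 1e9 * flops_per_cycle)
    memory_s = bytes_moved / (bandwidth_gb_s * 1e9)
    return max(compute_s, memory_s)

# Hypothetical streaming workload: little arithmetic per byte touched.
work = dict(flops=1e9, bytes_moved=8e9, flops_per_cycle=1.0)

t_13 = run_time(clock_ghz=1.3, bandwidth_gb_s=1.5, **work)
t_15 = run_time(clock_ghz=1.5, bandwidth_gb_s=1.5, **work)
t_fast_mem = run_time(clock_ghz=1.3, bandwidth_gb_s=3.0, **work)

print(f"1.3 GHz: {t_13:.2f} s, 1.5 GHz: {t_15:.2f} s, "
      f"1.3 GHz with 2x bandwidth: {t_fast_mem:.2f} s")

# SPEC FP scores as quoted above.
spec_fp_ratio = 549 / 359
print(f"P4 / Athlon SPEC FP ratio: {spec_fp_ratio:.2f}")
```

In this model the faster clock changes nothing while doubling the bandwidth halves the run time, which is the same shape as the swim result described in the quote.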
So you can see that the bandwidth and the compilers for the P4 make the P4 much faster, and this is why some lame testers do not make fair tests.
CAMERON
CYBERIMAGE
http://www.4CyberImage.com
Ultra High Performance Computers-