Vista: Benchmarking or Benchmarketing?

Benchmark Basics

There are two basic types of benchmarks: real-world benchmarks, which are based on software or application scenarios that people actually use every day, and so-called synthetic benchmarks. The latter type consists of programs designed for the sole purpose of stressing a system or specific components during testing. 3DMark and PCMark by Futuremark, and Sandra by SiSoftware are typical examples of synthetic benchmarks. Examples of real-world benchmarks would be a script running several tasks in Adobe's Photoshop, transcoding one video format into another, or running a 3D game.

Both benchmark types return either a composite score or a more tangible result, such as the time required to process the benchmark workload or a specific metric: frames per second (graphics), calculations or iterations per second (processors), megabytes per second (hard drives or main memory), and so on. Depending on what you're benchmarking, you'll receive one or more results, which have to be weighted before you come to a conclusion. You may want to put more emphasis on a certain result, depending on your requirements.
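As a minimal illustration of such weighting, the Python sketch below combines a few hypothetical sub-results into one composite score. The metric names, reference values, and weights are made up for illustration; you would choose weights that reflect your own priorities.

# Hypothetical example: combining several benchmark results into one
# weighted score. Each result is first normalized against a reference
# system so that differing units do not skew the composite score.

results = {
    "game_fps": 65.0,        # frames per second in a game
    "transcode_fps": 31.5,   # video frames transcoded per second
    "disk_mbps": 78.2,       # megabytes per second
}

reference = {                # results of the reference system
    "game_fps": 60.0,
    "transcode_fps": 30.0,
    "disk_mbps": 75.0,
}

weights = {                  # made-up priorities; must sum to 1.0
    "game_fps": 0.5,
    "transcode_fps": 0.3,
    "disk_mbps": 0.2,
}

composite = sum(
    weights[k] * (results[k] / reference[k]) for k in results
) * 100

print(f"Weighted composite score: {composite:.1f}")  # 100 = reference system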

While some synthetic benchmarks, such as 3DMark, have managed to become an industry standard at the consumer level, we still prefer real-world benchmarks because they typically return more tangible results. For example, 65 frames per second in a particular game at particular settings means more to most users than a 3DMark score of 6583. Is that a lot? 65 frames per second is certainly enough to provide smooth graphics. But you could argue that a 3DMark score can easily be compared across completely different systems, because all that counts is the score, while a direct comparison of frames per second leads experienced users to ask about system details.

In any case, benchmark results are only worth the work if there are other results to compare them to. Measuring the frames-per-second performance of a brand new graphics card at several resolutions is worth little if you don't also have the results for its closest competitor, and perhaps its predecessor. The very first step is creating a consistent test setup, which must not change apart from the component under test (e.g. swapping graphics cards on an otherwise identical test system). It is also helpful to define a baseline to make it easier to rate the differences. If 65 frames per second equals a baseline of 100, for example, 73 frames per second would be 112, or a 12% increase.
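To make that normalization concrete, here is a minimal Python sketch of the baseline approach, using the numbers from the example above.

# Sketch of the baseline approach: the reference configuration's result
# is set to 100, and every other result is expressed relative to it.

def normalized_score(result, baseline_result):
    """Scale a raw result so the baseline configuration equals 100."""
    return result / baseline_result * 100

baseline_fps = 65.0   # reference graphics card
candidate_fps = 73.0  # new graphics card on the same test system

score = normalized_score(candidate_fps, baseline_fps)
print(f"Normalized score: {score:.0f}")  # ~112, i.e. a 12% increase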

It is also crucial to be diligent and make sure that all results are actually reproducible. One part of this is verifying results; the other is finding out whether a particular benchmark shows a lot of variance. Running each benchmark at least three times has proven to be reasonable, but the more variance you find, the more repetitions you should run. If you still get varying results after ten runs, you should probably drop the benchmark and find a more consistent alternative. When working with benchmark results, you can either calculate an average or take the fastest/slowest result; which approach makes the most sense depends heavily on the type of benchmark.
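The following Python sketch shows one way to implement this repetition logic. It assumes a hypothetical run_benchmark() function that returns a single numeric result per run; the 3% spread threshold is an arbitrary choice, not a fixed rule.

# Repeat a benchmark until its results settle, up to a maximum number
# of runs, then report the average along with the best and worst runs.

import statistics

def measure(run_benchmark, min_runs=3, max_runs=10, max_spread=0.03):
    results = []
    for _ in range(max_runs):
        results.append(run_benchmark())
        if len(results) >= min_runs:
            spread = (max(results) - min(results)) / statistics.mean(results)
            if spread <= max_spread:  # runs agree to within e.g. 3%
                break
    return {
        "runs": len(results),
        "average": statistics.mean(results),
        "best": max(results),   # e.g. the highest frame rate
        "worst": min(results),
    }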

Finally, you should be aware that Windows Vista is capable of optimizing application startup time with its Superfetch feature, which pre-caches frequently launched applications in available main memory. This is particularly important for benchmarks that launch multiple modules, such as BAPCo's SYSmark suite, which starts one application after the other. Vista will recognize the pattern as you run the benchmark multiple times, and the individual applications will launch faster, improving the benchmark results over time. Here you should either train each system the same way (by running the benchmark several times before recording results) or disable the Superfetch service.
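For the second option, the Python sketch below disables the service before testing and restores it afterwards. It assumes the Superfetch service is registered under the internal name SysMain (its service name on Vista) and that the script is run from an elevated, administrator-level prompt.

# Toggle the Superfetch service via the Windows "sc" command.
# Assumption: the service name is "SysMain"; requires admin rights.

import subprocess

def set_superfetch(enabled):
    service = "SysMain"
    if enabled:
        # Re-enable automatic startup, then start the service again.
        subprocess.run(["sc", "config", service, "start=", "auto"], check=True)
        subprocess.run(["sc", "start", service], check=True)
    else:
        # Stop the service and prevent it from starting on reboot.
        subprocess.run(["sc", "stop", service], check=True)
        subprocess.run(["sc", "config", service, "start=", "disabled"], check=True)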