The dichotomy between AMD’s and Nvidia’s design principles persists in 2010.
The former stands by its “sweet spot” strategy, whereby a reasonably-sized GPU (if you can call a 2.15 billion transistor chip reasonable) serves to address what we’d call the high-end market, while derivatives cover the price segments below. Addressing the more demanding enthusiast community involves multi-GPU configurations—this generation’s example is the dual-GPU Radeon HD 5970.
Meanwhile, Nvidia has another behemoth on its hands. Though the two companies almost certainly count transistors differently, GF100 is said to consist of more than three billion of them, up from the GT200’s 1.4 billion. There’s no word yet on how Nvidia plans to implement lower-cost versions of its Fermi architecture (all of the details being released now center on one specific chip), but as you’ll see, the design is deliberately modular. So, whereas all of the GeForce GTX 200-series boards employed one (expensive) GPU, there’s a better chance that this time we’ll see Nvidia do some cutting on lower-end versions.
Like ATI with its Radeon HD 5000-series cards, Nvidia employs TSMC’s 40nm manufacturing process, which has thus far struggled to reach the yield levels needed for AMD to meet its demand. It’ll be interesting to see if the fab’s teething pains affect Nvidia similarly.
And given Nvidia’s cautionary note on power, it’s a fairly safe bet that dual-GPU versions a la GeForce GTX 295 will make way for dual-card SLI configurations instead. Not that we expect Nvidia to need a card with two GF100s on-board. Should the company achieve ~2x the performance of GeForce GTX 285 in today’s games (and it’d appear that, given improvements to texturing/AA, GF100 will see scenarios in excess of a 2x boost), it’ll already be competing against Radeon HD 5970 using one graphics processor.
The Building Blocks
So, why exactly might we suspect GF100 of outperforming its predecessor by such a compelling margin? It’s largely a matter of comparing architectures. Fortunately, the GF100 design is derived from GT200, which itself was derived from the almost-infamous G80/G92. If you’re already familiar with Nvidia’s previous-generation designs, understanding its latest should be somewhat straightforward.
The fundamental building block remains the stream processor, marketed now as a CUDA core. GF100 boasts 512 of these CUDA cores versus GT200’s 240. Thus, clock for clock, we’re looking at the potential for 2.13x the performance of GeForce GTX 285, assuming no other optimizations. However, Nvidia was aware of GT200’s weaknesses in designing GF100, and it claims those have been addressed here with a bit of architectural shuffling. In practice, Nvidia says it’s seeing performance in today’s titles roughly two times higher than GT200 with 8xAA enabled.
GT200 sports 10 of what Nvidia calls Texture Processing Clusters (TPCs), each armed with three Streaming Multiprocessors (eight stream processors apiece) and eight texture address/filtering units. That fundamental organization evolves this time around to include a more elegant collection of resources, from the fixed-function raster engine to as many as four of those Streaming Multiprocessors.
These blocks of logic are divided into Graphics Processing Clusters (GPCs), displacing the TPC concept by integrating functionality that previously existed outside the TPC. Now, one GPC is armed with its own raster engine interfacing with up to four SMs, each SM sporting 32 CUDA cores and four dedicated texture units (along with what Nvidia claims as dual schedulers/dispatchers and 64KB of configurable cache/shared memory). GF100, in its fully-operational Death Star configuration, is equipped with four GPCs.
By the numbers, GT200 actually has more texturing units than GF100 (eight per TPC, up to 10 TPCs per GPU versus four texturing units per SM, with up to 16 SMs). However, the focus here is on augmented efficiency: each texture unit computes one address and fetches four samples per clock. As a result, GF100 achieves higher real-world performance, according to Nvidia.
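The unit counts quoted above are easy to check by multiplying out each chip's hierarchy. The following is our own back-of-the-envelope sketch, not anything Nvidia supplies; the function and variable names are hypothetical, and the inputs are simply the per-cluster figures cited in this article for fully enabled chips.

```python
def total_units(clusters, sms_per_cluster, units_per_sm):
    """Multiply out a cluster -> SM -> unit hierarchy (hypothetical helper)."""
    return clusters * sms_per_cluster * units_per_sm

# GT200: 10 TPCs, three SMs per TPC, eight stream processors per SM.
gt200_cores = total_units(10, 3, 8)
# GF100: four GPCs, up to four SMs per GPC, 32 CUDA cores per SM.
gf100_cores = total_units(4, 4, 32)

# Texture units: GT200 has eight per TPC, GF100 has four per SM.
gt200_tex = 10 * 8
gf100_tex = total_units(4, 4, 4)

# Clock-for-clock shader ratio, ignoring any architectural changes.
core_ratio = gf100_cores / gt200_cores

print(gt200_cores, gf100_cores, round(core_ratio, 2))
print(gt200_tex, gf100_tex)
```

The shader count more than doubles while the raw texture unit count drops, which is why Nvidia's per-unit efficiency claims matter so much here.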
Scheduling Via GigaThread
The GPCs are fed by Nvidia’s GigaThread Engine. Made kid-friendly by Nvidia’s marketing team, the engine is GF100’s scheduler, responsible for assigning work to each of the chip’s 16 SMs. Yet it establishes itself as a significant component of the Fermi architecture due to its ability to create and dispatch thread blocks in parallel, rather than the one-kernel-at-a-time approach taken before.
Of course, the GigaThread engine fetches its data from the frame buffer. At first blush, the six 64-bit controllers (totaling 384 bits) seem narrower than GT200’s eight x 64-bit (512-bit total) configuration. However, Nvidia is utilizing GDDR5 this time around, yielding a substantial bandwidth increase, despite the less-complex interface. Assuming the same 1,200 MHz DRAMs AMD is using on its Radeon HD 5870, a GF100-based card would have access to 230.4 GB/s of throughput versus the Radeon’s 153.6 GB/s.
The back-end of GF100 is organized into six ROP partitions, each able to output eight 32-bit integer pixels per clock. This compares favorably to GT200’s eight blocks capable of four pixels per clock. Nvidia maintains one 64-bit memory controller per block, but realizes an overall increase from 32 pixels per clock to 48. Perhaps you noticed, in our Radeon HD 5870 coverage, the improvements to ATI’s anti-aliasing performance over its previous-generation hardware. Meanwhile, the GT200-based GeForce GTX 285 took a more substantial hit as you cranked up AA.
This is another area where Nvidia sought to improve with GF100. If you own a card like ATI’s Radeon HD 5870 or are planning to buy something based on GF100, and are running on one display, then you’re enabling whatever detail settings you can in order to utilize the GPU’s massive performance. To this end, GF100 supports a new 32x coverage sampling anti-aliasing (CSAA) mode that Nvidia demonstrated smoothing out banding issues in foliage generated using alpha-textured billboards in Age of Conan. And as a result of its optimizations, Nvidia is claiming a less than 10% performance hit in going from 8x multi-sampling to 32x CSAA.
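For anyone who wants to see where those bandwidth figures come from, here is a rough sketch of the arithmetic. The helper name is our own, and it assumes GDDR5 moves four bits per pin for every cycle of the quoted 1,200 MHz memory clock (4.8 Gb/s per pin). GF100's final memory clock has not been announced, so the 1,200 MHz input is this article's assumption rather than a specification.

```python
def bandwidth_gb_s(bus_width_bits, memory_clock_mhz, transfers_per_clock=4):
    """Peak memory bandwidth in GB/s (hypothetical helper).

    transfers_per_clock=4 is our assumption for GDDR5 relative to
    the quoted memory clock.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9

# GF100: six 64-bit controllers, assumed 1,200 MHz GDDR5.
gf100_bw = bandwidth_gb_s(6 * 64, 1200)
# Radeon HD 5870: 256-bit bus, 1,200 MHz GDDR5.
hd5870_bw = bandwidth_gb_s(256, 1200)

print(f"GF100: {gf100_bw:.1f} GB/s, Radeon HD 5870: {hd5870_bw:.1f} GB/s")
```

At equal memory clocks, the advantage comes entirely from the 50% wider bus.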