Not Just A Compute Architecture
Leading up to the GF100 launch, we heard a lot of buzz that Nvidia was using this opportunity to deemphasize the chip’s role in graphics in favor of GPU computing. Although Nvidia’s own representatives were the first to express concern over this “myth,” the company itself is really to blame for its spread.
The Fermi architecture was first introduced at Nvidia’s own GPU Technology Conference in late September of last year, and at the time only the design’s compute capabilities were discussed. While we had the whitepaper leaked to us prior to the embargo, such a disclosure just one week after ATI’s Radeon HD 5870 launch was a bit much, since retail product was rumored to be more than a quarter away and AMD was already shipping the world’s first DirectX 11 hardware. Nevertheless, we read with great interest some of the most detailed overviews of the Fermi architecture’s capabilities.
Of course, now Nvidia wants everyone to know that it hasn’t backed down from a dedication to graphics performance, either. The GPC architecture, emphasis on geometry, and full DirectX 11 compliance all support such an assertion. However, it’s still easy to look at GF100 and see where the company sought balance between compute and graphics, from clear nods to double precision floating-point math to the chip’s cache architecture.
Each of the 16 SMs sports its own 64KB shared memory/L1 cache pool, which can be configured as either 16KB of shared memory and 48KB of L1 or vice versa. GT200 included 16KB of shared memory per SM to keep data as local as possible, minimizing the need to go out to the frame buffer for information. By expanding this memory pool and making it almost-dynamically configurable, Nvidia addresses graphics and compute problems at the same time. In a physics- or ray-tracing-based compute scenario, for example, you don’t have a predictable addressing mechanism, so the small shared space/large L1 configuration helps improve memory access. This will become particularly notable once game developers start taking better advantage of DirectCompute within their titles.
| Cache | GT200 | GF100 | Benefit |
|---|---|---|---|
| L1 texture cache (per quad) | 12KB | 12KB | Texture filtering |
| Dedicated L1 load/store cache | None | 16/48KB | Useful in physics/ray tracing |
| Total shared memory | 16KB | 16/48KB | More data reuse among threads |
| L2 cache | 256KB (texture read only) | 768KB (all clients read/write) | Compute performance, texture coverage |
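The configurable shared memory/L1 split above isn’t automatic; CUDA exposes it to developers through the runtime’s cache-configuration API. A minimal host-side sketch (the `trace_rays` kernel is a placeholder of our own, and the code assumes the CUDA toolkit and Fermi-class hardware):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for an irregular-access workload
// (e.g. ray tracing) with unpredictable addressing.
__global__ void trace_rays(float *out) { /* ... */ }

int main() {
    // Prefer 48KB L1 / 16KB shared memory when memory accesses
    // are hard to predict, as in ray tracing or physics.
    cudaFuncSetCacheConfig(trace_rays, cudaFuncCachePreferL1);

    // Alternatively, prefer 48KB shared memory / 16KB L1 when threads
    // in a block cooperate on a well-structured working set.
    // cudaFuncSetCacheConfig(trace_rays, cudaFuncCachePreferShared);
    return 0;
}
```

The driver can honor a different preference per kernel, which is what makes the pool “almost-dynamically” configurable rather than a fixed boot-time partition.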
From there, you have a 768KB L2 cache, significantly larger and more functional than GT200’s 256KB texture read-only cache. Because it’s unified, GF100’s L2 replaces the L2 texture cache, ROP cache, and on-chip FIFOs; any client can read from it or write to it. This is another area where compute performance is clearly the target in Nvidia’s crosshairs. However, gaming performance should benefit as well, since SMs working on the same data will make fewer trips to memory.
It’s entirely true that GF100 represents a massive step forward in what third-party developers can do on the compute side. That potential is bolstered by the fact that DirectCompute and OpenCL are here, that AMD supports both APIs as well, and that the way we’re going to get more realistic games is through developers enabling GPU computing within their titles. Ray tracing (alone or combined with rasterization), voxel rendering, custom depth-of-field kernels, particle hydrodynamics, and AI path-finding are all applications we might see by virtue of compute-enabled hardware.