High-Tech And Vertex Juggling - NVIDIA's New GeForce3 GPU

2 Textures Per Clock Cycle, But 4 Textures Per Pass?

From a brute-force hardware point of view, the pixel shader is pretty similar to the NSR of GeForce2. It can fetch two texels per clock cycle, so if 3 or 4 textures are used it requires 2 clock cycles. Combine that with GeForce3's clock frequency of 200 MHz and its four pixel shader units, and you come up with a fill rate of 800 MPixel/s for one or two textures and 400 MPixel/s if 3 or 4 textures are used for one pixel. The texel fill rate is 1,600 MTexel/s for 2 or 4 textures per pixel, 800 MTexel/s for one texture per pixel and 1,200 MTexel/s for three textures per pixel. These are the same raw fill rate numbers as found in a GeForce2 GTS.
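The fill rate figures above follow from simple arithmetic, which can be sketched as below. This is only an illustration: the function name `fill_rates` and its parameters are made up for this example, with defaults taken from the numbers quoted in the text (200 MHz, four pixel shader units, two texels per clock).

```python
def fill_rates(textures_per_pixel, clock_mhz=200, pipelines=4, texels_per_clock=2):
    """Return (MPixel/s, MTexel/s) for a given number of textures per pixel.

    Hypothetical helper; defaults are the GeForce3 figures quoted in the article.
    """
    # A unit fetches at most `texels_per_clock` texels per clock, so n textures
    # need ceil(n / texels_per_clock) clock cycles per pixel.
    cycles = -(-textures_per_pixel // texels_per_clock)  # ceiling division
    mpixels = clock_mhz * pipelines / cycles
    mtexels = mpixels * textures_per_pixel
    return mpixels, mtexels


for n in range(1, 5):
    mpix, mtex = fill_rates(n)
    print(f"{n} texture(s) per pixel: {mpix:.0f} MPixel/s, {mtex:.0f} MTexel/s")
```

Running the loop for one to four textures reproduces the 800/400 MPixel/s and 800/1,600/1,200/1,600 MTexel/s figures given above.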

Despite this similarity, GeForce3's Pixel Shader is considerably more advanced than GeForce2's NSR.

While the Pixel Shader might 'only' be able to fetch two texels per clock cycle, it allows up to four textures per pass. This is already an important difference and also shows how misleading raw fill rate numbers can be. GeForce2's NSR can only apply two textures per pixel. If you want to apply more, the pixel has to go through another rendering pass. GeForce3's pixel shader may require 2 clock cycles for three or four textures, but still only one pass. If you only take the fill rate into account, you will come to the conclusion that both situations are pretty much the same. GeForce2's NSR might require 2 passes for three or four textures per pixel, but each pass is done in one clock cycle, summing up to two clock cycles, which is identical to what GeForce3's pixel shader requires for three or four textures.

The difference cannot be seen if you only count clock cycles or check theoretical fill rates. The big difference is that GeForce3 saves valuable memory bandwidth, because it reads and writes the color value in the back buffer and the z-value in the z-buffer only once, while the two passes of GeForce2 require this procedure twice. If 32-bit color is used and the 3D chips are running at their theoretical fill rate limit (which is of course hypothetical), rendering three or four textures per pixel requires only 2 (1 x read + 1 x write) * 200 MHz / 2 clock cycles * 8 Byte (32 bit color + 32 bit Z) = 1,600 MB/s on GeForce3, while GeForce2 requires 3,200 MB/s. These figures don't take the texture reads into account, which are identical for both and increase the required bandwidth even more. This shows that GeForce3's pixel shader has a significant advantage over GeForce2's NSR once three or four textures are used per pixel. To achieve the maximum fill rate, GeForce2 would require 1,600 MB/s more memory bandwidth than GeForce3. Memory bandwidth has a hefty impact on fill rate, as we have pointed out numerous times in previous articles.
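The bandwidth comparison can be reproduced with a few lines of Python. This is a sketch that follows the formula as written in the text (it uses the 200 MHz clock directly and ignores texture reads); the function name and parameters are hypothetical and chosen only for this example.

```python
def framebuffer_bandwidth_mb_s(passes, cycles_per_pass, clock_mhz=200,
                               bytes_per_pixel=8, accesses_per_pass=2):
    """Color + Z traffic in MB/s, following the article's formula.

    bytes_per_pixel:   32 bit color + 32 bit Z = 8 bytes
    accesses_per_pass: 1 x read + 1 x write
    Hypothetical helper; texture reads are deliberately left out.
    """
    # Clock cycles spent on one finished pixel across all passes.
    total_cycles = passes * cycles_per_pass
    finished_mpixels = clock_mhz / total_cycles
    # Every pass reads and writes color and Z once.
    bytes_per_finished_pixel = passes * accesses_per_pass * bytes_per_pixel
    return finished_mpixels * bytes_per_finished_pixel


# Three or four textures per pixel:
geforce3 = framebuffer_bandwidth_mb_s(passes=1, cycles_per_pass=2)  # one pass, two clocks
geforce2 = framebuffer_bandwidth_mb_s(passes=2, cycles_per_pass=1)  # two passes, one clock each
print(geforce3, geforce2, geforce2 - geforce3)
```

Both chips finish the same number of pixels per second in this scenario; the extra traffic on GeForce2 comes entirely from touching the back buffer and z-buffer a second time.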

Pixel Shader Programming

We have learned that GeForce3's Pixel Shader can also be programmed, similar to the Vertex Shader. A pixel shader program can consist of at most 12 instructions: up to four texture address operations and up to eight blending operations. The pixel shader program is executed after the geometry has passed through the vertex shader. This enables the vertex shader to supply parameters for the pixel shader programs, as is done e.g. for dot product bump mapping with the 'per-vertex dot3 setup' executed in the vertex shader. This is the biggest attraction of the pixel shader: it can be 'driven' by the vertex shader.

A pixel shader program can have three types of instructions:

  1. Constant definitions for parameters, 8 constants c0..c7 are available
  2. Up to 4 texture address operations for fetching texels
  3. Up to 8 texture blending operations, combining texels, constant colors and iterated colors to produce color and alpha of the pixel
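The limits listed above can be expressed as a small check. This is purely an illustrative sketch: the function `check_limits` and its plain integer inputs are assumptions made for this example and do not correspond to any real shader tool or driver API.

```python
# Limits as stated in the article for a GeForce3 pixel shader program.
MAX_CONSTANTS = 8     # c0..c7
MAX_ADDRESS_OPS = 4   # texture address operations
MAX_BLEND_OPS = 8     # texture blending operations


def check_limits(num_constants, num_address_ops, num_blend_ops):
    """Return a list of limit violations (empty if the program fits).

    Hypothetical helper: a program is described only by its three counts.
    """
    problems = []
    if num_constants > MAX_CONSTANTS:
        problems.append(f"{num_constants} constants exceed the limit of {MAX_CONSTANTS}")
    if num_address_ops > MAX_ADDRESS_OPS:
        problems.append(f"{num_address_ops} address ops exceed the limit of {MAX_ADDRESS_OPS}")
    if num_blend_ops > MAX_BLEND_OPS:
        problems.append(f"{num_blend_ops} blending ops exceed the limit of {MAX_BLEND_OPS}")
    return problems


print(check_limits(2, 4, 8))  # a program that uses every available instruction slot
print(check_limits(2, 5, 8))  # one texture address operation too many
```

The four address operations and eight blending operations together account for the 12 instructions a program may contain.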

Each texture address operation uses a particular set of texture coordinates to

  • look up a filtered texel (classic)
  • use it as a vector
  • use it as a part of a matrix

The following list of texture address instructions should give interested readers some idea of how flexibly texture coordinates can be used.

Texture Address Instruction | Parameters | Explanation
tex | t0 | Just fetch a filtered texel color
texbem | tDest, tSrc0 | Bump Environment Map. U += 2x2 matrix( dU ); V += 2x2 matrix( dV ); then sample at ( U, V )
texbeml | tDest, tSrc0 | Bump Environment Map w/ Luminance. U += 2x2 matrix( dU ); V += 2x2 matrix( dV ); then sample at ( U, V ) & apply luminance
texcoord | tDest | Just turn the texture coordinate into a color
texkill | tDest | Kill any texels where at least one of s,t,r,q is < 0
texm3x2pad | t1, t0 | "Padding" instruction as part of the texm3x2tex instruction - performs a dot product of t0's color with these texture coordinates
texm3x2tex | t2, t0 | Take previous dot product from "pad" instruction as the S coordinate. Perform dot product of t0's color with this texture coordinate and use as T. Sample from a 2D texture using ( S, T )
texreg2ar | tDest, tSrc | Sample from ( tSrc.A, tSrc.R ). General dependent texture read operation; takes part of a color from the tSrc texture to use as S,T coordinates of the tDest texture fetch
texreg2gb | tDest, tSrc | Sample from ( tSrc.G, tSrc.B ). General dependent texture read operation; takes part of a color from the tSrc texture to use as S,T coordinates of the tDest texture fetch
texm3x3pad | t1, t0 | Padding for 3x3 matrix operation. Uses the 3D texture coordinate as a row of the matrix
texm3x3spec | t3, t0, c0 | Compute Non-Local Viewer Specular reflection about Normal from Normal Map
texm3x3vspec | t3, t0, c0 | Compute Local Viewer Specular reflection about Normal from Normal Map. Eye vector comes from q coordinates of the 3 sets of 4D textures
texm3x3mat | t3, t0, c0 | Rotate vector through 3x3 matrix, then sample a CubeMap or 3D texture
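As a rough illustration of the texm3x2pad / texm3x2tex pair listed above, the sketch below computes the two dot products in Python. The function names, the use of plain tuples, and the assumption that the color components of t0 are taken directly as a 3-component vector are all illustrative and do not describe the hardware's actual data path.

```python
def dot3(a, b):
    """Dot product of two 3-component vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def texm3x2(t0_color, coords_stage1, coords_stage2):
    """Return the (S, T) pair used to sample the 2D texture of stage 2.

    Hypothetical model: t0_color is the (r, g, b) fetched in stage 0,
    coords_stage1 / coords_stage2 are the 3D texture coordinates of
    stages 1 and 2, treated as the two rows of a 3x2 matrix.
    """
    s = dot3(t0_color, coords_stage1)  # texm3x2pad t1, t0
    t = dot3(t0_color, coords_stage2)  # texm3x2tex t2, t0
    return s, t


print(texm3x2((1.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.25, 0.0, 0.0)))
```

The "pad" instruction does no texture fetch of its own; it only contributes the first dot product, which is why it occupies a texture stage without sampling.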

The two instructions 'texreg2ar' and 'texreg2gb' enable general dependent texture read operations, as used in particular for environment mapped bump mapping, which we know from Matrox's G400 and ATi's Radeon. It is now supported by an NVIDIA chip as well.
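A dependent texture read simply means that the color fetched from one texture is reused as the coordinates for a second fetch. A minimal Python sketch of the idea follows, assuming textures stored as rows of (r, g, b, a) tuples with components in the range 0 to 1 and nearest-neighbour lookup instead of filtering; the helper names are made up for this example.

```python
def sample(texture, s, t):
    """Nearest-neighbour lookup; texture is a list of rows of RGBA tuples.

    Assumes s and t lie in [0, 1]; real hardware would filter the result.
    """
    height = len(texture)
    width = len(texture[0])
    x = min(int(s * width), width - 1)
    y = min(int(t * height), height - 1)
    return texture[y][x]


def texreg2ar(dest_texture, src_color):
    """Use alpha and red of the source color as (S, T) for the second fetch."""
    r, g, b, a = src_color
    return sample(dest_texture, a, r)


def texreg2gb(dest_texture, src_color):
    """Use green and blue of the source color as (S, T) for the second fetch."""
    r, g, b, a = src_color
    return sample(dest_texture, g, b)


# A 2x2 destination texture with one distinct color per texel.
dest = [[(1, 0, 0, 1), (0, 1, 0, 1)],
        [(0, 0, 1, 1), (1, 1, 1, 1)]]
src_color = (0.9, 0.9, 0.1, 0.1)  # color previously fetched from the tSrc texture

print(texreg2ar(dest, src_color))
print(texreg2gb(dest, src_color))
```

The two instructions differ only in which pair of color channels is reinterpreted as texture coordinates.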