High-Tech And Vertex Juggling - NVIDIA's New GeForce3 GPU

2 Textures Per Clock Cycle, But 4 Textures Per Pass?

From a brute-force hardware point of view, the pixel shader is pretty similar to the NSR of GeForce2. It can fetch two texels per clock cycle, so if 3 or 4 textures are used it requires 2 clock cycles. If you combine that with GeForce3's clock frequency of 200 MHz and remember that GeForce3 has four pixel shader units, you will come up with a fill rate of 800 MPixel/s for one or two textures and 400 MPixel/s if 3 or 4 textures are used for one pixel. The texel fill rate is 1,600 MTexel/s for 2 or 4 textures per pixel, 800 MTexel/s for one texture per pixel and 1,200 MTexel/s for three textures per pixel. These are the same raw fill rate numbers as found in a GeForce2 GTS.
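The fill rate figures above follow from three numbers in the text: 200 MHz, four pixel shader units, and two texels per clock per unit. A minimal sketch of that arithmetic is shown here; the function and constant names are made up for illustration.

```python
# Theoretical fill rates from the figures quoted in the article.
CORE_CLOCK_MHZ = 200    # GeForce3 core clock
PIXEL_PIPELINES = 4     # four pixel shader units
TEXELS_PER_CLOCK = 2    # texels fetched per unit per clock cycle


def clocks_per_pixel(textures):
    """Clock cycles one unit needs for a pixel with `textures` textures."""
    # Ceiling division: 1-2 textures take 1 clock, 3-4 textures take 2.
    return max(1, -(-textures // TEXELS_PER_CLOCK))


def pixel_fill_rate(textures):
    """Theoretical pixel fill rate in MPixel/s."""
    return PIXEL_PIPELINES * CORE_CLOCK_MHZ / clocks_per_pixel(textures)


def texel_fill_rate(textures):
    """Theoretical texel fill rate in MTexel/s."""
    return pixel_fill_rate(textures) * textures


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, "textures:", pixel_fill_rate(n), "MPixel/s,",
              texel_fill_rate(n), "MTexel/s")
```

The three-texture case is the odd one out: the second clock cycle fetches only one texel, so the texel rate drops to 1,200 MTexel/s.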

Despite this similarity, GeForce3's Pixel Shader is considerably more advanced than GeForce2's NSR.

While the Pixel Shader might 'only' be able to fetch two texels per clock cycle, it allows up to four textures per pass. This is already an important difference and also shows how misleading raw fill rate numbers can be. GeForce2's NSR can only apply two textures per pixel. If you want to apply more, the pixel has to go through another rendering pass. GeForce3's pixel shader may require 2 clock cycles for three or four textures, but still only one pass. Now if you only take the fill rate into account, you will come to the conclusion that both situations are pretty much the same. GeForce2's NSR might require 2 passes for three or four textures per pixel, but each pass is done in one clock cycle, thus summing up to two clock cycles, which is identical to what GeForce3's pixel shader requires for three or four textures as well.

The difference cannot be seen if you only count clock cycles or check theoretical fill rates. The big difference is that GeForce3 saves valuable memory bandwidth, because it reads and writes the color value in the back buffer and the z-value in the z-buffer only once, while the two passes of GeForce2 require this procedure twice. If 32 bit color is used and the 3D chips are running at their theoretical fill rate limit (which is of course hypothetical), GeForce3 requires for the rendering of three or four textures per pixel only 2 (1 x read + 1 x write) * 200 MHz / 2 clock cycles * 8 Byte (32 bit color + 32 bit Z) = 1,600 MB/s, while GeForce2 requires 3,200 MB/s. These figures don't take the texture reads into account, which are identical for both chips but increase the required bandwidth even more. This shows that GeForce3's pixel shader has a significant advantage over GeForce2's NSR once three or four textures are used per pixel. To achieve the maximum fill rate, GeForce2 would require 1,600 MB/s more memory bandwidth than GeForce3. Memory bandwidth has a hefty impact on fill rate, as we have pointed out numerous times in previous articles.
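The bandwidth comparison can be reproduced with a few lines of arithmetic. The sketch below mirrors the formula exactly as it is written in the text, which does not multiply by the number of pixel pipelines, and it ignores texture reads just as the text does. The function name and parameters are invented for this example.

```python
# Frame buffer traffic for a pixel with three or four textures,
# following the formula in the article (texture reads not included).
CORE_CLOCK_MHZ = 200
BYTES_PER_ACCESS = 8        # 32 bit color + 32 bit Z
ACCESSES_PER_PASS = 2       # 1 x read + 1 x write


def framebuffer_bandwidth_mb_s(passes, clocks_per_pixel=2):
    """Color/Z bandwidth in MB/s at the theoretical fill rate limit."""
    # Both chips spend two clock cycles in total on such a pixel.
    mpixels_per_second = CORE_CLOCK_MHZ / clocks_per_pixel
    bytes_per_pixel = passes * ACCESSES_PER_PASS * BYTES_PER_ACCESS
    return mpixels_per_second * bytes_per_pixel


if __name__ == "__main__":
    geforce3 = framebuffer_bandwidth_mb_s(passes=1)  # one pass, two clocks
    geforce2 = framebuffer_bandwidth_mb_s(passes=2)  # two passes, one clock each
    print("GeForce3:", geforce3, "MB/s")
    print("GeForce2:", geforce2, "MB/s")
    print("Difference:", geforce2 - geforce3, "MB/s")
```

The only input that differs between the two chips is the number of passes, which is why identical clock counts still lead to a factor of two in color and Z traffic.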

Pixel Shader Programming

We have learned that GeForce3's Pixel Shader can also be programmed, similar to the Vertex Shader. A pixel shader program can consist of at most 12 instructions, of which up to four can be texture address operations and up to eight can be blending operations. The data reaches the pixel shader after it has passed through the vertex shader. This enables the vertex shader to supply parameters for the pixel shader programs, as is done e.g. for dot product bump mapping with the 'per-vertex dot3 setup' executed in the vertex shader. This is the biggest asset of the pixel shader: it can be 'driven' by the vertex shader.

A pixel shader program can have three types of instructions:

  1. Constant definitions for parameters, 8 constants c0..c7 are available
  2. Up to 4 texture address operations for fetching texels
  3. Up to 8 texture blending operations, combining texels, constant colors and iterated colors to produce color and alpha of the pixel
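The limits listed above (8 constants, 4 texture address operations, 8 blending operations) can be expressed as a simple check. This is only an illustrative sketch with a made-up helper name; it is not how a driver or shader assembler actually validates programs.

```python
# Limits of a GeForce3 pixel shader program as stated in the article.
MAX_CONSTANTS = 8              # c0..c7
MAX_TEXTURE_ADDRESS_OPS = 4
MAX_BLENDING_OPS = 8


def check_limits(constants, texture_address_ops, blending_ops):
    """Return a list of violated limits (empty if the program fits)."""
    problems = []
    if constants > MAX_CONSTANTS:
        problems.append(
            f"{constants} constants, limit is {MAX_CONSTANTS}")
    if texture_address_ops > MAX_TEXTURE_ADDRESS_OPS:
        problems.append(
            f"{texture_address_ops} texture address operations, "
            f"limit is {MAX_TEXTURE_ADDRESS_OPS}")
    if blending_ops > MAX_BLENDING_OPS:
        problems.append(
            f"{blending_ops} blending operations, "
            f"limit is {MAX_BLENDING_OPS}")
    return problems


if __name__ == "__main__":
    print(check_limits(2, 4, 8))   # a program that uses every slot
    print(check_limits(2, 5, 8))   # one texture address operation too many
```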

Each texture address operation uses a particular set of texture coordinates to

  • look up a filtered texel (classic)
  • use it as a vector
  • use it as a part of a matrix

The following list of texture address instructions should give those of you who are interested some idea of how flexibly texture coordinates can be used.

Texture Address Instruction   Parameters       Explanation

tex                           t0               Just fetch a filtered texel color

texbem                        tDest, tSrc0     Bump Environment Map
                                               U += 2x2 matrix( dU )
                                               V += 2x2 matrix( dV )
                                               Then Sample at ( U, V )

texbeml                       tDest, tSrc0     Bump Environment Map w/ Luminance
                                               U += 2x2 matrix( dU )
                                               V += 2x2 matrix( dV )
                                               Then Sample at ( U, V ) & Apply Luminance

texcoord                      tDest            Just turn the texture coordinate into a color

texkill                       tDest            Kill any texels where at least one of s,t,r,q is < 0

texm3x2pad                    t1, t0           "padding" instruction as part of the texm3x2tex instruction -
                                               performs a dot product of t0's color with these texture coordinates

texm3x2tex                    t2, t0           Take previous dot product from "pad" instruction as the S coordinate
                                               Perform dot product of t0's color with this texture coordinate and use as T
                                               Sample from a 2D texture using ( S, T )

texreg2ar                     tDest, tSrc      Sample from ( tSrc.A, tSrc.R )
                                               General dependent texture read operations, takes part of a color from
                                               the tSrc texture to use as S,T coordinates of the tDest texture fetch.

texreg2gb                     tDest, tSrc      Sample from ( tSrc.G, tSrc.B )
                                               General dependent texture read operations, takes part of a color from
                                               the tSrc texture to use as S,T coordinates of the tDest texture fetch.

texm3x3pad                    t1, t0           Padding for 3x3 matrix operation
                                               Uses the 3D texture coordinate as a row of the matrix

texm3x3spec                   t3, t0, c0       Compute Non-Local Viewer Specular reflection about Normal from Normal Map

texm3x3vspec                  t3, t0, c0       Compute Local Viewer Specular reflection about Normal from Normal Map
                                               Eye vector comes from q coordinates of the 3 sets of 4D textures

texm3x3mat                    t3, t0, c0       Rotate vector through 3x3 matrix, then sample a CubeMap or 3D texture
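To make the 'texm3x2pad' / 'texm3x2tex' pair from the list more concrete, here is a rough software sketch of what the two instructions compute together. It is a simplification under several assumptions: the texture is a small nested list, sampling is nearest-neighbour instead of filtered, and all function names are invented for this example.

```python
# Software sketch of the texm3x2pad + texm3x2tex pair: two dot products
# produce the (S, T) coordinates for a dependent 2D texture fetch.

def dot3(a, b):
    """Dot product of two 3-component vectors."""
    return sum(x * y for x, y in zip(a, b))


def sample_nearest(texture, s, t):
    """Nearest-neighbour lookup in a 2D texture (list of rows), clamped."""
    height, width = len(texture), len(texture[0])
    x = min(width - 1, max(0, int(s * width)))
    y = min(height - 1, max(0, int(t * height)))
    return texture[y][x]


def texm3x2(t0_color, coords_t1, coords_t2, texture):
    """Emulate texm3x2pad t1, t0 followed by texm3x2tex t2, t0."""
    s = dot3(t0_color, coords_t1)   # "pad" step: first matrix row gives S
    t = dot3(t0_color, coords_t2)   # "tex" step: second matrix row gives T
    return sample_nearest(texture, s, t)


if __name__ == "__main__":
    texture = [["a", "b"],
               ["c", "d"]]
    # t0's color acts as a vector; the two coordinate sets act as the
    # two rows of a 3x2 matrix.
    print(texm3x2((1.0, 0.0, 0.0), (0.75, 0.0, 0.0), (0.25, 0.0, 0.0), texture))
```

The point of the sketch is that the texture coordinates of stages t1 and t2 are not used as lookup positions at all; they serve as matrix rows applied to the color fetched in stage t0.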

The two instructions 'texreg2ar' and 'texreg2gb' enable general dependent texture read operations, as used in particular for environment mapped bump mapping, which we know from Matrox's G400 and ATi's Radeon. It is now supported by an NVIDIA chip as well.
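A dependent texture read simply means that the color fetched from one texture is reused as the lookup position in another. The following sketch emulates 'texreg2ar' and 'texreg2gb' in software; it assumes colors are (R, G, B, A) tuples in the range 0..1 and uses nearest-neighbour sampling on a nested list, and the helper names are made up for illustration.

```python
# Software sketch of dependent texture reads: part of the source color
# becomes the (S, T) coordinate of the destination texture fetch.

def sample_nearest(texture, s, t):
    """Nearest-neighbour lookup in a 2D texture (list of rows), clamped."""
    height, width = len(texture), len(texture[0])
    x = min(width - 1, max(0, int(s * width)))
    y = min(height - 1, max(0, int(t * height)))
    return texture[y][x]


def texreg2ar(src_rgba, dest_texture):
    """Sample dest_texture at ( src.A, src.R )."""
    r, _g, _b, a = src_rgba
    return sample_nearest(dest_texture, a, r)


def texreg2gb(src_rgba, dest_texture):
    """Sample dest_texture at ( src.G, src.B )."""
    _r, g, b, _a = src_rgba
    return sample_nearest(dest_texture, g, b)


if __name__ == "__main__":
    dest = [["a", "b"],
            ["c", "d"]]
    src_color = (0.9, 0.9, 0.1, 0.1)   # R, G, B, A fetched from tSrc
    print(texreg2ar(src_color, dest))
    print(texreg2gb(src_color, dest))
```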
