Pixel Processing Engine
Each of the 16 pixel pipelines has two shader units (a superscalar design) and one floating-point texture processor. NV40 also comes with four L1 texture caches, each of which serves four pipelines, and a large L2 cache further relieves the memory interface. The shader units follow a true SIMD (single instruction, multiple data) design. The first shader unit of every pipe can handle arithmetic operations as well as texture reads and normalization, while the second unit is limited to arithmetic. In other words, Shader Unit 1 is "shared" with texturing: when it is not texturing in a given pass, it is available for pixel shading. Shader Unit 2 is always available for pixel shading.
Simply put, the NV40 usually acts like a 16-pipe design (16x1, classical texture mapping with color and Z). Imagine, for example, a shader with an arithmetic-to-texture ratio of 4:1. In such a scenario, Shader Unit 1 could spend 75% of its time on arithmetic, while Shader Unit 2 does 100% arithmetic. With each shader unit handling up to four components (R, G, B and alpha) per clock, that works out to 3 + 4 ops: in this example, one pixel pipe can calculate 7 ops/clock.
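The 7 ops/clock figure can be reproduced with a little arithmetic. The sketch below assumes that each shader unit can issue up to four component operations per clock (an assumption that matches the eight operations per pixel and clock quoted for the two units together) and that texturing simply takes up the rest of Shader Unit 1's time. The function and parameter names are made up for illustration and are not NVIDIA terminology.

```python
# Assumption: one op per component (R, G, B, alpha) per shader unit per clock.
OPS_PER_UNIT_PER_CLOCK = 4


def pipe_ops_per_clock(unit1_arith_fraction, ops_per_unit=OPS_PER_UNIT_PER_CLOCK):
    """Estimate arithmetic ops per clock for a single pixel pipeline.

    unit1_arith_fraction: share of Shader Unit 1's time spent on arithmetic
    (the remainder is assumed to go to texture work).
    """
    if not 0.0 <= unit1_arith_fraction <= 1.0:
        raise ValueError("unit1_arith_fraction must be between 0 and 1")
    unit1_ops = unit1_arith_fraction * ops_per_unit  # shared with texturing
    unit2_ops = ops_per_unit                         # always arithmetic
    return unit1_ops + unit2_ops


# The example from the text: Shader Unit 1 spends 75% of its time on arithmetic.
print(pipe_ops_per_clock(0.75))  # 7.0
```

Setting the fraction to 1.0 (no texturing at all) gives the peak of 8 ops/clock per pipe under the same assumption.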
In the case of shaders, we have to differentiate between instructions and operations (ops). Instructions define functions that are to be applied to certain components (R, G, B or alpha) of a pixel. The shader units then carry out their calculations (ops) according to these instructions.
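To make the distinction concrete, here is a deliberately simplified model in which every component an instruction touches counts as one op. The `Instruction` type, the opcode names and the counting rule are all assumptions made for illustration; real hardware may count some instructions differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    opcode: str      # illustrative name only, e.g. "mul" or "add"
    components: str  # which of "r", "g", "b", "a" the instruction applies to


def op_count(instr):
    """Count ops for one instruction: one op per affected component (assumed)."""
    comps = set(instr.components)
    if not comps or not comps <= set("rgba"):
        raise ValueError("components must be a non-empty subset of 'rgba'")
    return len(comps)


# Two instructions: one working on the color channels, one on alpha.
shader = [Instruction("mul", "rgb"), Instruction("add", "a")]
total_ops = sum(op_count(i) for i in shader)
print(len(shader), "instructions ->", total_ops, "ops")  # 2 instructions -> 4 ops
```

In this model a single instruction can stand for anywhere between one and four ops, which is why instruction counts and op counts are quoted separately.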
NV40 can carry out 4 or more instructions per pixel and 8 or more operations per pixel per clock cycle. According to NVIDIA, ATi's R3xx series can carry out only 2 instructions per pixel and a maximum of 4 operations per pixel per clock cycle.
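Taken at face value, these per-pixel figures can be scaled up to the whole chip. The sketch below assumes that each of the 16 pipelines works on one pixel per clock and treats NVIDIA's "8 or more" as a lower bound; the constant and function names are illustrative, and the R3xx number is NVIDIA's claim rather than an independent measurement.

```python
NV40_PIPELINES = 16
NV40_OPS_PER_PIXEL_CLOCK = 8  # "8 or more" in the text, used as a lower bound
R3XX_OPS_PER_PIXEL_CLOCK = 4  # maximum, according to NVIDIA


def chip_ops_per_clock(pipelines, ops_per_pixel_clock):
    """Aggregate ops per clock, assuming one pixel per pipeline per clock."""
    return pipelines * ops_per_pixel_clock


per_pixel_ratio = NV40_OPS_PER_PIXEL_CLOCK / R3XX_OPS_PER_PIXEL_CLOCK
nv40_total = chip_ops_per_clock(NV40_PIPELINES, NV40_OPS_PER_PIXEL_CLOCK)

print("NV40 vs. R3xx, ops per pixel per clock:", per_pixel_ratio)  # 2.0
print("NV40 ops per clock across all pipes (at least):", nv40_total)  # 128
```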
In short, I think it's safe to say that the NV40's pixel shader engine is blazingly fast and highly efficient.
Here's a summary of the new features of the pixel shader engine:
- Full support for Shader Model 3.0
- 2^16 (65,535) length pixel shader programs - shatters PS2.0 limit of 96
- Dynamic Flow control - Loops & Branching, Call & Return, Subroutines
- Highest precision pixel shading - Native/optimized FP32 processing
- Flexible data type support - FP32, FP16 operand & texture formats
- Full speed non-power of 2 textures with mipmapping
- Multiple Render Target Support
- Centroid Sampling AA Support