R600: Finally DX10 Hardware from ATI

Z Buffers And HiZ

For this new series, ATI doubled its Z rate to 32 pixels per clock, whether multisampling is on or not. The HD 2600 and HD 2400 manage only eight pixels per clock because they have a single unit, compared to four on the HD 2900. For DX10, ATI added new blendable and displayable formats, including 128-bit and 11:11:10 floating point. Under R600, eight multiple render targets (MRT) can be bound simultaneously, which allows a shader to write multiple outputs. In some cases, data can be saved and reused rather than recomputed across multiple passes, thereby reducing the instruction count. Why compute something again when you can store it once and reuse it later? Additionally, the amount of data that can be carried between passes at full precision was increased to 1K bits (i.e., Float_4 * 8).
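The 1K-bit figure follows from simple arithmetic, sketched below. The function and constant names are illustrative only and do not come from ATI; the sketch assumes each Float_4 channel is a 32-bit float.

```python
# Per-pixel output budget when every render target is bound.
BITS_PER_FLOAT = 32        # assumption: each Float_4 channel is a 32-bit float
CHANNELS_PER_TARGET = 4    # Float_4: four channels per target
MAX_RENDER_TARGETS = 8     # MRT limit quoted for R600


def bits_per_pixel(targets=MAX_RENDER_TARGETS,
                   channels=CHANNELS_PER_TARGET,
                   bits=BITS_PER_FLOAT):
    """Total bits a shader can write per pixel across all bound targets."""
    return targets * channels * bits


def packed_11_11_10_bits():
    """Size of one texel in the 11:11:10 floating point format."""
    return 11 + 11 + 10


if __name__ == "__main__":
    print(bits_per_pixel())           # eight Float_4 targets
    print(bits_per_pixel(targets=1))  # a single 128-bit target
    print(packed_11_11_10_bits())     # packed three-channel float format
```

Eight Float_4 targets come to 1024 bits per pixel, which matches the 1K-bit figure, and a single Float_4 target is the 128-bit format mentioned above.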

Not only has ATI doubled the number of Z buffers, it has also made them more complex. Z-range optimization is a part of DX10. This means that you can narrow the Z buffer test to a specific depth range, which can be beneficial for stencil shadows. Compression was doubled as well over the previous generation. In standard mode, the HD 2000 series can handle 16:1 compression versus 8:1 on the X1000 series, and it can now compress stencil values. It can reach up to 128:1 with 8x MSAA. Additionally, support was added for 32-bit float and integer formats. According to ATI, shadow performance on the HD 2900 is almost 2x that of the X1950. It is worth noting that this figure is based on early beta drivers. If history repeats itself, ATI will most likely make further optimizations, so we should expect performance to increase with all forms of shadowing, including soft shadows. Currently there are several Catalyst driver versions in circulation. The one now with system integrators is the next WHQL candidate for Catalyst 7.5. That driver should be available around the 23rd.
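The idea of testing only a narrow Z range can be sketched as follows. This is a minimal illustration, not ATI's implementation: it assumes the test compares the depth already stored in the buffer against a caller-supplied range, which is how depth-bounds style tests are commonly described. The function names are hypothetical.

```python
def depth_in_range(stored_depth, z_min, z_max):
    """True if the depth already in the buffer lies inside [z_min, z_max]."""
    return z_min <= stored_depth <= z_max


def pixels_to_process(depth_buffer, z_min, z_max):
    """Indices of pixels whose stored depth falls in the range.

    Pixels outside the range are skipped before any stencil or
    shading work is done for them (assumed behaviour).
    """
    return [i for i, d in enumerate(depth_buffer)
            if depth_in_range(d, z_min, z_max)]


if __name__ == "__main__":
    depth_buffer = [0.1, 0.4, 0.5, 0.9]
    # Only pixels that could lie inside a shadow volume spanning
    # depths 0.3 to 0.6 are kept.
    print(pixels_to_process(depth_buffer, 0.3, 0.6))
```

For stencil shadows, the range would typically be set to the depth extent of the light's influence, so pixels that cannot be affected are rejected cheaply.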

In addition to hierarchical Z (HiZ), which existed in the previous hardware generation, hierarchical stencil (HiS) was added to provide better stencil performance. HiS was implemented on the Xbox 360's Xenos chip and brought over into R600. Much as HiZ does for depth, it can cull unnecessary stencil writes.
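For readers unfamiliar with hierarchical culling, here is a minimal sketch of the general HiZ idea. It does not reflect R600's actual layout: it assumes a "less than" depth test where smaller values are closer, and the tile size and function names are chosen purely for illustration.

```python
def build_hiz(depth, width, height, tile=2):
    """Build a coarse map of the farthest depth stored in each tile.

    depth is a row-major list of per-pixel depths. The result maps
    (tile_x, tile_y) to the maximum depth found in that tile.
    """
    hiz = {}
    for y in range(height):
        for x in range(width):
            key = (x // tile, y // tile)
            d = depth[y * width + x]
            hiz[key] = max(hiz.get(key, 0.0), d)
    return hiz


def tile_rejected(hiz, key, incoming_min_depth):
    """True if incoming geometry cannot pass a 'less than' test in the tile.

    If even its closest point is at or behind the farthest stored depth,
    every pixel in the tile would fail, so the tile is skipped in one test.
    """
    return incoming_min_depth >= hiz[key]


if __name__ == "__main__":
    depth = [0.2, 0.3, 0.9, 0.8,
             0.25, 0.1, 0.7, 0.95]
    hiz = build_hiz(depth, width=4, height=2)
    print(tile_rejected(hiz, (0, 0), 0.5))  # left tile: everything stored is closer
    print(tile_rejected(hiz, (1, 0), 0.5))  # right tile: some pixels may pass
```

A hierarchical stencil scheme would apply the same coarse-test idea to stencil values instead of depth.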

The Z buffer is now separate from stencil, and the two are compressed separately for better performance. In past architectures, they were stored together for games like Doom, where both stencil and depth values are needed. Storing them together could cause excessive decompression, and the new design alleviates this issue. The 500 series processors had on-chip storage for the compression data, which told the processor what the compression state was for each section of the screen. The problem arose when the resolution went beyond what that storage had been designed for. The processor would run out of space, so not all areas of the screen could be tracked as compressed in the fixed local memory. The same could happen with applications that create many large Z buffers. To get around these physical limits, the compression state was virtualized so it can reside in cache, local memory or system addressable memory. Blocks are now cached for the buffers and regions the chip is currently working on. That allows a high degree of compression, especially at high resolution. The benefit should show up above 1600x1200 or in applications that use a lot of buffers. This should be true for color and, more importantly, for Z.
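The fixed-storage problem can be illustrated with a little arithmetic. Everything numeric here is hypothetical: the article gives neither the tile size nor the capacity of the 500 series' on-chip table, so the 8x8 tile and the 30,000-entry capacity below are placeholders chosen only to show the effect.

```python
import math


def tiles_needed(width, height, tile_size=8):
    """Number of screen tiles whose compression state must be tracked.

    tile_size is a hypothetical value, not a published hardware figure.
    """
    return math.ceil(width / tile_size) * math.ceil(height / tile_size)


def untracked_tiles(width, height, capacity_tiles, tile_size=8):
    """Tiles that do not fit in a fixed on-chip table of capacity_tiles entries.

    With fixed storage, these tiles could not be kept compressed.
    """
    return max(0, tiles_needed(width, height, tile_size) - capacity_tiles)


if __name__ == "__main__":
    capacity = 30000  # hypothetical table size, sized for 1600x1200
    print(untracked_tiles(1600, 1200, capacity))  # fits exactly
    print(untracked_tiles(2560, 1600, capacity))  # overflows the table
```

Virtualizing the table removes the hard capacity: the state for any tile can live in memory and be cached on demand, so the overflow case no longer forces regions to go uncompressed.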

Re-Z

Re-Z is the ability to do multiple Z passes within one rendering pass. Normally a Z check is done once, either before or after shading. Now it can be done both before and after. This can be necessary because shading can affect the Z value. A pixel might be killed in flight in the shader, or an alpha test might be applied at the edge of some foliage. When that was detected in the past, ATI fell back to late Z, testing only after the shader had run and any pixel kills were resolved. Now the test can be done before shading as well, in a non-destructive way, by asking whether the pixel is going to be displayed. If it is going to be displayed, the pixel shader will shade it and a Z test can be done later. If it is not going to be displayed, it can be killed. When the pixel is going to be displayed, the Z buffer will not be updated until the process is finished. Basically, there is some look-ahead to skip useless shading work, while still allowing the values to be modified when applicable. It is more efficient than the previous approach and, according to ATI, can be up to 15% faster.
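The two-pass flow described above can be sketched roughly as follows. This is an interpretation of the description, not ATI's hardware logic. It assumes a "less than" depth test and a shader modelled as a function returning a (killed, output_depth) pair; both the function name re_z and that shader interface are invented for illustration.

```python
def re_z(fragment_depth, stored_depth, shader):
    """Return (new_stored_depth, shader_ran) for one fragment.

    Simplification: the early test uses the incoming depth, so this
    sketch is only valid for shaders that kill pixels or push depth
    farther away, never closer.
    """
    # Pass 1: early, read-only test. The Z buffer is not updated here.
    if fragment_depth >= stored_depth:
        return stored_depth, False      # hidden: skip shading entirely

    killed, out_depth = shader(fragment_depth)
    if killed:
        return stored_depth, True       # shader discarded it: Z untouched

    # Pass 2: late test on the final depth, then commit the Z write.
    if out_depth < stored_depth:
        return out_depth, True
    return stored_depth, True


if __name__ == "__main__":
    opaque = lambda z: (False, z)       # shader that leaves depth alone
    alpha_kill = lambda z: (True, z)    # shader that discards the pixel

    print(re_z(0.4, 0.6, opaque))       # visible: shaded and written
    print(re_z(0.8, 0.6, opaque))       # hidden: culled before shading
    print(re_z(0.4, 0.6, alpha_kill))   # killed: shaded, Z left unchanged
```

The key point is that the early pass never writes Z, so a pixel that is later killed cannot leave a stale depth value behind.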