R600: Finally DX10 Hardware from ATI

Memory Read/Write Cache

Among the new caches in the R600 design is a memory read/write cache for the general purpose register (GPR) array. DX10 set out to "virtualize" the available resources, effectively making them larger than they were before, and in keeping with that theme ATI needed to virtualize the GPR stack. Under the DX9 API, a thread had access to only 16 or 32 GPRs, and R5xx already went beyond that. To virtualize its GPRs, ATI created a bidirectional read/write cache that sits in parallel with the vertex cache and the texture cache. This allows the shader core to write to memory and read the data back.

The cache can also handle write combining and other enhancements to improve performance. Write combining is the ability to group data together before sending it to memory. It saves on write commands and should benefit the GS Stream Out functionality. Under this new setup, each pixel can have access to 4K 128-bit vectors, or 64 kB of data per pixel. With tens of thousands of pixels in flight in the shader core, it would not be possible to hold all of that data in registers. Again, this is why virtualization is so important in DX10.
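The 64 kB figure, and the reason it cannot all live in registers, can be checked with a quick calculation. This is only a back-of-the-envelope sketch: the 4K and 128-bit values come from the text, while the pixel count of 20,000 is an assumed stand-in for "tens of thousands" and is not an R600 specification.

```python
# Back-of-the-envelope check of the per-pixel GPR figure quoted above.
VECTORS_PER_PIXEL = 4 * 1024   # "4K" vectors per pixel, from the text
BITS_PER_VECTOR = 128          # vector width, from the text

bytes_per_vector = BITS_PER_VECTOR // 8
bytes_per_pixel = VECTORS_PER_PIXEL * bytes_per_vector

# ASSUMPTION: the article only says "tens of thousands" of pixels in flight;
# 20,000 is an illustrative value, not a documented hardware number.
PIXELS_IN_FLIGHT = 20_000
worst_case_bytes = PIXELS_IN_FLIGHT * bytes_per_pixel

print(f"{bytes_per_pixel // 1024} KiB per pixel")
print(f"{worst_case_bytes / 2**30:.2f} GiB if every pixel used its full allotment")
```

Under that assumed pixel count the worst case comes to more than a gigabyte, which would have to be backed by memory rather than held in the register array.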

Stream Out builds on a capability introduced with the R500 called render-to-vertex buffer. This can now be done after geometry shader processing by streaming the data out directly from the shader. Vertex data can be written out of the shader and then circulated back through for tessellation or any other extra processing. It can also be done via thread communication. Here one thread can write the data out and have the next thread read it back in, do a render-to-vertex buffer, or overflow the GS data. This can only be done if the GPR stack is virtualized.
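Write combining, mentioned earlier as a benefit for Stream Out, can be illustrated with a toy software model. This is a sketch only and does not describe how the R600 hardware implements it: the function name `combine_writes` and the 64-byte burst size `BURST_BYTES` are hypothetical choices made for the example.

```python
from typing import List, Tuple

# ASSUMPTION: hypothetical burst size chosen for illustration, not an R600 figure.
BURST_BYTES = 64


def combine_writes(writes: List[Tuple[int, bytes]]) -> List[Tuple[int, bytes]]:
    """Merge writes to adjacent addresses into bursts of up to BURST_BYTES.

    `writes` holds (address, data) pairs in issue order. A write is merged only
    if it starts exactly where the pending burst ends and the result still fits.
    A single write larger than BURST_BYTES is passed through unchanged.
    """
    bursts: List[Tuple[int, bytes]] = []
    pending_addr = None
    pending = bytearray()
    for addr, data in writes:
        contiguous = pending_addr is not None and addr == pending_addr + len(pending)
        fits = len(pending) + len(data) <= BURST_BYTES
        if contiguous and fits:
            pending += data
        else:
            # Flush the burst built so far and start a new one at this address.
            if pending_addr is not None:
                bursts.append((pending_addr, bytes(pending)))
            pending_addr, pending = addr, bytearray(data)
    if pending_addr is not None:
        bursts.append((pending_addr, bytes(pending)))
    return bursts


# Eight 16-byte (128-bit) vectors written back to back, roughly as a shader
# streaming out vertex data might issue them.
vector_writes = [(0x1000 + 16 * i, bytes([i]) * 16) for i in range(8)]
bursts = combine_writes(vector_writes)
print(len(vector_writes), "writes ->", len(bursts), "bursts")
```

In this model the eight individual vector writes collapse into two memory transactions, which is the saving in write commands that the text describes.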