R600: Finally DX10 Hardware from ATI

Command Processor (CP)

According to ATI, the command processor (CP) for the R600 series is a full processor. It is not an x86 processor, but it is a full microcode processor. It has full access to memory, can perform math computations and can do its own thinking. In theory, this means the command processor could offload work from the driver, since it can download microcode to execute real types of instructions.

The CP parses the command stream sent by the driver and does some of the thinking associated with it. It then synchronizes some of the elements inside the chip and even validates some of the command stream. Part of the pipeline was designed to do some of this validation, such as understanding which rendering mode state the chip is currently in.

"The whole chip was designed to alleviate all of this work from the driver," Demers said. "In the past the drivers have (even on the 500 series) typically checked and went 'okay, the program is asking for all of these things to be set up and some of them are conflicting.'"

If there is a conflict over resources and states, the driver has to go off and think about what to do next. What you need to remember is that the driver runs on the CPU. The inherent problem with the driver thinking is that it steals cycles from the CPU, which could be doing other, more valuable things. ATI claims to have moved almost all of that work, and more importantly all of the validation work, down into the hardware.
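To make the idea of validation in the command processor concrete, here is a minimal toy sketch. It is not the R600 packet format: the command names (`SET_STATE`, `DRAW`), the state keys and the single conflict rule are all invented for illustration. It only shows a stream being parsed and draws being checked against the current state on the "chip" side instead of in driver code on the CPU.

```python
# Toy model only: command names, state keys and the conflict rule are
# hypothetical and do not describe real R600 hardware or microcode.

def find_conflict(state):
    """Return a description of a conflicting state combination, or None."""
    if state.get("depth_write") and not state.get("depth_buffer"):
        return "depth write enabled without a depth buffer bound"
    return None


class CommandProcessor:
    """Stands in for the on-chip CP: parses the stream and validates draws."""

    def __init__(self):
        self.state = {}
        self.draws_executed = 0
        self.rejected = []

    def run(self, stream):
        for op, arg in stream:
            if op == "SET_STATE":
                key, value = arg
                self.state[key] = value
            elif op == "DRAW":
                # Validation happens here, not in driver code on the CPU.
                problem = find_conflict(self.state)
                if problem:
                    self.rejected.append((arg, problem))
                else:
                    self.draws_executed += 1
            else:
                raise ValueError(f"unknown command {op!r}")


stream = [
    ("SET_STATE", ("depth_write", True)),
    ("DRAW", "mesh_a"),                    # conflict: no depth buffer yet
    ("SET_STATE", ("depth_buffer", True)),
    ("DRAW", "mesh_b"),                    # state is now consistent
]

cp = CommandProcessor()
cp.run(stream)
print(cp.draws_executed, cp.rejected)
```

In this sketch the driver's only job is to emit the stream; every check runs inside `CommandProcessor.run`, which is the division of labor ATI describes.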

The command processor does a lot of this work and the chip is "self aware." The architecture allows it to snoop around to check what the other parts of the hardware are up to. An example of this is when the Z buffer checks in on how the pixel shader is doing. It is looking to see if it can kill pixels early. If a resource knows what is happening, it can switch to a mode that is the most compatible to accomplish its tasks.

Improvements that take advantage of this type of behavior can help explain why consoles are faster than PCs in certain applications. On the PC, the state of the application is always being checked, and there is a lot of overhead associated with this. If the application asks for a draw command to be sent with an associated state, the Microsoft runtime will check some of it and the driver will check some of it. Everything then has to be validated before it is finally sent down to the hardware. ATI feels that this overhead can be so significant that it moved as much of it as it could into the hardware.

This is what is generally referred to as the small batch problem. Microsoft's David Blythe commented that miscommunication over application requirements, differing processing styles and mismatches between the API and the hardware were the largest complaints from developers. He concluded that his analysis "failed to show any significant advantage in retaining fine grain changes on the remaining state, so we collected the fine grain state into larger, related, immutable aggregates called state objects. This has the advantage of establishing an unambiguous model for which pieces of state should and should not be independent, and reducing the number of API calls required to substantially reconfigure the pipeline. This model provides a better match for the way we have observed applications using the API." By lowering the number of draw calls, constant updates and other commands, DX10 has won back some of this overhead.
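The state object idea in Blythe's quote can be sketched in a few lines. The example below is a rough illustration, not the Direct3D 10 API: `BlendState`, `StateObjectPipeline`, `FineGrainedPipeline` and the list of allowed blend factors are hypothetical names made up for this sketch. It shows related settings grouped into one immutable aggregate that is validated once at creation and then bound with a single call, next to the older style of setting each value separately.

```python
from dataclasses import dataclass

# Hypothetical set of legal values, used only for this illustration.
ALLOWED_FACTORS = {"zero", "one", "src_alpha", "inv_src_alpha"}


@dataclass(frozen=True)
class BlendState:
    """Immutable aggregate of related blend settings (illustrative only)."""
    enabled: bool
    src_factor: str
    dst_factor: str

    def __post_init__(self):
        # Validation is paid once, when the object is created.
        if self.src_factor not in ALLOWED_FACTORS:
            raise ValueError(f"unknown src_factor {self.src_factor!r}")
        if self.dst_factor not in ALLOWED_FACTORS:
            raise ValueError(f"unknown dst_factor {self.dst_factor!r}")


class StateObjectPipeline:
    """One call swaps in a whole pre-validated group of state."""

    def __init__(self):
        self.blend = None
        self.api_calls = 0

    def set_blend_state(self, state):
        self.blend = state
        self.api_calls += 1


class FineGrainedPipeline:
    """Older style: every individual setting is its own call."""

    def __init__(self):
        self.state = {}
        self.api_calls = 0

    def set_render_state(self, key, value):
        self.state[key] = value
        self.api_calls += 1


alpha_blend = BlendState(True, "src_alpha", "inv_src_alpha")

new_style = StateObjectPipeline()
new_style.set_blend_state(alpha_blend)

old_style = FineGrainedPipeline()
old_style.set_render_state("blend_enabled", True)
old_style.set_render_state("src_factor", "src_alpha")
old_style.set_render_state("dst_factor", "inv_src_alpha")

print(new_style.api_calls, old_style.api_calls)
```

Because the object is frozen, it cannot change after it has been validated, so binding it again later needs no fresh checks. That is the property the "immutable aggregates" wording points at.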

Hardware advances can add a further reduction in driver overhead on the CPU. This can be "as much as 30%," says ATI. That does not mean applications will run 30% faster; it simply means that the typical CPU overhead of the driver will decrease. Depending on the application, the driver's CPU utilization can be as low as 1% or as high as 10-15%. On average it should be only 5-7%, so a 30% reduction would take only a few percentage points of load off the CPU. While this is not the Holy Grail for higher frame rates, it is a step in the right direction. There should be a benefit for current DX9 applications, and because DX10 was designed to be friendly from a small batch standpoint, it should see more benefit than DX9 in terms of CPU load reduction.
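The arithmetic behind that caveat is easy to check. The short sketch below takes the driver's share of CPU time quoted above (1%, 5-7% on average, up to 15%) and applies ATI's best-case 30% reduction. The function name `cpu_points_saved` is just an illustrative helper.

```python
def cpu_points_saved(driver_share_pct, reduction=0.30):
    """Percentage points of total CPU load freed when driver overhead shrinks."""
    return driver_share_pct * reduction


for share in (1.0, 5.0, 7.0, 15.0):
    saved = cpu_points_saved(share)
    print(f"driver at {share:4.1f}% of CPU -> about {saved:.2f} points saved")
```

For the 5-7% average case this works out to roughly 1.5 to 2.1 percentage points of total CPU load, which is why the saving is a modest gain and not a 30% speedup.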