HLSLs, Cg and the RenderMonkey

HLSLs Don't Change The Hardware

Although an HLSL is an easy way to write VS/PS effects, it doesn't remove the limitations that already exist on the hardware. More importantly, programmers using an HLSL are going to need to know what hardware they're writing for, and they're going to have to think very carefully about the constraints placed on that class of hardware.

With DX8 hardware, the most fundamental constraints for vertex shaders are the constant memory and instruction count limits. DX8 vertex shaders also have only a very limited amount of conditional processing, and no branch or jump instructions at all. This means that, in the future, people may well be writing HLSL code that carries out conditional branching, but they'll always need to think about how to remove those branches on the DX8 hardware target.
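
As a rough sketch of what that rewriting looks like in practice (the variable names here are invented, not taken from any particular shader), a conditional has to be turned into a straight-line select, for example with step() and lerp():

// A branch like this is fine in HLSL source, but DX8 vertex shaders
// have no branch or jump instructions to compile it to:
if (NdotL > 0.0f)
    colour = litColour;
else
    colour = shadowColour;

// The branch-free version the programmer has to be thinking of:
// step() gives 0 or 1, lerp() then selects between the two values,
// so the whole thing compiles to straight-line arithmetic.
colour = lerp(shadowColour, litColour, step(0.0f, NdotL));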

When you're using an HLSL, these limits are hidden from you. As a result, you won't really realize that you've hit something like the instruction limit until the compiler tells you. HLSLs can also complicate things for programmers who don't fully understand the hardware.

For example, take Cg's loop statements: DX8 vertex shaders don't allow looping, but Cg lets the programmer write loops that execute a constant number of times on this hardware. What really happens is that the loop is "unrolled" in software, and the compiled shader simply contains the body of the loop repeated as many times as the loop specified. So a loop that carries out four instructions four times actually takes up 16 instructions in the vertex shader. This can become a real problem if programmers don't realize the full ramifications of loop statements, particularly if they decide to loop over eight lights, carrying out eight instructions per light. That's killed off half of your potential instructions pretty darn quickly.
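
To put some numbers on it, here's a Cg-style sketch of exactly that kind of loop (the uniform names are invented for illustration). Each iteration is roughly eight instructions once normalize() and friends are expanded, so unrolling it eight times eats around 64 of the 128 instructions a vs_1_1 shader is allowed:

float3 accum = float3(0, 0, 0);
for (int i = 0; i < 8; i++)        // unrolled eight times by the compiler
{
    float3 toLight = normalize(lightPos[i] - worldPos);
    float NdotL = max(dot(normal, toLight), 0.0f);
    accum = accum + NdotL * lightColour[i];
}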

With pixel shaders, the limitations of the hardware are even tighter than with vertex shaders. To start with, pixel shader instructions are divided into two separate blocks: the first contains the texture addressing instructions, the second the arithmetic instructions, and the two blocks always execute completely separately. So the programmer loads up all of his textures first, then carries out arithmetic on the values read from them to produce the final output colour.
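
As a trivial, made-up example of that shape, a DX8 pixel shader that modulates a base texture by a light map looks like this; the texture reads all come first, followed by the arithmetic:

tex t0             // texture block: read the base texture
tex t1             //                read the light map
mul r0, t0, t1     // arithmetic block: modulate one by the other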

So, if you wanted to read two textures, combine them, and use the result to look up into a cubemap, it might be nice to be able to do something like this:

tex t0               // read texture 0
tex t1               // read texture 1
mad r0, t0, t1, c0   // combine the two reads (c0 = 0.5)
texload t2, r0       // use r0 to look up the cubemap

Unfortunately, this mixes the texture block with the arithmetic block. In fact, there is no way to explicitly load a texture using an intermediate result stored in a temporary register (let's ignore PS1.4 for now). Now, I can hear a few of you saying, "Aha, but what about dependent texture reads?" Well yes, dependent texture reads do exist in DX8 pixel shaders, but frankly, they're a bit of a pain in the arse to use.

To load up one bumpy reflection, we have to carry out the following instructions:

tex t0                 // read the normal map
texm3x3pad t1, t0      // first row of the 3x3 matrix multiply
texm3x3pad t2, t0      // second row
texm3x3vspec t3, t0    // third row, reflect the eye vector, look up the cubemap

This involves four texture instructions, two textures (the normal map and the cubemap) and all four sets of texture coordinates. The texture coordinates provide the matrix by which the normal from the normal map is rotated before it is used to look up the reflection. Unfortunately for us, there are only four sets of texture coordinates in total, so this one bumpy reflection has used up all of our possible textures in one go. Any further modification, such as a diffuse texture, will have to be done in a further pass.
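
For reference, it's the vertex shader that packs the rows of that matrix, along with the eye vector the reflection needs, into those texture coordinates. A rough sketch, assuming the matrix rows are already sitting in r3-r5 and the eye vector in r6 (register choices invented for illustration):

mov oT1.xyz, r3    // first row of the matrix in t1.xyz
mov oT1.w, r6.x    // eye vector is passed in the w components
mov oT2.xyz, r4    // second row
mov oT2.w, r6.y
mov oT3.xyz, r5    // third row
mov oT3.w, r6.z

The texm3x3 instructions in the pixel shader then rebuild the matrix from t1-t3 and pull the eye vector out of their w components.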

These sorts of limitations are not obvious when the programmer only has to write something like this to load the texture:

colour = tex_3d(normal, vec0, vec1, vec2, cubemap);

So the pixel shader programmer is always going to have to be aware of the underlying nature of the hardware they're writing for, which takes away some of the niceties of HLSLs being platform-independent.