Larrabee Versus GPU
As we saw earlier, apart from its texture units, Larrabee doesn’t look much like a GPU at all. There’s no sign of the other fixed-function units you usually find, such as the setup engine, which converts triangles into pixels and interpolates the vertex attributes, or the ROPs, which write pixels to the frame buffer, resolve anti-aliasing, and perform any necessary blending. In Larrabee’s case, all of these operations are performed directly by the cores. The advantage of this approach is greater flexibility. With blending, for example, it’s possible to handle transparency independently of the order in which primitives are submitted, something that is especially complicated on today’s GPUs. The downside is that Intel will have to give its GPU more raw processing power than its competitors, which can hand these tasks to dedicated units and leave the programmable units free to concentrate on shading.
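To make the blending example concrete, here is a minimal C++ sketch of order-independent transparency done entirely in software. It is not Intel’s renderer. The `Fragment` and `Color` types and the `resolvePixel` function are hypothetical names introduced for illustration, and the sketch assumes that all translucent fragments covering a pixel have already been collected. Because the fragments are sorted by depth before blending, the result no longer depends on the order in which the primitives were submitted.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical fragment record: one translucent surface sample covering a pixel.
struct Fragment {
    float depth;   // distance from the camera; larger means farther away
    float r, g, b; // non-premultiplied color
    float alpha;   // opacity in [0, 1]
};

struct Color {
    float r, g, b;
};

// Resolve a single pixel: sort its fragments far-to-near, then composite them
// over the background with the classic "over" operator.
Color resolvePixel(std::vector<Fragment> frags, Color background) {
    std::sort(frags.begin(), frags.end(),
              [](const Fragment& a, const Fragment& b) { return a.depth > b.depth; });

    Color out = background;
    for (const Fragment& f : frags) {
        assert(f.alpha >= 0.0f && f.alpha <= 1.0f);
        out.r = f.r * f.alpha + out.r * (1.0f - f.alpha);
        out.g = f.g * f.alpha + out.g * (1.0f - f.alpha);
        out.b = f.b * f.alpha + out.b * (1.0f - f.alpha);
    }
    return out;
}
```

Passing the same fragments in a different order yields the same color. A fixed-function blender, by contrast, simply applies fragments in arrival order, which leaves the sorting to the application.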
Although GPUs have become very flexible since the arrival of Direct3D, they’re still far from the flexibility Larrabee can offer. One of the main limitations of GPUs, even using APIs like CUDA, is memory access. As with the Cell, memory management is fairly constraining: to get optimum performance, the number of registers used has to be minimized, and the program has to work out of small memory areas of only a few kilobytes.
What’s more, despite the flexibility GPUs have gained, their feature set remains heavily oriented towards raw calculation. For example, there’s no question of performing I/O operations from a GPU. Larrabee, by contrast, is fully capable of it: it can directly call printf or perform file-handling operations. It can also use recursive and virtual functions, which is impossible with a GPU.
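As an illustration of the kind of code this paragraph has in mind, the following sketch is ordinary C++ with nothing Larrabee-specific in it. It combines a virtual function, a recursive traversal, and a printf call. The `Node`, `Mesh`, and `Group` types and the `buildAndCount` function are hypothetical names invented for this example.

```cpp
#include <cstdio>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical scene-graph node: virtual dispatch selects the right override.
struct Node {
    virtual ~Node() = default;
    virtual int triangleCount() const = 0;
};

// Leaf node holding a fixed number of triangles.
struct Mesh : Node {
    int triangles;
    explicit Mesh(int t) : triangles(t) {}
    int triangleCount() const override { return triangles; }
};

// Interior node: its count is the sum of its children's counts.
struct Group : Node {
    std::vector<std::unique_ptr<Node>> children;
    int triangleCount() const override {
        int total = 0;
        for (const auto& child : children) {
            total += child->triangleCount(); // recursion through a virtual call
        }
        return total;
    }
};

// Build a small two-level scene, count its triangles, and report via printf.
int buildAndCount() {
    auto inner = std::make_unique<Group>();
    inner->children.push_back(std::make_unique<Mesh>(100));
    inner->children.push_back(std::make_unique<Mesh>(30));

    auto root = std::make_unique<Group>();
    root->children.push_back(std::make_unique<Mesh>(12));
    root->children.push_back(std::move(inner));

    int total = root->triangleCount();
    std::printf("scene contains %d triangles\n", total);
    return total;
}
```

Code like this runs unchanged on any general-purpose x86 core, which is the point the text is making about Larrabee.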
Obviously, all of this functionality doesn’t come without cost, and there will necessarily be an impact on the program’s efficiency, since these features go against the paradigm of parallel programming. But that remains acceptable for code that isn’t performance-sensitive. Making this kind of code possible without using the CPU opens up some interesting possibilities. For example, modern GPUs include a just-in-time (JIT) compiler in their drivers to adapt shaders on the fly to the specific details of their architecture. Larrabee is no exception to the rule, but instead of running on the host CPU, the compiler runs directly on Larrabee.
Another interesting opportunity is that code can be debugged directly on Larrabee, whereas debugging CUDA code generally requires emulation on the CPU. In that case, the CPU emulates the GPU but doesn’t reproduce its behavior precisely, so certain bugs can be difficult to identify.