The slow take-off of GPGPU computing had less to do with niche markets than with problematic programming. Simply put, the world's software was written for CPUs, and shifting some of that code over to GPUs was anything but straightforward.
"Various specialized hardware designs, such as Cell, GPGPUs, and MIC, have gained traction as alternative hardware designs capable of delivering higher flop rates than conventional designs," notes IEEE author D.M. Kunzman in the abstract for the paper Programming Heterogeneous Systems. "However, a drawback of these accelerators is that they simultaneously increase programmer burden in terms of code complexity and decrease portability by requiring hardware specific code to be interleaved throughout application code...Further, balancing the application workload across the cores becomes problematic, especially if a given computation must be split across a mixture of core types with variable performance characteristics."
Not surprisingly, all of the traditional APIs built to interface with GPUs were designed for graphics. To make a GPU crunch general-purpose math, a programmer had to disguise computations as graphics operations on textures and rectangles. The great advance of OpenCL was that it dispensed with this workaround and provided a direct compute interface to the GPU. OpenCL is managed by the non-profit Khronos Group and is now supported by a wide range of industry players involved with heterogeneous computing, including AMD, ARM, Intel, and Nvidia.
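To see what that "direct compute interface" looks like, consider a minimal kernel written in OpenCL C, the device language defined by the standard. There is no graphics fiction here; the computation is expressed directly as a function over buffers (the kernel and parameter names below are illustrative):

```c
// OpenCL C device code: each work-item adds one pair of elements.
// No textures, no rectangles -- just arithmetic over plain buffers.
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *result)
{
    size_t i = get_global_id(0);  /* this work-item's global index */
    result[i] = a[i] + b[i];
}
```

The runtime launches one work-item per element, so the loop that a CPU version would need is implicit in the parallel launch.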
So, while OpenCL provides a software framework for heterogeneous computing, it still doesn't address the hardware side of the problem. Whether discussing servers, PCs, or smartphones, how should the hardware platform (as distinct from the CPU, GPU, and/or APU) perform heterogeneous computing? Clearly, platforms were not designed for this paradigm in the past. The typical computing device has one physical pool of system memory, yet the programmer must copy data from the CPU's memory space to the GPU's memory space, within that same pool, before the application can begin executing. The same is true for fetching the results back again. In a system with only one memory pool, repeatedly copying data between different areas of the same memory is highly inefficient.
This is where HSA comes in. HSA brings the GPU into a shared memory environment with the CPU as a co-processor. The application hands work directly to the GPU just as it does to the CPU, and the two cores can work together on the same data sets. With a shared memory space, the processors use the same pointers and addresses, making it much more efficient to offload even small amounts of work because all of that old copying overhead is gone.
In addition to unified memory, AMD notes that HSA establishes cache coherency between the CPU and GPU, eliminating the need to perform a DMA flush every time the programmer wants to move data between the two. The GPU is also now allowed to reference pageable memory, so the entire virtual memory space is available to it. Not least of all, HSA adds context switching, enabling quality of service. With these features in hardware, programming an HSA platform becomes very similar in style to programming a CPU.
"Shared memory makes the whole system much easier to program," adds AMD fellow Phil Rogers. "One of the barriers to using GPU compute today is a lot of programmers tell us they find it too hard. They have to learn a new API. They have to manage these different address spaces. They’re not sure when the right time is to copy the data. When you eliminate barriers like this across the board and enable high-level languages, you make it so much easier to program that suddenly you get tens of thousands of programmers working on your platform instead of dozens or hundreds. That’s a really big deal."