Scaling The Brick Wall
The AMD core team found itself with two fundamental problems, one technical and the other philosophical, and both had to be solved before anything could move forward.
"On a pure technology and transistor side, we had a conundrum on our hands," says AMD’s Macri. "What makes CPUs go really fast ends up burning a lot of power on a GPU. What makes GPUs go really fast will actually slow a CPU down incredibly. So the first thing we ran into was just getting them to live on the die together. We had the high-speed transistor combined with the very low-resistance metal stack that’s optimal for CPUs versus the GPU’s more moderate-speed transistor optimized around very dense metalization. If you look at the GPU’s metal stack, it looks like the letter T. It looks like the letter Z in a CPU. One’s low-resistance, one’s lower density, and so higher resistance. We knew we had to get these guys to live on the same die where they both perform very well, because no one’s going to give us any accolades if the CPU drops off or the GPU power goes up or performance falls. We needed to do both well. We very quickly discovered that wall."
Imagine the pressure on that team. With billions of dollars and the company’s future at stake, the group eventually realized that a hybrid solution couldn’t be built on the then-current 45 nm process, which was too heavily optimized for CPUs. Understanding that, the question became how to tune 32 nm silicon-on-insulator (SOI) so that it would effectively play both sides of the fence. Of course, 32 nm didn’t exist outside of the lab yet, and much of what finally defined the 32 nm node for AMD grew from the Fusion pursuit.
Unfortunately, until the 32 nm challenge was solved, Fusion was at a standstill—and it took a year of work to reach that solution. Only then could design work begin.
Meanwhile, the Fusion team was also fighting a philosophical battle. The transistor and process struggle had been massive, but at least there the team knew where it needed to go and what the finish line looked like. Even with the transistor challenge figured out, the question remained of how best to architect an APU.
"One view was like, the GPU should be mostly used for visualization. We should really keep the compute on the CPU side," says Macri. "A second view said, no, we’ve gotta split the views across the two halves. We’ve got this beautiful compute engine on the GPU side; we need to take advantage of it. There were two camps. One said things should be more tightly coupled between the CPU and GPU. Another camp said things should be more loosely coupled. So we had to have this philosophical debate of deciding what we should treat as a compute engine. Through a lot of modeling, we proved that there was an enormous advantage to a vector engine when you have inherent parallelism in your code."
This might have seemed obvious from ATI’s prior work with Stream, but the question was how much work to throw at the GPU. Despite being highly parallel, GPUs remain optimized for visualization. They can process traditional parallel compute tasks, but doing so introduces overhead, and that overhead cuts into visualization performance. Given infinite transistors on the die, one could simply keep throwing resources at the problem. But, of course, there are only a few hundred million transistors to go around.
"Think of all the applications of the world as a bathtub," says Macri. "If you look at the left edge of the bathtub, we call those applications the least parallel, the ones with the least amount of inherent parallelism. A good example of that would be pointer chasing, right? You need a reference. You need to go grab that memory to figure out the next memory you gotta go grab. No parallelism there at all. The only way to parallelize is to start guessing–prediction. Then, if you go to the right edge of the bathtub, matrix multiply is a great example of a super-parallel piece of code. Everything is disambiguated very nicely, read and write stream is all separate, it’s just beautiful. You can parallelize that out the wazoo. For those applications, it’s very low overhead to go and map that into a GPU. To do the left side well, though, means building a low-latency memory system, and that would load all kinds of problems into a GPU that really wants a high-bandwidth, throughput-optimized memory system. So we said, 'How do we shrink the edges of the bathtub?' Because, the closer we could bring those edges, the more programs we could address in a very efficient way."
A big part of the philosophical debate boiled down to how much to shrink those bathtub edges while preserving all of AMD’s existing visualization performance. Naturally, though, while all of this debate was happening, AMD was getting hammered in the market.