Q&A: Under The Hood With Adobe, Cont.
Tom's Hardware: Within Photoshop, what limits exist in terms of what you can do with these APIs?
Russell Williams: Some things we can look at and know they're not suited for OCL or the GPU in general. In other cases, it's only after we expend the effort of implementing something that we discover we're not going to get a speed-up that justifies the work. It's well known that the GPU is completely suited for certain kinds of things and completely unsuited for others. I believe it was AMD that told us that while a GPU can speed up the tasks it is suited to by several hundred times compared to the CPU, a problem it is ill-suited to, something inherently sequential, might run 10 times slower.
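To make that distinction concrete, here is a toy Python sketch (an illustration for this article, not Photoshop code): a per-pixel brightness adjustment has no dependencies between pixels, so every element could run on a separate GPU thread, while a running filter in which each output depends on the previous one cannot be split up that way.

```python
# Toy illustration of GPU-friendly vs. GPU-hostile work (hypothetical code,
# not taken from Photoshop).

def brighten(pixels, amount):
    """Data-parallel: each output pixel depends only on its own input,
    so all pixels could be computed simultaneously on a GPU."""
    return [min(255, p + amount) for p in pixels]

def running_blur(pixels, weight=0.5):
    """Inherently sequential: each output depends on the PREVIOUS output,
    so the loop cannot be spread across thousands of GPU threads."""
    out = []
    prev = 0.0
    for p in pixels:
        prev = weight * p + (1 - weight) * prev
        out.append(prev)
    return out

print(brighten([10, 250, 100], 20))   # → [30, 255, 120]
```

The first function is the kind of operation that can see hundred-fold GPU speed-ups; the second is the kind that, as Williams notes, can end up slower on a GPU than on a CPU.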
Some people think, “If the GPU is so much faster, then why not do everything on the GPU?” But the GPU is only suited for certain operations. And every operation you want to run there has to be re-implemented. For instance, we accelerated the Liquify filter with OGL, not OCL, and that makes a tremendous difference. For large brushes, it goes from 1 to 2 FPS to being completely fluid, responsive, and tracking with your pen. That kind of responsiveness while modifying that much data could only be achieved on a GPU. But it took one engineer most of the CS6 product development cycle to re-implement the whole thing.
Tom's Hardware: Which gets us back to why only one feature was implemented in OCL this time around. You don't have an infinite number of developers and only one year between versions.
Russell Williams: That's right. And, of course, we have an even more limited supply of developers that already know OCL.
Tom's Hardware: Did graphics vendors play a role in your OpenCL adoption? Education, tools, and so forth?
Russell Williams: We didn't have much input on creating the tools they had. But they gave us a tremendous amount of help in both learning OpenCL and in using the tools they have given us. Both Nvidia and AMD gave us support in prototyping algorithms, because it's in both of their interests to drive more use of the GPU. For us, the big issue is where the performance is. We can't count on a particular level of GPU being in a system. Many systems have Intel integrated graphics, which has more limited GPU and OpenGL support, and no OpenCL support. A traditional C-based implementation has to be there, and it's the only thing we can count on being there. On the other hand, if something is performance-critical, the GPU is really where most of the compute power is in the box.
Beyond that, AMD had their QA/engineering teams constantly available to us. We had weekly calls, access to hardware for testing, and so on. Nvidia and Intel helped, too, but AMD definitely stepped up.
Tom's Hardware: So which company has the better products, AMD or Nvidia? [laughs]
Russell Williams: You're Tom's Hardware. You know that depends on what you're running and which week you ask the question—and how much money you have in your pocket. That's why it is so critical for us to support both vendors. Well, three if you include Intel integrated graphics, which is starting to become viable.
Tom's Hardware: At some point, performance bottlenecks are inevitable. But how far out do you look when trying to avoid them? Do you say, “Well, we’re already getting five times better performance—that’s good enough!” Or do you push as far as possible until you hit a wall?
Russell Williams: We do think about that, but it’s very hard. It’s impossible to know quantitatively, ahead of time, what those will be. We know qualitatively that we should spend a lot of time on bandwidth issues. Photoshop is munging pixels, so the number of times pixels have to be moved from here to there is a huge issue, and we pay attention every time that happens in the processing pipeline. Quite often, that’s more of a limiting factor than the computation being done on the pixels. In particular, when you have discrete graphics, there is an expensive step of moving the pixels to the graphics card and back.
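A rough sketch of why the number of pixel moves matters so much (illustrative numbers, not Adobe's figures): every separate pass over an image reads and writes all of its pixels, so memory traffic scales with the number of passes even when the per-pixel math is cheap.

```python
# Back-of-envelope model of memory traffic in a pixel pipeline.
# The function and the figures below are illustrative assumptions.

def bytes_moved(num_pixels, bytes_per_pixel, num_passes):
    """Each pass reads and writes every pixel once: 2 moves per pass."""
    return 2 * num_pixels * bytes_per_pixel * num_passes

# A 50-megapixel image at 4 bytes per pixel, run through 3 separate filters:
separate = bytes_moved(50_000_000, 4, 3)   # 1.2 GB of memory traffic
fused    = bytes_moved(50_000_000, 4, 1)   # 0.4 GB if the passes are fused

print(separate, fused)   # → 1200000000 400000000
```

This is why moving pixels, rather than computing on them, is so often the limiting factor Williams describes.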
Tom's Hardware: So the bus is usually your bottleneck?
Russell Williams: Yes, the PCI bus. I really expect that in the future, APUs will require us to rethink which features can be accelerated, particularly once APUs start using what some people call zero-copy. When you have an APU, you don't have to go across that long-latency PCI bus to get to the GPU. Right now, you still have to go through the driver, and it still copies from one place in main memory to another: from the space reserved for the CPU to the space reserved for the GPU. They're working on eliminating that step. And as they make that path more efficient, it becomes more and more profitable to do smaller and smaller operations on the APU, because the overhead on each one is smaller.
On the other hand, APUs are not as fast as discrete GPUs. In some ways, it comes down to on-card memory bandwidth versus main-memory bandwidth. But you can also think of it in terms of power budgets, with discrete cards that are sucking down several hundred watts by themselves. You have to keep copying large things across this small pipe to this hairdryer of a compute device. It just depends. There has to be enough computation involved to pay for copying it out across the PCI bus and bringing it back.
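That break-even can be put into a simple model. The sketch below is a hypothetical back-of-envelope calculation with assumed numbers (an effective PCIe rate of 6 GB/s, a 20x GPU speed-up), not measured Photoshop data: offloading pays only when the compute time saved exceeds the round-trip copy cost.

```python
# Rough break-even model for GPU offload over a PCI bus.
# All parameters below are illustrative assumptions, not measured figures.

def is_gpu_worthwhile(data_bytes, cpu_seconds, gpu_speedup, pcie_gb_s=6.0):
    """True if total GPU time (copy out + compute + copy back) beats the CPU."""
    transfer = 2 * data_bytes / (pcie_gb_s * 1e9)   # copy out and back
    gpu_total = transfer + cpu_seconds / gpu_speedup
    return gpu_total < cpu_seconds

# A heavy filter (2 s on the CPU, 20x faster on the GPU) on a 200 MB image:
print(is_gpu_worthwhile(200e6, 2.0, 20))      # → True
# A trivial 5 ms operation on the same image: the copy alone dwarfs it.
print(is_gpu_worthwhile(200e6, 0.005, 20))    # → False
```

Zero-copy on an APU effectively shrinks the `transfer` term toward zero, which is why, as Williams says, smaller and smaller operations then become profitable to offload.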