Are Cell's SPEs really redundant logic units?

Could 10 - 20% yields for Cell processors lead to problems for Sony PS3?

Whether Reeves' logic holds up depends largely on whether each SPE in a Cell processor can be considered a redundant part. In an August 2005 interview, Cell's principal designer at IBM, Dr. H. Peter Hofstee, explained to us in extensive detail the differences between a synergistic processor element (SPE) in a Cell and a core in an Intel or AMD processor. A Cell comprises a single Power processor element (PPE) that is essentially a current-generation Power processor. It is not replicated. The other processing elements - the SPEs - are there to handle what is called scalar code: tasks that involve repetitive, iterative operations, such as shading a texel or dividing a complex number. The SPEs, it was made clear to us, are not replicas of the PPE.

One term used to describe the instruction set the Cell uses is single-instruction/multiple-data (SIMD). Essentially, a SIMD instruction applies a single operation to multiple sets of data, and those multiple sets can then be processed independently in a processor geared to handle scalar code. A GPU accomplishes this by implementing multiple pipelines - Nvidia's 7900 GTX uses 24 pixel pipelines and 8 vertex pipelines. Cell accomplishes this using SPEs. However, as is the case with graphics processing, the data flow itself has not been multiplied. In other words, the processor is still doing one thing, but breaking up the steps in between and delegating them to the SPEs.
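To make the single-instruction/multiple-data idea concrete, here is a minimal sketch in plain Python. It only models the concept and is not Cell or SPE code: the helper name `simd_apply` and the choice of four lanes are assumptions made for illustration.

```python
from operator import add, mul

def simd_apply(op, lanes_a, lanes_b):
    """Apply ONE operation to every pair of lanes, the way a single
    SIMD instruction acts on several data elements at once."""
    if len(lanes_a) != len(lanes_b):
        raise ValueError("both operands must have the same number of lanes")
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]

# Four lanes per operand (an assumed width, chosen only for the example).
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

sums = simd_apply(add, a, b)      # one "add" issued, four results produced
products = simd_apply(mul, a, b)  # one "multiply" issued, four results produced
print(sums, products)
```

A real SIMD unit performs the lane-wise work in hardware in a single step. The loop here simply stands in for that parallelism.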

"One common perception that I think is not accurate," Dr. Hofstee told us at that time, "is that, because the synergistic processors have a single data flow - which is SIMD - a lot of people seem to think that you can only use SPEs appropriately for problems that are SIMD parallel. I think that is a misperception." In Cell architecture, there are no multiple caches for the multiple SPEs, nor are there multiple register sets. Instead, there's one big single-register set with 128 registers that can be accessed by all the SPEs at all times. And replacing the multiple caches is something Dr. Hofstee refers to as a local store, the middle tier of a three-tier memory architecture that lets the SPEs access a single, bigger pool of memory.

"The reason we went to the local store, three-level memory hierarchy - registers, local store, and shared memory," Dr. Hofstee explained, "is something called the memory wall: the fact that microprocessors have gotten faster by a factor of 1000 in the last 20 years, and latency hasn't gone down all that much." He referred to a principle named for Intel engineer Pat Gelsinger, called "Gelsinger's Law," a corollary of Moore's Law which states that each time the number of transistors on a processor is doubled, the processor delivers not double the performance but only about 40% more. It was this "law" which helped drive Intel - and AMD - toward multicore architecture in the first place.
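The arithmetic behind that principle can be sketched as follows. The 40% figure comes from the text above. The function names are hypothetical, and the multicore figure assumes a perfectly parallel workload with no contention, so it is an upper bound rather than a prediction.

```python
def single_core_speedup(doublings, gain_per_doubling=0.40):
    """Speedup if each transistor doubling is spent on one bigger core
    and yields only about 40% more performance."""
    return (1.0 + gain_per_doubling) ** doublings

def ideal_multicore_speedup(doublings):
    """Upper bound if each doubling instead adds identical cores and the
    workload scales perfectly (an idealized assumption)."""
    return 2.0 ** doublings

for n in (1, 2, 3):
    print(n, round(single_core_speedup(n), 2), ideal_multicore_speedup(n))
```

After two doublings, the single larger core is a little under twice as fast, while the idealized multicore design is four times as fast. That gap is what made replicating cores attractive.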

Clearly the goal of multicore architecture from this vantage point was not to create "redundant logic," but rather multiplied logic - a way of doubling the horsepower and achieving something closer to double the performance. But as Dr. Hofstee explained, the portions of the processor that a manufacturer chooses to replicate could easily end up contending with one another for priority when it comes time for them to share a single computer system. Case in point: when two cores want the same area of memory.

"When you have a miss and you have to wait for memory, I sort of compare it to this 'bucket brigade,'" remarked Dr. Hofstee. "You might have 100 people in the bucket brigade between the core (fire) and the memory (water), but if you only have five outstanding fetches - a bucket brigade with 100 people and five buckets - it just isn't going to be very efficient. So in conventional microprocessors, if I take a conventional microprocessor and I double the memory bandwidth, I might only see a very incremental performance improvement, because in fact, delivered memory bandwidth is limited by the latency induced." In other words, if you replicate everything, you create new latencies when everything has to work together. Therefore, you don't replicate everything - not in smart processor architecture.
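The bucket brigade can be put into numbers with Little's law: sustained throughput equals the amount of data in flight divided by the latency. The sketch below is illustrative only. The five outstanding fetches echo the quote, while the fetch size, latency, and peak bandwidth figures are assumptions and not measurements of any real processor.

```python
def delivered_bandwidth_gb_s(outstanding_fetches, bytes_per_fetch,
                             latency_ns, peak_gb_s):
    """Little's law: bytes in flight / latency, capped at the bus peak.
    Bytes per nanosecond is numerically equal to GB/s (10**9 bytes/s)."""
    latency_bound = outstanding_fetches * bytes_per_fetch / latency_ns
    return min(peak_gb_s, latency_bound)

# Assumed figures: 5 fetches in flight, 128 bytes each, 100 ns latency.
base = delivered_bandwidth_gb_s(5, 128, 100.0, peak_gb_s=25.6)
doubled_bus = delivered_bandwidth_gb_s(5, 128, 100.0, peak_gb_s=51.2)
more_buckets = delivered_bandwidth_gb_s(20, 128, 100.0, peak_gb_s=25.6)
print(base, doubled_bus, more_buckets)
```

Under these assumptions, doubling the peak bandwidth changes nothing, because the limit is set by how many fetches are in flight. Adding "buckets" (more outstanding fetches) is what raises the delivered figure.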

So what does this mean, applied to the numbers IBM's Tom Reeves provided last week? Certainly there are replicated elements in the Cell processor, and we've learned that even a horsepower-hungry PS3 doesn't need all of them. But if two SPEs aren't really the same as two cores, then perhaps the number of anticipated defects need not double, or otherwise be compounded by the number of SPEs. In short, eight SPEs should not necessarily mean eight times the number of defects, any more than doubling the number of transistors (in accordance with Moore's Law) should reduce the yield rate below the 95% mark which foundries, at one time, enjoyed.
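For comparison, here is a sketch of the simple "compounding" view that this argument questions. It treats each SPE as failing independently with the same probability, which is exactly the assumption under dispute. The per-SPE yield of 0.8 and the seven-of-eight requirement are hypothetical inputs chosen for illustration, and the model ignores the PPE and all other logic on the die.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability that at least k of n independent units are good,
    when each unit is good with probability p (binomial model)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

per_spe_yield = 0.8  # hypothetical value, not a figure from IBM or Sony

all_eight = prob_at_least(8, 8, per_spe_yield)        # every SPE must work
seven_of_eight = prob_at_least(7, 8, per_spe_yield)   # one spare SPE tolerated
print(round(all_eight, 3), round(seven_of_eight, 3))
```

In this toy model, tolerating a single bad SPE roughly triples the fraction of usable chips. Whether real Cell defects behave this way depends on the point raised above, namely whether SPEs can be treated as independent, interchangeable parts.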

Thus, if the Cell processor really is lucky to see 10% to 20% yields, as Reeves indicated, then by Dr. Hofstee's explanation there must be some other reason for it. Nonetheless, "the real surprise here is that Reeves gave an estimate of the actual yield," Insight64's Nathan Brookwood told us. "Semi guys normally won't say anything about yield, on or off the record."
