AMD Senior Fellow Kevin Lepak took to the stage at Hot Chips 2017 to explain the reasoning behind EPYC's MCM (Multi-Chip Module) design and to remind us that the company decided to use multiple die very early in the design process.
Intel presented the now-famous EPYC slide deck during its Purley server event. Intel claimed that AMD merely uses "glued together desktop die" for its EPYC data center chips, and that the multi-chip design, which also makes an appearance in AMD's Threadripper models, suffers from poor latency and bandwidth that hamstring performance in critical workloads. Surprisingly, Barclays downgraded AMD's stock shortly thereafter, and AMD suffered a short-term stock slump as a result. Although Barclays didn't directly cite Intel's claims, its reasoning (and explanation) was eerily similar.
But AMD has tailored its design to address some of the challenges associated with an MCM architecture and claims that its design provides a 41% cost reduction compared to a single monolithic die. Let's dive in.
MCM Delivers A 41% Cost Reduction
As we covered in our Threadripper 1950X review, AMD CEO Lisa Su tasked her team with developing a cutting-edge data center processor to challenge Intel's finest. The team realized early on that a single monolithic die couldn't deliver on the company's performance, memory, and I/O goals. Lepak revealed that the decision also stemmed from the company's cost projections. Lepak presented a mock-up of a monolithic EPYC processor and compared manufacturing costs between the two techniques.
AMD projects that a single monolithic EPYC die would weigh in at 777mm², whereas the four-die MCM requires 852mm² of total die area. AMD claims the roughly 10% die area overhead is relatively tame. The company specifically architected the Zeppelin die for an MCM arrangement, so it focused on reducing the die overhead of replicated components. For instance, each of the four Infinity Fabric links consumes only 2mm² of die area.
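The overhead figure follows directly from the two areas AMD quoted. As a quick check, here is a minimal Python sketch of the arithmetic; the variable names are ours, and the per-die area is simply the quoted total divided evenly by four, which assumes four identical die.

```python
# Die areas quoted in AMD's Hot Chips presentation (mm^2).
MONOLITHIC_MM2 = 777.0   # projected single-die EPYC
MCM_TOTAL_MM2 = 852.0    # four Zeppelin die combined
DIE_COUNT = 4

# Assumes the total is split evenly across four identical die.
per_die_mm2 = MCM_TOTAL_MM2 / DIE_COUNT

# Extra silicon the MCM spends relative to the monolithic projection.
overhead = (MCM_TOTAL_MM2 - MONOLITHIC_MM2) / MONOLITHIC_MM2

print(f"Per-die area: {per_die_mm2:.0f} mm^2")
print(f"MCM area overhead: {overhead:.1%}")
```

That works out to 213mm² per die and an overhead just under 10%, in line with AMD's rounded figure.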
Each of AMD's Zeppelin die contains memory, I/O, and SCH (Server Control Hub, similar to a Northbridge) controllers, but the company removed those redundant parts from its monolithic die cost projection. The company also removed the Infinity Fabric links, as those obviously aren't needed for a single chip.
It's reasonable to assume that the MCM's larger overall die area would result in higher costs, but AMD claims the design actually reduces cost by 41%. All die suffer from defects during manufacturing, but larger die are more susceptible because a single defect can ruin far more silicon. Smaller die produce better yields, thus reducing the costly impact of defects. AMD can configure around defects in the cores or cache by disabling the unit and using the die for lower-priced SKUs, but defects that fall into I/O lanes or other critical pathways are usually irreparable.
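To see why more total silicon can still cost less, consider a simple Poisson defect model, a common textbook approximation in which yield falls exponentially with die area. The sketch below is illustrative only: AMD did not disclose its cost model, and the defect density is a hypothetical value we picked, not an AMD or foundry figure. It also ignores packaging, test costs, wafer-edge losses, and the salvaging of partially defective die described above.

```python
import math


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free die under a simple Poisson defect model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def silicon_cost(area_mm2: float, die_count: int, defects_per_cm2: float) -> float:
    """Relative silicon cost: wafer area consumed per set of good die."""
    return die_count * area_mm2 / poisson_yield(area_mm2, defects_per_cm2)


# Hypothetical defect density (defects per cm^2); NOT an AMD or foundry number.
DEFECT_DENSITY = 0.1

mono = silicon_cost(777.0, 1, DEFECT_DENSITY)       # one 777 mm^2 die
mcm = silicon_cost(852.0 / 4, 4, DEFECT_DENSITY)    # four 213 mm^2 die
reduction = 1.0 - mcm / mono

print(f"Yield, 777 mm^2 die: {poisson_yield(777.0, DEFECT_DENSITY):.1%}")
print(f"Yield, 213 mm^2 die: {poisson_yield(213.0, DEFECT_DENSITY):.1%}")
print(f"Modeled silicon cost reduction for the MCM: {reduction:.1%}")
```

Under these assumptions the four small die come out substantially cheaper than the single large one, despite the extra area. The exact percentage swings with the defect density you plug in, so treat the output as a demonstration of the trend rather than a reconstruction of AMD's 41% claim.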
Each die wields four Infinity Fabric links. AMD uses only three links per die to minimize trace length, and thus latency. As you can see, the activated links vary based on the location of the die inside the MCM. Logically, the Threadripper models use only two Infinity Fabric links because they have only two die.
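The three-active-links figure for EPYC is what you would expect if every die in the four-die package connects directly to each of its three neighbors. The short Python sketch below counts the links for that case; the die labels are hypothetical, and the direct die-to-die topology is our assumption, consistent with the numbers above but not spelled out in the presentation as described here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical labels for the four Zeppelin die in one EPYC package.
DIE = ["die0", "die1", "die2", "die3"]

# Assumption: every pair of die gets one direct Infinity Fabric link.
links = list(combinations(DIE, 2))

# Count how many links terminate at each die.
links_per_die = Counter(die for link in links for die in link)

print(f"Total die-to-die links: {len(links)}")
for die, count in sorted(links_per_die.items()):
    print(f"{die}: {count} active links")
```

With four die, that yields six in-package links and three active links on each die, leaving the fourth link on every die unused.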
Each die also has two I/O controllers to maximize bandwidth. One feeds into the "G" blocks at the top of the diagram for communication between processors, while the other feeds into the "P" banks at the bottom for PCIe connections. AMD noted that the distributed I/O approach ensures consistent performance scaling in two-socket servers. Threadripper likely has a somewhat different arrangement because it doesn't need to communicate with another processor, which probably leaves only one I/O controller active per die for the PCIe lanes.
Memory throughput and latency can suffer due to MCM architectures. In fact, that's one of Intel's key arguments in its infamous slide deck. AMD presented DRAM bandwidth tests outlining performance in various configurations. The "NUMA Friendly" bandwidth represents memory accesses to the die's local memory controller, while "NUMA Unaware" measures memory traffic flowing over the Infinity Fabric from a memory controller connected to another die.
Obviously, AMD was aware of the memory throughput challenges associated with an MCM design, so it overprovisioned the memory subsystem to accommodate the complexity. Bandwidth varies by only 15% at full saturation. Notably, throughput scales well with limited variation between the different types of access during lighter workloads.
Peer-to-peer (P2P) communication between GPUs is critical for AI workloads, which are among the fastest-growing segments in the data center. AMD's EPYC has an SCH, which is similar to an integrated Northbridge. AMD's switching mechanism inside the processor can reroute device-to-device communication without sending it through the processor's memory subsystem, so it functions very much like a normal switch. This allows EPYC platforms to provide a full 128 PCIe 3.0 lanes without using external switches, which reduces cost and complexity. Of course, that doesn't mean much if it doesn't provide the same performance.
AMD presented performance data from a single-socket server that shows EPYC offers solid P2P performance when the data flow traverses the Infinity Fabric. The company also presented DMA performance metrics. The "Local DRAM" column quantifies performance when a GPU performs a DMA access to the memory controller connected to the same die, while the "die-to-die" column measures performance with a DMA request to another die (across the Infinity Fabric). As you can see, performance is similar and even better in some cases. AMD divulged that the Infinity Fabric, which is an updated version of the coherent HyperTransport protocol, holds the directory tables in a dedicated SRAM buffer and also supports Multi-Cast probe.
AMD also presented memory throughput and scaling benchmarks that show an impressive lead over Intel's processors, but the company is still comparing to Broadwell-era processors. As we found in our review, Intel's latest data center processors offer a big generational leap in memory throughput. Lepak explained that the company has had difficulties sourcing Intel's Skylake-based Purley processors but is working to provide updated comparisons. (We've included the test notes in a click-to-expand format at the end of the article.)
Although AMD has taken the high road and not responded directly to Intel's EPYC slide deck, its Hot Chips presentation seems to dispute a few of Intel's key arguments. First and foremost, AMD designed the die specifically for the data center. Intel's "glue" remark refers to glue logic, which is an industry term for interconnects between die (in this case AMD's Infinity Fabric). In either case, the insinuation that AMD is merely using desktop silicon for the data center certainly has a negative connotation. However, AMD's tactic is innovative and reduces cost. As we stated in our piece covering Intel's deck:
This mitigates the risk that goes into manufacturing complex, monolithic processors, potentially improving yields and keeping costs in check. It also helps the company increase volume at a time when it's going to want plenty of supply. It's a smart strategy for a fabless company that doesn't have Intel's R&D budget to throw around.
Interestingly, Intel's Programmable Solutions Group outlined the company's EMIB (Embedded Multi-die Interconnect Bridge) technology at the show, too. EMIB provides a communication pathway between separate chips to create a unified processing solution, which Intel considers a key technology for its next-gen processors. Although the approaches are different, the motivations behind Intel's EMIB and AMD's Infinity Fabric are similar, which AMD feels validates its approach.
In either case, AMD is humming right along with its EPYC processors, and a wide range of blue-chip OEMs and ODMs have platforms coming to market. AMD also announced at the China EPYC technology summit that Tencent will deploy EPYC solutions by the end of the year and JD.com will deploy in the second half of the year. We expect more announcements in the future as the war for data center sockets rages on.