AMD's struggle with unbootable Epyc Naples and Rome prototype CPUs revealed in its new Advanced Insights series

AMD Advanced Insights Ep1
(Image credit: AMD)

AMD has published the first episode of a new YouTube video series dubbed Advanced Insights, with AMD CTO Mark Papermaster serving as the host. This episode centered on AMD’s disruptive entry and seemingly inexorable growth in the data center. However, it hasn’t been all plain sailing, as Papermaster and his guest publicly discussed some early teething troubles with booting Epyc Naples and Rome prototype chips for the first time.

During the series' first outing, Mark Papermaster chatted with the EVP & GM of the Data Center Solutions Business Unit at AMD, Forrest Norrod. The first guest was there to talk about disrupting markets, and AMD’s last nine or so years in the server business were used as an illustration of the power of disruptive technology.

Unbootable Epyc CPUs

AMD didn’t get to where it is today in the server market without experiencing a few speed bumps. Norrod recalled that when the first Epyc Naples chip samples arrived, they couldn’t boot up. It isn’t clear from the interview what part of the Epyc chip caused this immediate and drastic hurdle in the labs. However, the initially flustered engineers managed to get things ticking along after applying some “ingenuity and perseverance.”

When the first Epyc Rome chips arrived in AMD’s test labs, the engineers were again faced with a no-boot problem. This time, the culprit was an issue in the early chip sample’s memory access. Again, frantic and highly technical work managed to get the chip into a testable booting state.

When drastic bugs like this are discovered, Norrod says he applies a 72-hour rule to avoid any knee-jerk reactions. People still get busy firefighting issues, but a few days later, things typically look better, and a clearer strategy can be devised, reckons the 35-year veteran exec. With luck, the AMD engineers will sometimes have fully solved the issues by the time the 72 hours are up.

AMD Forrest Norrod

(Image credit: AMD)

Zen differentiation through the generations

Throughout the Advanced Insights video, the discussion highlights Zen-based disruptions to the server market over generations. About nine years ago, when the Zen family was new, AMD dearly wanted a competitive CPU core – plus some differentiation. With Epyc Naples, the first Zen core CPUs for data centers were delivered, and they brought with them a boost in memory channels, I/O, and cores. Epyc Naples lived up to HPC expectations, which was hugely important, asserted Papermaster.

Subsequently, Epyc Rome was disruptive as AMD’s first chiplet-based server processor. The key to its success was choosing the right semiconductor play for each chiplet, the discussion implied. On the topic of chiplets, Norrod was keen to praise Sam Naffziger as “the godfather of chiplets.” 

Chiplets allowed AMD to keep scaling, embrace new tech quicker, and resolve non-uniform memory access (NUMA) issues – confounding AMD’s critics. Memory and I/O have to scale with core count to “feed the beast,” stressed Papermaster in agreement with Norrod. Infinity Fabric also played an important part here.

Epyc Milan was characterized as yet another inflection point. With this new generation, AMD says it put poor single-thread comparisons behind it, and Norrod is confident that AMD secured “leadership across the board” with this generation. AMD could now present customers with evidence of a predictable, trustworthy roadmap from a company that would stay in the business for the long haul.

Other disruptions that Papermaster and Norrod reckon have been behind the continued success of AMD Epyc processors and servers include:

  • Security – confidential computing for servers. Norrod says he first heard about this differentiator highlighted in discussions about AMD’s game console chips. Introducing the new (at the time) PlayStation and Xbox chips, engineers talked about compelling security features without performance impacts, and without software modifications, prompting Norrod to declare “holy cow.”
  • TCO – Norrod asserts that 80% of customers “should be on a single socket [servers] today.” Considering workloads, this is correct, but fighting human nature with education is quite a grind. The AMD execs reckon it is only a matter of time, though, as the Epyc CPUs are claimed to have big TCO benefits.

Last but not least, the episode couldn’t be complete without some mention of artificial intelligence. AI is the next disruption, reckon the AMD execs. Papermaster said that he has “never been so excited than by the opportunity I see right now.” His excitement apparently stems from AMD’s depth and breadth of talent and experience.

Norrod interjected to remind viewers that AI processing presents a massive math problem, requiring enormous matrix and vector calculations, reliant on the best GPU, memory, IO, network, CPU, and storage technologies available to be competitive. “There is no better company than AMD to take all that on,” concluded Norrod.

Mark Tyson
Freelance News Writer

Mark Tyson is a Freelance News Writer at Tom's Hardware US. He enjoys covering the full breadth of PC tech; from business and semiconductor design to products approaching the edge of reason.

  • TechLurker
    I like background insight stories like this. GN also did a similar deep dive with AMD on their campus too, and it was pretty cool learning some of the stories behind the impressive revival of AMD and some of the tools they needed to debug their prototypes.
  • setx
    Can this disruptive text be any more disruptive?
  • missingegg
    I'm an engineer, and have worked on CPU chips. People unfamiliar with the field often don't realize how challenging an engineering endeavor it can be. While simulation tools are excellent, they're not entirely accurate, and it's also difficult to incorporate everything in a real system into the simulation. As a result, it's actually extremely common for initial sample chips to not function correctly and to need revision. Tools exist to modify sample chips, in order to experiment and try out different modifications to fix a problem (e.g. altering the metal layers with a focused ion beam).