AMD's EPYC Rome Chips Crash After 1,044 Days of Uptime

Ryzen die
(Image credit: Fritchenz Frenz)

AMD's latest processor revision guide for the EPYC 7002 'Rome' server chips reveals an interesting new bug (erratum) that can cause a core on the chip to hang after 1,044 days of uptime (~2.86 years), after which you'll have to reset the server for the chip to run correctly. AMD says it will not fix the issue.

AMD's description of the issue, which impacts its second-gen EPYC processors (AMD's fourth-gen Genoa chips are the newest), is succinct, but there's a lot to unpack.

AMD

(Image credit: AMD)

Chip Errata Are Common, But Not Great

With billions of transistors in play, issues are inevitable: It isn't uncommon for a chip to have a thousand or more errata/bugs that are corrected in newer steppings of the chip or with firmware tweaks before launch. These errata can encompass all types of bugs, from security holes to malfunctioning flags and cache tags that don't operate correctly, and the chipmakers do their best to stomp them out before launch.

However, some errata always remain, even in shipping chips. For instance, Intel's 8th-gen chips, which launched in 2017, still have more than 150 listed errata. We don't know how many errata the Rome chips have had in total because AMD has removed the listings for errata that have been solved. However, we do know that 39 errata remain, which actually doesn't seem too bad against the Intel backdrop.

Some errata are left unrepaired simply because they pose no harm, but aside from critical errata that could leave an attack vector open, some functionality-related errata are simply never patched. The chipmaker weighs factors such as the severity of the erratum, the ease of fixing it, and whether there are even enough errata to merit spinning up another stepping -- that's not a trivial endeavor. Other bugs can be fixed in software or firmware, but again, that isn't always worth the effort, or worse, the fix could result in lost performance, giving chipmakers another factor to weigh.

Why didn't AMD find it earlier? Well, 2.86 years is longer than the validation and qual cycles, and it isn't clear whether accelerated aging testing, which often involves running equipment at higher-than-usual temperatures over long periods to simulate the aging process, could catch the bug, either. The AMD EPYC Rome chips were released in August 2019, so perhaps some of AMD's customers have already encountered the issue the hard way: in deployment.

EPYC Rome Kicked Out of the Uptime Club

And then there are the folks who just want to join the uptime club and set a record. To do that, you have to beat the computer onboard the Voyager 2 spacecraft. Yeah, the one that was the second to enter interstellar space. That computer has been running for 16,735 days (45+ years), and counting.

For terrestrial records, 6,014 days (16 years) seems to be the record for a server, but I've seen plenty of debate over other contenders for the crown. (The small /r/uptimeporn community on Reddit has plenty of examples of extended uptimes.)

In either case, you won't get to break that type of record with any of the EPYC Rome chips -- this erratum will not be fixed, so your cores won't make it much past the 1,044-day threshold under any circumstances. AMD's note says it won't fix the issue. Perhaps the company decided the issue is too costly to fix in silicon, or a microcode/firmware fix has too much performance overhead, or maybe there simply aren't enough impacted customers to make the fix worthwhile.

Either way, disabling the server's CC6 sleep state will help you sleep at night, or you could just make sure to reboot every 1,000 days or so.
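
For admins who would rather track that reboot window than trust their memory, here is a minimal Python sketch of an uptime check. It assumes a Linux host, where /proc/uptime reports seconds since boot in its first field, and it assumes OS uptime is a reasonable stand-in for time since the last reset. The 1,000-day deadline is simply the rough figure suggested above, and the function names are hypothetical, not part of any AMD tooling.

```python
from pathlib import Path

ERRATUM_DAYS = 1044          # hang threshold quoted in AMD's revision guide
REBOOT_DEADLINE_DAYS = 1000  # assumption: the "every 1,000 days or so" margin
SECONDS_PER_DAY = 86_400


def uptime_days(proc_uptime_text: str) -> float:
    """Return uptime in days from the contents of /proc/uptime."""
    # The first whitespace-separated field is seconds since boot.
    return float(proc_uptime_text.split()[0]) / SECONDS_PER_DAY


def days_left(current_days: float, limit_days: float = REBOOT_DEADLINE_DAYS) -> float:
    """Days remaining before the reboot deadline (negative if overdue)."""
    return limit_days - current_days


if __name__ == "__main__":
    path = Path("/proc/uptime")
    if path.exists():
        days = uptime_days(path.read_text())
        remaining = days_left(days)
        status = "reboot overdue" if remaining <= 0 else "ok"
        print(f"uptime {days:.1f} days, {remaining:.1f} days to deadline: {status}")
    else:
        print("/proc/uptime not found (not a Linux host?)")
```

Run from a daily cron job or a monitoring agent, a check like this would flag a Rome server well before any core reaches the 1,044-day mark.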

Paul Alcorn
Editor-in-Chief

Paul Alcorn is the Editor-in-Chief for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.