AMD's EPYC Rome Chips Crash After 1,044 Days of Uptime

Ryzen die
(Image credit: Fritchenz Frenz)

AMD's latest processor revision guide for the EPYC 7002 'Rome' server chips reveals an interesting new bug (an erratum) that can cause a core on the chip to hang after 1,044 days of uptime (~2.86 years), after which you'll have to reset the server for the chip to run correctly. AMD says it will not fix the issue.

AMD's description of the issue, which impacts its second-gen EPYC processors (AMD's fourth-gen Genoa chips are the newest), is succinct, but there's a lot to unpack.

AMD

(Image credit: AMD)

Chip Errata Are Common, But Not Great

With billions of transistors in play, issues are inevitable: It isn't uncommon for a chip to have a thousand or more errata/bugs that are corrected in newer steppings of the chip or with firmware tweaks before launch. These errata can encompass all types of bugs, from security holes to malfunctioning flags and cache tags that don't operate correctly, and the chipmakers do their best to stomp them out before launch.

However, some errata always remain, even in shipping chips. For instance, Intel's 8th-gen chips, which launched in 2017, still have more than 150 listed errata. We don't know how many errata the Rome chips have had in total because AMD has removed the listings for errata that have been resolved. However, we do know that 39 errata remain, which actually doesn't seem too bad against the Intel backdrop.

Some errata are left unrepaired simply because they pose no harm. Setting aside critical errata that could leave an attack vector open, some functionality-related errata are also never patched. The chipmaker weighs factors such as the severity of the erratum, the ease of fixing it, and whether there are enough outstanding errata to merit spinning up another stepping, which is not a trivial endeavor. Other bugs can be addressed with software or firmware fixes, but again, that isn't always worth the effort, or worse, the fix could result in lost performance, giving chipmakers another factor to weigh.

Why didn't AMD find it earlier? Well, 2.86 years is longer than the validation and qualification cycles, and it isn't clear whether accelerated aging testing, which often involves running the equipment at higher-than-usual temperatures over long periods to simulate the aging process, could catch the bug either. The AMD EPYC Rome chips were released in August 2019, so perhaps some of AMD's customers have already encountered the issue the hard way: in deployment.

EPYC Rome Kicked Out of the Uptime Club

And then there are the folks that just want to join the uptime club and set a record. To do that, you have to beat the computer onboard the Voyager 2 spacecraft. Yeah, the one that was the second to enter interstellar space. That computer has been running for 16,735 days (45+ years), and counting.

For terrestrial records, 6,014 days (16 years) seems to be the record for a server, but I've seen plenty of debate over other contenders for the crown. (The small /r/uptimeporn Reddit community has plenty of examples of extended uptimes.)
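As a quick sanity check, the day counts quoted in this article can be converted to years with a few lines of Python. This is only a minimal sketch: the 365.25-day average year and the helper name `days_to_years` are assumptions made for illustration, not anything taken from AMD's revision guide.

```python
# Average length of a calendar year including leap days (an assumption
# chosen for this illustration).
DAYS_PER_YEAR = 365.25


def days_to_years(days: float) -> float:
    """Convert a day count to years, rounded to two decimal places."""
    return round(days / DAYS_PER_YEAR, 2)


# Day counts quoted in the article.
uptimes = {
    "EPYC Rome hang threshold": 1_044,
    "Voyager 2 computer": 16_735,
    "Terrestrial server record": 6_014,
}

for label, days in uptimes.items():
    print(f"{label}: {days:,} days is about {days_to_years(days)} years")
```

Under that assumption, the Rome threshold works out to just under three years, while the Voyager 2 figure sits in the mid-40s.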

Regardless of which record you're chasing, you won't get to break it with any of the EPYC Rome chips. This erratum will not be fixed, so you can't count on all of your cores staying up much past the 1,044-day threshold. AMD's note says it won't fix the issue. Perhaps the company decided the issue is too costly to fix in silicon, or a microcode/firmware fix has too much performance overhead, or maybe there simply aren't enough impacted customers to make the fix worthwhile.

In either case, disabling the server's CC6 sleep state will help you sleep at night, or you could just make sure to reboot every 1,000 days or so. 
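For admins who would rather schedule the reboot than be surprised by it, a check along these lines could run from a monitoring job. Treat it as a rough sketch under stated assumptions: it reads the Linux `/proc/uptime` file, it treats time since OS boot as a stand-in for time since the last system reset, and the 44-day safety margin and all function names are hypothetical choices made for this example.

```python
from pathlib import Path

ERRATUM_LIMIT_DAYS = 1044  # threshold cited in the article
REBOOT_MARGIN_DAYS = 44    # hypothetical margin: flag the host from day 1,000
SECONDS_PER_DAY = 86_400


def days_until_limit(uptime_seconds: float,
                     limit_days: float = ERRATUM_LIMIT_DAYS) -> float:
    """Return how many days remain before uptime reaches the limit."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot(uptime_seconds: float,
                 margin_days: float = REBOOT_MARGIN_DAYS) -> bool:
    """True once the remaining time is inside the safety margin."""
    return days_until_limit(uptime_seconds) <= margin_days


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read seconds since boot; the first field of /proc/uptime on Linux."""
    return float(Path(path).read_text().split()[0])
```

On a Linux host, `needs_reboot(read_uptime_seconds())` would return `True` once the machine has been up for roughly 1,000 days, which leaves time to plan a maintenance window.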

Paul Alcorn
Editor-in-Chief

Paul Alcorn is the Editor-in-Chief for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.

  • The Historical Fidelity
    I’m getting Y2K vibes with this errata lol
  • AtomicRobotMan0101
    !RemindMe 1043 days
  • usertests
    Kinda bad. I think Epyc/server users are more likely to want and actually achieve 3-yr/indefinite uptime than a Ryzen desktop user.
  • Friesiansam
    Far worse things to worry about, than a minor impact bug that can be prevented with a triyearly restart.
  • /=^>3-O-B=
    You misspelled error, duh.
    To flex in Latin, it's called an "erratum"; "errata" is the plural.
  • -Fran-
    I don't know of any server type and ops team in which they'd allow machines to spend over a year without a restart. While not all, there are a few patching cycles that force you to restart machines, and if you're not patching the kernel and not restarting them, then you're doing ops wrong IMO.

    Even for critical infrastructure, you plan for such scenarios.

    This being said, I don't know 100% of the industry and there may be cases where they do have a valid use case for a machine to be always on from the moment it's put in service, but I just don't know of any or even could rationalize that being the case.

    Anyway, the bug sounds funny and easily avoidable.

    Regards.
  • kaalus
    It's a massive bug. Not every server is security-critical and has to be updated regularly. I have Linux servers which have been running for 6+ years straight, without any changes. And I don't intend to change anything in the foreseeable future. They work perfectly well, and they will continue to do so. I would be mightily angry if it turned out I have to reboot them every 3 years.
  • johnrock2
    I guess I could see instances when you might want the server to never restart but honestly it seems like a good idea to restart periodically anyway, not just to avoid weird timer bugs. I wouldn't patch this either if I was AMD, not a big issue and I think they don't even sell this CPU anymore.
  • Paul Alcorn
    The Historical Fidelity said:
    I’m getting Y2K vibes with this errata lol
    Same! It was the first thing I thought of, actually :)
  • Co BIY
    I agree that this is more a curiosity than a major issue. Nice that it can be fixed by a first-line maintainer without any knowledge of the situation. "Did you try restarting it?"

    The fact that they know about it probably means someone ran up against the limit and cared enough to ask AMD to look into it.

    I've never maintained any major piece of datacenter equipment but I can totally see why some would not want to shut down and restart something that is working just fine right now. It's another chance to introduce a problem.