AMD's EPYC Rome Chips Crash After 1,044 Days of Uptime
A clock timer bug brings second-gen EPYCs to a halt.
AMD's latest processor revision guide for the EPYC 7002 'Rome' server chips reveals an interesting new erratum (bug): a core on the chip can hang after 1,044 days of uptime (~2.86 years), after which you'll have to reset the server for the chip to run correctly. AMD says it will not fix the issue.
AMD's description of the issue, which impacts its second-gen EPYC processors (AMD's fourth-gen Genoa chips are the newest), is succinct, but there's a lot to unpack.
The issue stems from the core failing to exit the CC6 sleep state, but AMD says the timing of the failure could vary based on the spread spectrum and REFCLK frequency, the latter of which is the reference clock that helps the chip keep track of time.
Reddit user acid_migrain has a plausible theory about the exact timing of the core hangs, saying, "Despite what they say, the problem actually manifests at 1042 days and roughly 12 hours. The TSC ticks at 2800 MHz, and 2800 * 10**6 * 1042.5 days almost equals 0x380000000000000, which has too many zeros not to be a coincidence."
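The arithmetic in that comment is easy to check. The sketch below is only an illustration and assumes, as the quote does, a TSC that ticks at a constant 2800 MHz. The names TSC_HZ, THRESHOLD, and days_to_reach are made up for this example, and AMD's guide does not confirm that this counter value is the actual trigger.

```python
# Sanity check of the arithmetic in the quoted Reddit comment.
# Assumption: the TSC ticks at a constant 2800 MHz, the figure used in the quote.
TSC_HZ = 2800 * 10**6
SECONDS_PER_DAY = 86_400
THRESHOLD = 0x380000000000000  # counter value named in the quote


def days_to_reach(ticks: int, hz: int = TSC_HZ) -> float:
    """Days of uptime needed for a counter running at `hz` to reach `ticks`."""
    return ticks / hz / SECONDS_PER_DAY


# Ticks accumulated after exactly 1042.5 days (2085 half-days), in integer math.
ticks_at_1042_5_days = TSC_HZ * 2085 * (SECONDS_PER_DAY // 2)

# The "many zeros" observation: the value is 0x38 followed by 52 zero bits.
assert THRESHOLD == 0x38 << 52

print("ticks after 1042.5 days:", hex(ticks_at_1042_5_days))
print("value from the quote:   ", hex(THRESHOLD))
print(f"days to reach the value: {days_to_reach(THRESHOLD):.4f}")
print("seconds short of 1042.5 days:",
      (ticks_at_1042_5_days - THRESHOLD) / TSC_HZ)
```

Under that assumption the counter reaches the quoted value after about 1042.4999 days, roughly 7.5 seconds short of 1042.5 days, which is consistent with the "almost equals" in the comment. A different tick rate would move the date, which fits AMD's note that the timing depends on REFCLK frequency and spread spectrum.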
The workaround is simple -- either reboot before 1,044 days of uptime, which resets the CPU to restart your 1,044-day "timer," or disable the CC6 sleep state.
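For the reboot route, a simple uptime check can flag machines that are getting close. What follows is a minimal sketch for Linux, not a vetted monitoring tool: the 1,044-day figure comes from the article, while REBOOT_MARGIN_DAYS and the function names are hypothetical choices for illustration. One caveat: the kernel's uptime counter starts over when a new kernel is booted via kexec, so on hosts updated that way it may understate the time since the last hardware reset.

```python
from pathlib import Path

SECONDS_PER_DAY = 86_400
HANG_THRESHOLD_DAYS = 1044  # figure reported in the article
REBOOT_MARGIN_DAYS = 44     # hypothetical safety margin: plan a reboot by day 1,000


def parse_uptime_days(proc_uptime_text: str) -> float:
    """Convert the contents of /proc/uptime into days of uptime.

    The first field of /proc/uptime is the number of seconds since boot.
    """
    return float(proc_uptime_text.split()[0]) / SECONDS_PER_DAY


def reboot_due(uptime_days: float,
               threshold_days: float = HANG_THRESHOLD_DAYS,
               margin_days: float = REBOOT_MARGIN_DAYS) -> bool:
    """True once uptime has entered the safety margin before the threshold."""
    return uptime_days >= threshold_days - margin_days


if __name__ == "__main__":
    proc_uptime = Path("/proc/uptime")
    if proc_uptime.exists():
        days = parse_uptime_days(proc_uptime.read_text())
        status = "schedule a reboot" if reboot_due(days) else "ok"
        print(f"uptime: {days:.1f} days ({status})")
    else:
        print("/proc/uptime not found; this sketch assumes Linux")
```

Run from cron or a monitoring agent, a check like this turns the workaround into a scheduled maintenance task rather than a surprise.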
Now, while this roughly 2.86-year core-hang bug is interesting, the question is whether it really matters. It does, even though security updates and maintenance should be done at much, much shorter intervals.
The most realistic scenario involves servers that are updated with Linux live patching, or with kexec, which boots a new kernel without a full hardware reset. Either approach could certainly lead to the type of extended uptime that would trigger the bug. Also, servers for mission-critical applications often see extended uptime.
While this bug is interesting, it isn't a showstopper for the majority of users, and errata in chips are definitely not unusual. Modern CPUs are the most complex devices constructed by humankind, and they almost always come to market with numerous errata/bugs discovered either during or after the chips reach their final shipping revision (stepping). Here's a bit more about that.
Chip Errata Are Common, But Not Great
With billions of transistors in play, issues are inevitable: It isn't uncommon for a chip to have a thousand or more errata/bugs that are corrected in newer steppings of the chip or with firmware tweaks before launch. These errata can encompass all types of bugs, from security holes to malfunctioning flags and cache tags that don't operate correctly, and the chipmakers do their best to stomp them out before launch.
However, some errata always remain, even in shipping chips. For instance, Intel's 8th-gen chips have more than 150 listed errata that remain unfixed, and those chips launched in 2017. We don't know how many errata the Rome chips have had because AMD has removed the listings for errata that have been solved. However, we do know that 39 errata remain, which actually doesn't seem too bad against the Intel backdrop.
Some errata are left unrepaired simply because they pose no harm. Critical errata that could leave an attack vector open get fixed, but some functionality-related errata are simply never patched. The chipmaker weighs factors such as the severity of the erratum, the ease of fixing it, and whether there are enough outstanding errata to merit spinning up another stepping, which is not a trivial endeavor. Other bugs can be addressed with software or firmware fixes, but again, that isn't always worth the effort, or worse, the fix could result in lost performance, giving chipmakers another factor to weigh.
Why didn't AMD find it earlier? Well, roughly 2.86 years is longer than the validation and qualification cycles, and it isn't clear whether accelerated aging testing, which often involves running the equipment at higher-than-usual temperatures over long periods of time to simulate the aging process, could catch the bug, either. The AMD EPYC Rome chips launched in August 2019, so perhaps some of AMD's customers have already encountered the issue the hard way, in deployment.
EPYC Rome Kicked Out of the Uptime Club
And then there are the folks that just want to join the uptime club and set a record. To do that, you have to beat the computer onboard the Voyager 2 spacecraft. Yeah, the one that was the second to enter interstellar space. That computer has been running for 16,735 days (45+ years), and counting.
For terrestrial records, 6,014 days (16 years) seems to be the record for a server, but I've seen plenty of debate over other contenders for the crown. (The small /r/uptimeporn Reddit community has plenty of examples of extended uptimes.)
Either way, you won't get to break that type of record with any of the EPYC Rome chips. This erratum will not be fixed, so with CC6 enabled you can't count on all of your cores running much past the 1,044-day threshold. AMD's note says it won't fix the issue. Perhaps the company decided the issue is too costly to fix in silicon, or a microcode/firmware fix has too much performance overhead, or maybe there simply aren't enough impacted customers to make the fix worthwhile.
In either case, disabling the server's CC6 sleep state will help you sleep at night, or you could just make sure to reboot every 1,000 days or so.
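If you go the CC6 route, it helps to first see which idle states the operating system actually exposes. The sketch below only reads the standard Linux cpuidle sysfs files (name, desc, disable) for CPU 0 and changes nothing. Which entry, if any, corresponds to CC6 depends on the platform and idle driver and is not something this article specifies, so treat the output as a starting point and follow AMD's revision guide or your BIOS vendor for the actual workaround. The function name list_idle_states is made up for this example.

```python
from pathlib import Path

# Standard Linux cpuidle sysfs location for CPU 0; other CPUs use cpu1, cpu2, ...
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def _read(path: Path) -> str:
    """Return the stripped contents of a sysfs file, or '' if it is missing."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def list_idle_states(cpuidle_dir: Path = CPUIDLE_DIR) -> list:
    """Return name, description, and disable flag for each idle state found."""
    if not cpuidle_dir.is_dir():
        return []
    state_dirs = sorted(cpuidle_dir.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    return [
        {
            "state": d.name,
            "name": _read(d / "name"),
            "desc": _read(d / "desc"),
            "disabled": _read(d / "disable") == "1",
        }
        for d in state_dirs
    ]


if __name__ == "__main__":
    states = list_idle_states()
    if not states:
        print("no cpuidle states found (not Linux, or cpuidle is not enabled)")
    for s in states:
        flag = "disabled" if s["disabled"] else "enabled"
        print(f'{s["state"]}: {s["name"]} ({s["desc"]}) [{flag}]')
```

Keeping the sketch read-only is deliberate: disabling a deep idle state raises idle power draw, so it is a change worth making knowingly rather than from a copied script.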
Paul Alcorn is the Managing Editor: News and Emerging Tech for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.
usertests Kinda bad. I think Epyc/server users are more likely to want and actually achieve 3-yr/indefinite uptime than a Ryzen desktop user.
Friesiansam Far worse things to worry about, than a minor impact bug that can be prevented with a triyearly restart.
/=^>3-O-B= You misspelled error, duh.
To flex in Latin, it's called an "erratum"; "errata" is the plural.
-Fran- I don't know of any server type and ops team in which they'd allow machines to spend over a year without a restart. While not all, there's a few patching cycles that force you to restart machines and if you're not patching kernel and not restarting them, then you're doing OPs wrong IMO.
Even for critical infrastructure, you plan for such scenarios.
This being said, I don't know 100% of the industry and there may be cases where they do have a valid use case for a machine to be always on from the moment it's put in service, but I just don't know of any or even could rationalize that being the case.
Anyway, the bug sounds funny and easily avoidable.
Regards.
kaalus It's a massive bug. Not every server is security-critical and has to be updated regularly. I have Linux servers which have been running for 6+ years straight, without any changes. And I don't intend to change anything in the foreseeable future. They work perfectly well, and they will continue to do so. I would be mightily angry if it turned out I have to reboot them every 3 years.
johnrock2 I guess I could see instances when you might want the server to never restart but honestly it seems like a good idea to restart periodically anyway, not just to avoid weird timer bugs. I wouldn't patch this either if I was AMD, not a big issue and I think they don't even sell this CPU anymore.
PaulAlcorn
The Historical Fidelity said: I’m getting Y2K vibes with this errata lol
Same! It was the first thing I thought of, actually :)
Co BIY I agree that this is more a curiosity than a major issue. Nice that it can be fixed by a first line maintainer without any knowledge of the situation. "Did you try restarting it ?"
The fact that they know about it probably means someone ran up against the limit and cared enough to ask AMD to look into it.
I've never maintained any major piece of datacenter equipment but I can totally see why some would not want to shut down and restart something that is working just fine right now. It's another chance to introduce a problem.