AMD's latest processor revision guide for the EPYC 7002 'Rome' server chips reveals an interesting new bug (erratum) that can cause a core on the chip to hang after 1,044 days of uptime (~2.86 years), after which you'll have to reset the server for the chip to run correctly. AMD says it will not fix the issue.
AMD's description of the issue, which impacts its second-gen EPYC processors (AMD's fourth-gen Genoa chips are the newest), is succinct, but there's a lot to unpack.
The issue stems from the core failing to exit the CC6 sleep state, but AMD says the timing of the failure could vary based on the spread spectrum and REFCLK frequency, the latter of which is the reference clock that helps the chip keep track of time.
Reddit user acid_migrain has a plausible theory about the exact timing of the core hangs, saying, "Despite what they say, the problem actually manifests at 1042 days and roughly 12 hours. The TSC ticks at 2800 MHz, and 2800 * 10**6 * 1042.5 days almost equals 0x380000000000000, which has too many zeros not to be a coincidence."
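The arithmetic in that quote is easy to check. The sketch below is only a sanity check of the Reddit theory, not AMD's confirmed explanation; it assumes, as the quote does, a TSC that ticks at a constant 2800 MHz, and the function names are made up for illustration.

```python
# Sanity-check the arithmetic in the quoted theory.
# Assumption (from the quote, not from AMD): the TSC ticks at a constant 2800 MHz.
TSC_HZ = 2800 * 10**6
SECONDS_PER_DAY = 86_400
TARGET = 0x380000000000000  # the tick count named in the quote


def ticks_after(days: float, hz: int = TSC_HZ) -> int:
    """Number of TSC ticks accumulated after `days` of uptime."""
    return round(days * SECONDS_PER_DAY * hz)


def days_to_reach(ticks: int, hz: int = TSC_HZ) -> float:
    """Days of uptime needed for the TSC to reach `ticks`."""
    return ticks / hz / SECONDS_PER_DAY


def trailing_zero_bits(value: int) -> int:
    """Count the zero bits at the low end of a positive integer."""
    return (value & -value).bit_length() - 1


if __name__ == "__main__":
    print(f"ticks after 1042.5 days: {ticks_after(1042.5):,}")
    print(f"target from the quote:   {TARGET:,}")
    print(f"days to reach target:    {days_to_reach(TARGET):.4f}")
    print(f"trailing zero bits:      {trailing_zero_bits(TARGET)}")
```

Under that assumption, the two tick counts differ by only about 7.5 seconds' worth of ticks, and the target value ends in 55 zero bits, which is what makes the theory look plausible.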
The workaround is simple -- either reboot before 1,044 days of uptime, which resets the CPU to restart your 1,044-day "timer," or disable the CC6 sleep state.
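For the reboot route, a small uptime check is enough to flag machines getting close to the limit. This is a minimal sketch for Linux, which reports seconds since boot as the first field of `/proc/uptime`. The 44-day safety margin and the function names are hypothetical choices for illustration. It also assumes the kernel's uptime matches the time since the last hardware reset, which would not hold after a kexec.

```python
from pathlib import Path

ERRATUM_DAYS = 1044      # uptime figure reported from AMD's revision guide
SAFETY_MARGIN_DAYS = 44  # hypothetical margin: plan the reboot by about day 1,000
SECONDS_PER_DAY = 86_400


def parse_uptime_days(proc_uptime_text: str) -> float:
    """Convert the contents of /proc/uptime to days (first field is seconds)."""
    return float(proc_uptime_text.split()[0]) / SECONDS_PER_DAY


def days_left(uptime_days: float,
              limit: float = ERRATUM_DAYS,
              margin: float = SAFETY_MARGIN_DAYS) -> float:
    """Days remaining before the planned reboot deadline (limit minus margin)."""
    return (limit - margin) - uptime_days


def needs_reboot(uptime_days: float) -> bool:
    """True once uptime has reached the planned reboot deadline."""
    return days_left(uptime_days) <= 0


if __name__ == "__main__":
    # Note: after a kexec the kernel's uptime restarts even though the CPU
    # was never reset, so this check would undercount in that case.
    uptime = parse_uptime_days(Path("/proc/uptime").read_text())
    if needs_reboot(uptime):
        print(f"Uptime is {uptime:.1f} days: schedule a reboot now.")
    else:
        print(f"Uptime is {uptime:.1f} days: {days_left(uptime):.1f} days of margin left.")
```

Run from cron or a monitoring agent, a check like this turns the "reboot every 1,000 days or so" advice into an alert rather than something to remember.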
Now, while this 2.86-year core-hang bug is interesting, the question is whether it really matters. It does, even though security updates and maintenance should be done at much shorter intervals.
The most realistic scenario involves those who use the Linux live patching feature or kexec to update without rebooting -- that could certainly lead to the type of extended uptime that would trigger the bug. Also, servers for mission-critical applications often see extended uptime.
While this bug is interesting, it isn't a showstopper for the majority of users, and errata in chips are definitely not unusual. Modern CPUs are among the most complex devices constructed by humankind, and they almost always come to market with numerous errata/bugs discovered either before or after the chips reach their final shipping revision (stepping). Here's a bit more about that.
Chip Errata Are Common, But Not Great
With billions of transistors in play, issues are inevitable: It isn't uncommon for a chip to have a thousand or more errata/bugs that are corrected in newer steppings of the chip or with firmware tweaks before launch. These errata can encompass all types of bugs, from security holes to malfunctioning flags and cache tags that don't operate correctly, and the chipmakers do their best to stomp them out before launch.
However, some errata always remain, even in shipping chips. For instance, Intel's 8th-gen chips have more than 150 listed errata that still remain, and those chips launched in 2017. We don't know how many errata the Rome chips have had because AMD has removed the listings for errata that have been solved. However, we do know that 39 errata remain, which actually doesn't seem too bad against the Intel backdrop.
Some errata are left unrepaired simply because they pose no harm, but aside from critical errata that could leave an attack vector open, some functionality-related errata are simply never patched. The chipmaker weighs factors such as the severity of the errata, the ease of fixing the issue, and whether there are even enough errata to merit spinning up another stepping -- that's not a trivial endeavor. Other bugs can be fixed with software or firmware fixes, but again, that isn't always worth the effort, or worse, the fix could result in lost performance, giving chipmakers another factor to weigh.
Why didn't AMD find it earlier? Well, 2.86 years is longer than the validation and qualification cycles, and it isn't clear if accelerated aging testing, which often involves testing the equipment at higher-than-usual temps over long periods of time to simulate the aging process, could catch the bug, either. The AMD EPYC Rome chips launched in August 2019, so perhaps some of AMD's customers have already encountered the issue the hard way: in deployment.
EPYC Rome Kicked Out of the Uptime Club
And then there are the folks who just want to join the uptime club and set a record. To do that, you have to beat the computer onboard the Voyager 2 spacecraft. Yeah, the one that was the second to enter interstellar space. That computer has been running for 16,735 days (45+ years), and counting.
For terrestrial records, 6,014 days (16 years) seems to be the record for a server, but I've seen plenty of debate over other contenders for the crown. (The small /r/uptimeporn Reddit community has plenty of examples of extended uptimes.)
In either case, you won't get to break that type of record with any of the EPYC Rome chips -- this erratum will not be fixed, so your cores won't make it much past the 1,044-day threshold under any circumstances. AMD's note says it won't fix the issue. Perhaps the company decided the issue is too costly to fix in silicon, or a microcode/firmware fix has too much performance overhead, or maybe there simply aren't enough impacted customers to make the fix worthwhile.
Either way, disabling the server's CC6 sleep state will help you sleep at night, or you could just make sure to reboot every 1,000 days or so.