Too Hot to Last? Investigating Intel's Claims About Ryzen Reliability

(Image credit: Tom's Hardware)

AMD's Ryzen 3000-Series processors landed two months ago, bringing with them an incredible increase in real-world performance and upsetting the pricing paradigm with an impressive increase in performance-per-dollar, but the launch has been marred by reports that many users aren’t receiving the rated boost speeds. AMD announced this week that it had identified an issue with its firmware that reduces performance in some situations and that it would update the community on an incoming fix on September 10. 

As we often see in marketing, Intel has chosen to attack during AMD's perceived time of weakness. At the IFA tradeshow this week, Intel presented a slide deck to members of the press that includes information from a recent survey conducted by YouTuber Der8auer in which a surprising number of respondents reported they have been unable to reach the rated boost frequencies with their Ryzen 3000 processors.

Interestingly, Intel then pressed the issue further, citing a report that claims reliability concerns are the apparent, but unproven, reason AMD reduced its chips' frequencies.

We were already investigating the claims Intel cited in regards to the relationship between Ryzen's clock frequencies and longevity, and we had secured comment from AMD before its admission that there was an issue with its firmware. Today we'll present some of the testing we conducted to investigate those claims.

Ryzen 3000 Longevity Concerns

We've already dived into the boost behavior of the Ryzen 3000 processors, making several key discoveries along the way that have been confirmed by AMD. The most important finding is that Ryzen 3000 series processors come with a mix of fast and slow cores, meaning not all cores are capable of hitting the rated single-threaded boost frequencies. We also learned that the new Ryzen-aware Windows 10 scheduler targets the fastest cores with lightly-threaded workloads.

(Image credit: Intel)

We chose to look into the matter further based on a comment made by legendary overclocker and Asus engineer Shamino on the Overclock.net forums, which is the same comment that spurred the article Intel cited in the slide above. Shamino stated that AMD had dialed back the boost frequencies to bring its long-term reliability metrics more into line with the company's expectations.

"every new bios i get asked the boost question all over again, i have not tested a newer version of AGESA that changes the current state of 1003 boost, not even 1004. if i do know of changes, i will specifically state this. They were being too aggressive with the boost previously, the current boost behavior is more in line with their confidence in long term reliability and i have not heard of any changes to this stance, tho i have heard of a 'more customizable' version in the future"

It's noteworthy that Shamino made the statement from his private forum account, so we can't take it as a definitive statement from AMD, or Asus for that matter, on whether or not the company reduced the boost frequencies to extend the longevity of its chips. This is likely his opinion. Shamino also claimed that AMD might have a 'more customizable' version of its boost mechanisms in the future, but how that changes the already-existing settings is unclear.

However, Shamino's comments are even more interesting in light of an earlier post by The Stilt, a well-known hardware reviewer and member of the enthusiast community who works for an unidentified motherboard vendor.

[...] The original limits for Ryzen 3000 SKUs were:
- 3600 = 4100MHz (80-95°C) / 4200MHz (< 80°C)
- 3600X = 4200MHz (80-95°C) / 4400MHz (< 80°C)
- 3700X = 4200MHz (80-95°C) / 4400MHz (< 80°C)
- 3800X = 4300MHz (80-95°C) / 4550MHz (< 80°C)
- 3900X = 4400MHz (80-95°C) / 4650MHz (< 80°C)
Since then, it appears that the HighTemperature limit has been reduced further to 75°C (from 80°C). New SMUs also have introduced "MiddleTemperature" limit, but that gets disabled when PBO is enabled. HWInfo is also able to display these limits (fused values).

The Stilt claimed that AMD had lowered the temperature threshold for boost activity from 80C to 75C, which reins in boost activity once the chip reaches higher temperatures. This is an important distinction due to the nature of chip aging, which we'll do our best to simplify.
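To make the effect of that threshold shift concrete, here is a minimal Python sketch that looks up the clock ceiling from The Stilt's quoted figures. The function name, the hard step at the threshold, and the error for temperatures above 95C are our own assumptions for illustration; AMD's actual SMU logic is not public and is almost certainly more nuanced.

```python
# Boost limits as quoted from The Stilt's post, in MHz:
# (limit at or above the high-temperature threshold, limit below it).
ORIGINAL_LIMITS_MHZ = {
    "3600": (4100, 4200),
    "3600X": (4200, 4400),
    "3700X": (4200, 4400),
    "3800X": (4300, 4550),
    "3900X": (4400, 4650),
}

MAX_TEMP_C = 95  # upper end of the quoted 80-95C range


def boost_limit_mhz(sku, temp_c, high_temp_c=80):
    """Return the quoted clock ceiling for a SKU at a given temperature.

    Assumption: a hard step at high_temp_c. The real SMU behavior is not
    documented publicly, so treat this as an illustration only.
    """
    hot_limit, cool_limit = ORIGINAL_LIMITS_MHZ[sku]
    if temp_c > MAX_TEMP_C:
        # The quoted table says nothing about temperatures above 95C.
        raise ValueError("temperature is outside the quoted range")
    return cool_limit if temp_c < high_temp_c else hot_limit


# A 3700X at 78C keeps the higher ceiling with an 80C threshold,
# but falls to the lower ceiling if the threshold moves to 75C.
print(boost_limit_mhz("3700X", 78, high_temp_c=80))
print(boost_limit_mhz("3700X", 78, high_temp_c=75))
```

In this simplified model, moving the threshold from 80C to 75C costs a 3700X its higher ceiling for any workload that settles between those two temperatures.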

Given enough time, even the largest mountains in the world will erode. Zoom in on the almost unimaginably-small nanometer-scale transistors inside of a CPU that rapidly switch on and off billions of times per second (at 5.0 GHz the transistors switch at a rate of 5 billion cycles per second), and it's easy to understand that these will also erode, even under optimal operating conditions. This also includes the incredibly tiny interconnects (wires) that connect the transistors.

Some factors, such as higher current and thermal density, increase the rate of wear and accelerate electromigration (the gradual displacement of metal atoms in the chip's wiring caused by the electrons flowing through it). Because increasing frequency requires pumping more power through the chip, thus generating more heat, higher frequencies typically result in faster aging, and thus a shorter life span. These problems become more pronounced with smaller feature sizes, such as when transistors become smaller inside modern chips (like AMD's shrink to a 7nm process and Intel's shrink to 10nm), simply because the chip is pushing more current through smaller transistors and interconnects.

So, like the carton of milk in your refrigerator, your chip has an expiration date. It's the job of smart semiconductor engineers to predict that expiration date and control it with some accuracy, which is a difficult proposition given the unique characteristics of each and every individual piece of silicon that comes out of the fab. Given that switching the transistors at higher frequencies and higher temperatures increases the rate of wear on them and the surrounding structures, this is one of the primary levers that engineers pull to control the lifespan of your chip.

In short, reducing frequency can slow the aging process, thus increasing longevity.

That means Shamino's assumption that AMD's frequency reductions in newer BIOS versions are related to longevity is a logical conclusion. It also means The Stilt's claim that temperature thresholds have been changed to reduce frequency at higher temperatures is incredibly relevant, as that would also reduce the rate of wear during the times when the processor is most vulnerable to aging (higher heat and power). However, without proper explanation from AMD we don't know if that is the primary factor behind the changes, one facet of a much more complicated equation, or if it has nothing to do with the situation at all.

We reached out to AMD for comment on August 27, 2019, asking if AMD had reduced Ryzen 3000's boost frequencies in newer motherboard firmwares to meet longevity requirements. AMD provided us with the following official statement on August 28, 2019:

AMD is committed to providing improvements and optimizations in AGESA updates as we have done with all of our processors. At present, AGESA 1003ABB is the latest release to customers and partners. As shared before, we will provide updates on enhancements when available in future AGESA versions.

Remember, we received this statement before AMD revealed that it had identified an issue in its firmware, and also before Intel quoted the reporting in its slide deck. It's noteworthy that AMD's statement doesn't address our question in any fashion.

Having reached a dead-end in our attempts to get an official explanation from AMD about the matter, the only thing left to do was to test to see if we can spot a change to the temperature thresholds in the various firmware revisions. We can't verify the impact on reliability, as CPUs don't have a wearout indicator like we see with SSDs. However, the most logical starting point is to determine if there were intentional changes to boost behavior.

Ryzen 3000 Boost Frequency Temperature Limits by BIOS and SMU Version

Again, these tests are strictly limited to ascertaining if there has been a meaningful change to the boost temperature thresholds of the Ryzen processors.

The first step is to see if we can see changes to the temperature setting. To do this, we used HWInfo to check the CPU's High Temperature Clock Limit. As you can see in the graphic, this threshold is listed at 80C, not 75C. We tested multiple versions of motherboard firmwares on multiple motherboards, and with multiple chips, and every configuration exposed the same 80C temperature limit.

That's because this is a fused value, programmed into the chip during manufacturing. These readings prove that AMD, at least at some point, intended for the chip to continue to boost within normal parameters until it reaches 80C. However, they don't reveal the setting the chip actually uses.

The system management unit (SMU), a small controller inside the processor, can override several of the chip's parameters, including the high-temperature limit. But those override values aren't exposed to the user or monitoring tools, meaning that we can't see the actual values in use. The SMU is updated with different versions that are installed during the BIOS update process, and AMD has issued several SMU versions both before and after the Ryzen 3000 launch. The latest versions purportedly feature the changed temperature limits.

Because the temperature threshold changes remain hidden behind the SMU black box, we'll have to conduct a few experiments to determine whether AMD made adjustments.

(Image credit: Gigabyte)

We chose to use Gigabyte's X570 Aorus Master for our testing. Unlike many of the other motherboard vendors, Gigabyte still hosts its NPRP (AMD approved) BIOS that was distributed to reviewers prior to launch. The company also still has the original BIOS release posted to its site, as well as the latest revision. The company also recently posted a beta BIOS version to Overclock.net, which we'll also test.

The maximum 4.4 GHz boost frequency of the Ryzen 7 3700X appeared fleetingly with the F4 BIOS, but these boosts occurred for such a short period of time that they were hardly meaningful. Instead, the chip most often peaked at 4.375 GHz across all BIOS revisions.

Gigabyte BIOS Revision | Max Freq. Recorded | SMU Version | AGESA
F4 | 4.4 GHz | 46.32.0 | 1.0.0.3
N11 (Reviewers BIOS) | 4.375 GHz | 46.37.0 | 1.0.0.2CA
F5 | 4.375 GHz | 46.40.0 | 1.0.0.3ABB
F5P | 4.375 GHz | 46.40.0 | 1.0.0.3ABB

Sources close to the matter tell us that the SMU revisions with the 80C limit were never made public, including in the BIOS versions provided to reviewers. However, after adjusting the temperature threshold to 75C with 46.37.0, AMD continued to adjust the temperature limits in SMU versions 46.38.0 and 46.39.0, slowly dialing back the limits, and those changes should carry over to the latest 46.40.0, as well.

It's noteworthy that test results could vary with other motherboards and boost behavior could vary based on the quality of your chip. However, these results with our AMD-provided sample should be sufficient for our purposes.

Testing Ryzen 3000 Boost Frequency Temperature Limits

The test itself is simple. We began the tests with all fans and the pump on a Corsair H115i running at full speed. We then kicked off a single-threaded Cinebench test to expose the maximum boost attainable with our Ryzen 7 3700X sample. We allowed this test to run for 60 seconds so the chip could settle into its 'natural' state during the workload, then unplugged the fans and pump and allowed the chip's temperature to rise to 95C. This is the maximum temperature rating of AMD's 7nm processors, which is a lowered range compared to the previous-gen's 100C. That means AMD doesn't have as much thermal headroom to play with. 

This testing technique allows us to record changes to the frequency as temperature increases. We did our best to isolate all parameters, and due to the nature of this testing, the cooling solution and ambient temps aren't a factor. We followed the general testing methodology we outlined in our previous piece (bottom of page 1).
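For readers who want to repeat this kind of analysis on their own logs, the idea of tracking clocks as temperature rises can be sketched in a few lines of Python. This is a rough illustration, not the tooling we used: the function name, the log layout, and the sample values are all hypothetical, and the 75C and 80C band edges simply mirror the thresholds under discussion.

```python
def peak_clock_by_band(samples, band_edges_c=(75, 80)):
    """Return the highest per-core clock (MHz) seen in each temperature band.

    samples: iterable of (temp_c, core_clocks_mhz) tuples, one per log row.
    With the default edges the bands are "<75", "75-80" and ">=80".
    """
    low, high = band_edges_c
    peaks = {f"<{low}": None, f"{low}-{high}": None, f">={high}": None}
    for temp_c, core_clocks_mhz in samples:
        if temp_c < low:
            band = f"<{low}"
        elif temp_c < high:
            band = f"{low}-{high}"
        else:
            band = f">={high}"
        fastest = max(core_clocks_mhz)  # the fastest core sets the peak
        if peaks[band] is None or fastest > peaks[band]:
            peaks[band] = fastest
    return peaks


# Hypothetical log rows for illustration only, NOT measured data.
example_log = [
    (68.0, [4375, 4300, 4200, 4200]),
    (74.5, [4350, 4300, 4200, 4200]),
    (76.0, [4225, 4200, 4200, 4175]),
    (79.0, [4250, 4200, 4200, 4175]),
    (83.0, [4200, 4200, 4175, 4175]),
]
print(peak_clock_by_band(example_log))
```

Comparing the per-band peaks from one BIOS revision to another is a quick way to see whether the downshift happens near 75C or near 80C.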

Here we have the results of our tests with the F4 BIOS. This is the original Ryzen 3000-enabling BIOS made available to the public before launch. You'll notice that we're zooming in on the results of the test, cutting off the first 100 seconds of the run and constraining both the frequency and temperature axes. We aren't fans of using non-zero axes because they can exaggerate performance deltas, but we need to closely examine the relatively small variances in clock rate during a very specific portion of the test. We will include full-length plots of these same tests, but with zeroed axes, later in the article.

We have plotted the frequency of all eight cores on the left axis, while the temperature is plotted on the right. The temperature reading is the rising red line, and we've added markers to note where the temperature first reaches 75C and 80C, which are the focus areas. This BIOS comes with AGESA version 1.0.0.3 and SMU version 46.32.0. Here we can see that temperature fluctuates for a short time after reaching 75C, but the chip downshifts to ~4.125 GHz once it exceeds 75C for a duration longer than two seconds.
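Because the temperature hovers around the threshold before settling above it, the useful marker is the point where it has stayed over the limit for a sustained period. A minimal Python sketch of that check follows; the function name, the sample layout, and the readings are hypothetical, and the two-second window is taken from the observation above rather than from any AMD documentation.

```python
def first_sustained_crossing(samples, threshold_c, min_duration_s=2.0):
    """Return the first time (s) at which the temperature has stayed above
    threshold_c for longer than min_duration_s, or None if it never does.

    samples: list of (time_s, temp_c) tuples sorted by time.
    """
    run_start = None  # time at which the current above-threshold run began
    for time_s, temp_c in samples:
        if temp_c > threshold_c:
            if run_start is None:
                run_start = time_s
            if time_s - run_start > min_duration_s:
                return time_s
        else:
            run_start = None  # dipped back under the limit, start over
    return None


# Hypothetical readings for illustration only, NOT measured data.
readings = [
    (100, 74.8), (101, 75.2), (102, 74.9), (103, 75.3),
    (104, 75.6), (105, 76.0), (106, 76.4),
]
print(first_sustained_crossing(readings, threshold_c=75))
```

The brief excursion above 75C at the 101-second mark is ignored because the temperature dips back under the limit one sample later.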

AMD has thousands of temperature sensors spread across the compute die, so we aren't seeing recordings of localized hot spots that may also impact boosting behavior. There is also some run-to-run variation between separate tests, but this performance trend is persistent.

Here we see four runs of the N11 NPRP (AMD approved) 'reviewer BIOS' provided to reviewers for their Ryzen 3000 testing. This BIOS features 1.0.0.2 CA AGESA code (yes, an older AGESA than the initial BIOS) paired with the 46.37.0 SMU revision. Motherboard vendors recommended these BIOS versions to reviewers for their testing, but most outlets, like ourselves, chose to go with the newer AGESA 1.0.0.3 BIOS versions made available in the waning hours of the pre-launch window.

It's noteworthy that Gigabyte specifies the SMU version with its publicly-posted N11 NPRP BIOS, but just for the sake of being thorough, we tested with both the publicly-posted BIOS (bottom two results) and the BIOS included in the reviewer download folder hosted by AMD (top two test runs) that doesn't have a listed SMU version. We do see one run with the publicly-facing N11 BIOS (bottom left) that doesn't exhibit as much of an aggressive boost behavior after the 75C threshold, but the successive run (bottom right) exhibits the same behavior as the BIOS provided on the AMD portal, meaning we can probably chalk that up to run-to-run variance.

In either case, the trend here is undeniable: In contrast to the F4 BIOS, the Ryzen 7 3700X with the N11 BIOS stays above 4.2 GHz after it reaches the 75C threshold, with peaks that vary (4.25 to 4.225 GHz) on a per-run basis. Peak clocks only drop to 4.2 GHz after the chips exceed the 80C threshold. That means this BIOS is faster than the previous version during extended workloads.

The clock speed increases are relatively small, 225 to 250 MHz, but as we've seen with Der8auer's survey, many users who aren't reaching peak speeds are falling short by similarly slim margins. Small as they are, these deltas amount to 225 to 250 million cycles per second, which adds up to a billion extra cycles in roughly four seconds, not to mention the number of additional cycles that the transistors and interconnects will experience throughout their three-year warranty period.
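The arithmetic behind those figures is easy to check. The short Python sketch below reproduces it; the function names are ours, and the three-year total assumes the chip holds the higher clock around the clock, which is a deliberately unrealistic upper bound rather than a usage estimate.

```python
def seconds_to_extra_cycles(delta_mhz, target_cycles=1_000_000_000):
    """Seconds needed to accumulate target_cycles at a clock delta in MHz."""
    return target_cycles / (delta_mhz * 1_000_000)


def extra_cycles_over(delta_mhz, years, duty_cycle=1.0):
    """Extra cycles accumulated over a number of years.

    duty_cycle is the fraction of time spent at the higher clock.
    The default of 1.0 (always boosting) is an upper bound, not a
    realistic usage pattern.
    """
    seconds = years * 365 * 24 * 3600
    return delta_mhz * 1_000_000 * seconds * duty_cycle


for delta in (225, 250):
    print(delta, "MHz:", round(seconds_to_extra_cycles(delta), 2), "s per billion cycles")

print(f"{extra_cycles_over(250, 3):.3e} extra cycles over three years")
```

At 250 MHz the billion-cycle mark arrives in exactly four seconds, and at 225 MHz it takes a little under four and a half.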

In other words, it's safe to assume that these small alterations, which occur during the time of peak stress with high heat/current density, could have a meaningful impact on long-term reliability. Of course, that doesn't mean that is the intention behind the temperature threshold adjustments, but it is a possibility. It's also possible that AMD is merely tuning its boost algorithms to provide a more targeted range of effective performance, and these alterations have nothing to do with reliability metrics.

The next step is to see the current state of the BIOS. Here we're testing Gigabyte's latest, the F5 BIOS with AGESA 1.0.0.3 ABB. Gigabyte doesn't specify the SMU revision on its site, but this is 46.40.0. Here we can see the chip drop to 4.2 GHz after it reaches 76C, where it remains well after 80C.

This is markedly different, and less desirable, than the N11 'reviewer BIOS.' This means the chip will run slower during lightly-threaded workloads, and you won't hit peak clocks if the chip is over 75C.

Finally, these are the results of F5P, a beta BIOS that Gigabyte posted a few days ago to Overclock.net. We're testing this latest revision to see if there are any measurable changes in what is expected to be the last pre-AGESA-fix BIOS. The chip remains above 4.2 GHz until it exceeds 77C for five seconds. Again, this is slower than the reviewer BIOS. 

As promised, here is an album with 'zeroed' axes. As noted above, in the grand scheme of things these deltas may appear small, but they can be incredibly impactful from a long-term reliability perspective. Also, many of these small deltas are representative of the various boost clock reports we've seen from AMD's customers.

Thoughts

It's easy to vilify Intel for attacking AMD about these changes, especially given the unsubstantiated nature of the reports it's cited. We're accustomed to seeing unsavory marketing tactics from both AMD and Intel alike, among many other companies, but there should be some awareness at Intel that promoting unproven theories with its company logo next to them is inherently risky. It lends credibility to reports that might not have any real merit. Instead, Intel should work to put proven metrics behind statements that call into question the reliability of competing products.

Intel is quick to point out that it still holds the performance leadership position in gaming, but Ryzen does the most damage in the high-volume mid-range, and there Intel can't claim the same advantage. We fully expect the Core i9-9900KS to be a beast in most workloads, but it does nothing to address the company's shortcomings in the mid-range, which all boil down to curtailed features for the sake of keeping margins healthy. It will be fun to watch the tit-for-tat unfold when Intel has a new lineup for the mid-range.

Marketing hijinks aside, it's hard to determine if AMD has adjusted the temperature limits to rein in reliability metrics. But our tests show that there have been alterations, even after the company initially cut back to the 75C limit. And there's no doubt that the 'reviewer BIOS' is faster and sustains higher boost clocks for longer than the latest releases. Many chips aren't reaching their full boost potential even with the N11 'reviewer BIOS,' so AMD's fix might consist of returning to the original hard 80C boost temperature limit, which we're told hasn't been officially seen in the wild.

If AMD did adjust the threshold to align the chips with its reliability projections, and that's a big if, returning to the 80C limit would expose it to a higher failure rate. However, that doesn't mean Ryzen chips are going to die in droves. Chip longevity is strictly controlled to keep the number of RMAs at a tenable level, typically dictated by the financial impact to the company. Alterations could ultimately equate to a comparatively small increase in the number of failures over time. At this point, a minor increase in failure rates would certainly be preferable to a class-action lawsuit for false advertisement, not to mention the damage to the Ryzen brand.

And AMD's failure rate calculations are likely very complex, especially given that the company is using the Windows 10 scheduler to target workloads at the faster cores (which the operating system sees as favored cores).

It's a basic fact: Some cores wear out faster than others, particularly if they are utilized more frequently than others. Remember, cores often drop into lower power states/frequencies when they aren't active, which will reduce wear. However, with the scheduler targeting certain Ryzen cores more than others, that could result in faster wear on a core that is consistently more active than others. Intel also uses a similar tactic with its Turbo Boost 3.0 feature, so this isn't entirely unexplored ground, but it certainly factors into a very complex failure rate matrix. (Coincidentally, the next Windows 10 insider build has another scheduler alteration that rotates workloads more efficiently across favored cores to improve performance and reliability, but whether that just represents a broader implementation of the changes already made for Ryzen processors remains to be seen.)

Ultimately, AMD's temperature adjustments aren't the only reason that users aren't reaching rated clock speeds. Much of it could be down to users running older versions of Windows that don't target the favored cores correctly, or just general user error, but the altered thresholds are almost certainly a factor. AMD is binning these chips to the very limits of the silicon, so it has precious little wiggle room to play with.

To be clear, we stand by the recommendations we've made in both our reviews and our Best CPU articles. The Ryzen 3000 series processors bring a new class of performance, and value, to the mainstream desktop. But we also expect the products we purchase to reach their rated specifications, so we're happy to hear that AMD is busy working on a fix. 

We now have a good base of knowledge to determine if AMD's forthcoming fix, which it will unveil September 10, involves adjusting the thermal threshold further. We won't know more until the new BIOS and/or SMU revisions are in hand, but we'll be ready at our test benches when it lands.

Paul Alcorn
Managing Editor: News and Emerging Tech

Paul Alcorn is the Managing Editor: News and Emerging Tech for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.

  • jimmysmitty
    One of the biggest issues we face with silicone today is the degradation of the material as it gets smaller and under higher temperatures. Its why a lot of smaller process technology has lower temperature thresholds, although we still push them.

    Intel actually did at one point have an idea to have a CPU designed with reserve cores so that if a core died or had issues it could be activated and the dying/dead core could be brought back.

    Another solution that Intel, IBM and any company involved in process technology are looking for is alternative materials to Silicon that can survive the stresses better.

    I doubt we will see a Ryzen CPU just die any more than an Intel would die in its useful lifetime. So its not a major issue but I do wonder if AMD does know there is potential there,
  • redgarl
    Why are you posting Intel's garbage propaganda? You are doing them a favor for spreading their fud! That's sound like Intel 5GHz 28 cores demo all over again... and their 10nm ice lake paper launch... you didn't learned anything yet???!!!
  • MasterMadBones
    redgarl said:
    Why are you posting Intel's garbage propaganda? You are doing them a favor for spreading their fud! That's sound like Intel 5GHz 28 cores demo all over again... and their 10nm ice lake paper launch... you didn't learned anything yet???!!!

    From the article, indicating that Tom's doesn't approve of Intel's claims:
    there should be some awareness at Intel that promoting unproven theories with the company logo next to them is risky. It lends credibility to reports that might not have any real merit. Instead, Intel should work to put proven metrics behind statements that call into question the reliability of competing products.

    Ice lake is no paper launch (yet). It takes a couple of months for devices with new CPUs to appear on the market, especially when they require a brand new platform.

    Pitchforks aside, the article does give a decent insight into what may be the cause of Ryzen 3000's frequency "problem". And indeed, AMD would not have changed its boost bins for no reason. The company does seem a little concerned about reliability.

    Even then, the maximum boost frequency often falls 25-75MHz shy of the advertised speed, which appears to be a legitimate problem with how Precision Boost handles SenseMI data. I will put the claims of differences up to 300MHz down to confirmation bias. Many people have a lot of processes running in the background that they're not aware of, which can throw the Windows scheduler off and make it target multiple cores at low usage. Those are also the people who wouldn't normally check their CPU frequency, unless they saw a forum post saying some people are experiencing issues.
  • mamasan2000
    If you as a company don't feel threatened, you wouldn't comment at all, it becomes a non-issue.
    This sounds just like Nvidia and Jensen Huangs comments about AMD GPUs.
    https://www.techpowerup.com/251400/nvidia-ceo-jensen-huang-on-radeon-vii-underwhelming-the-performance-is-lousy-freesync-doesnt-work?cp=36 months later and Nvidia supports Freesync too, officially, on select monitors. So how lousy can it be?
  • MasterMadBones
    mamasan2000 said:
    If you as a company don't feel threatened, you wouldn't comment at all, it becomes a non-issue.
    This sounds just like Nvidia and Jensen Huangs comments about AMD GPUs.
    https://www.techpowerup.com/251400/nvidia-ceo-jensen-huang-on-radeon-vii-underwhelming-the-performance-is-lousy-freesync-doesnt-work?cp=36 months later and Nvidia supports Freesync too, officially, on select monitors. So how lousy can it be?
    Worse, Nvidia tried to blame AMD when their own "G-Sync compatible" implementation didn't work, even though AMD's GPUs supported it just fine. It's even more ridiculous when you realize that Freesync is just AMD's brand name for the VESA Adaptive Sync standard.
  • BFG-9000
    LoL Intel has found a way to make their inability to go below 14nm a "feature."

    As for the "problem," each time AMD CPUs have become competitive they have had to push their technology too far--hence the PIII 1.13GHz recall and the cancellation of the 4GHz P4. Both teams have now taken up most of the margin that overclockers used to use--why leave free performance on the table when you could use it, due to competition? That's why things are barely overclockable nowadays.

    Also, processors used to have a 10-year design life so why leave that much margin if they only come with a 3-year warranty anyway? That would explain why the maximum safe voltage ratings have barely changed--back in 2007 the maximum safe voltage (above which immediate damage may occur) was 1.45v on a 45nm process. 12nm Ryzen was also rated 1.45v and Zen 2 is 1.325v on a 7nm process. That must come out of the lifespan margin, but AMD has to still be confident in 3 years of life @ 100% 24/7--as they are betting the company on it.

    When you overclock, you are using that margin. A factory overclock is no different.

    It's interesting to note that 4000-series Haswell has a reputation for degrading over time while Ivy Bridge does not, when both are 22nm.
  • cryoburner
    Considering how you rarely hear of CPUs failing, even overclocked ones, I don't suspect endurance is much of a problem. Even if endurance happened to be somewhat lower at 7nm, an increase in failures from "almost never" to "extremely unlikely" probably isn't going to be a concern. At the very least, I highly doubt AMD is concerned about getting returns within the chip's 3-year warranty, since the number of people making use of such a warranty would undoubtedly be infinitesimally small.
  • ryzengamer
    You can remove the CPU frequency scheduler from the Linux kernel and use Bios to control the CPU.
  • nicalandia
    This site has become a Joke, spreading FUD and Click baits in favor of Intel.
  • logainofhades
    nicalandia said:
    This site has become a Joke, spreading FUD and Click baits in favor of Intel.
    You obviously didn't read the full article.