AMD Threadripper BIOS Reporting Wrong Temps?

While testing a new water block for AMD's Threadripper platform, we stumbled across a bug in the company's firmware. Compared to the values from our launch coverage, which used a comparable water block, the temperatures we measured during overclocking (at 325W power consumption) were lower by up to 25 Kelvin for Tctl (the control temperature) and up to 35 Kelvin for Tdie (the die temperature)!

This alone wouldn't be dramatic, since the values we saw with the CPU at its factory clock rate (and about 180W of package power) now correspond to what we would expect from a good water block, a custom loop, and a soldered IHS. Recall that for the Threadripper launch, AMD circulated a 27°C offset that must be subtracted from Tctl to arrive at the average core temperature. Thus, we were happy when our factory clock values looked realistic.

Then we found that as power consumption rose, the reported temperature went down! At 180W we saw approximately 67°C for Tctl, but the reading dropped to 51°C at about 325W (16 Kelvin less). This makes very little sense, of course, especially since these low values are also what the sensor reports at idle with short, small load peaks.

We saw the same effect with Tdie. The 24°C value at 325W is nonsensical. Note that AMD's WattMan also uses this extremely low value, as do motherboards for their temperature-based fan control. As you can imagine, this causes significant issues when overclocking.
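To make the arithmetic behind these readings explicit, here is a minimal Python sketch. It assumes the 27°C Tctl offset AMD circulated at launch applies as a plain subtraction; the function and constant names are hypothetical, chosen for illustration, and are not part of any AMD tool.

```python
# Illustrative only: these names are hypothetical, not an AMD API.
TCTL_OFFSET_C = 27.0  # offset AMD circulated for Threadripper at launch


def tdie_from_tctl(tctl_c: float) -> float:
    """Estimate die temperature by removing the fixed Tctl offset."""
    return tctl_c - TCTL_OFFSET_C


# Tctl readings quoted in this article, keyed by approximate package power (W).
tctl_by_power = {180: 67.0, 325: 51.0}

for watts, tctl in tctl_by_power.items():
    print(f"{watts} W: Tctl {tctl:.0f} C -> Tdie {tdie_from_tctl(tctl):.0f} C")

# A temperature difference has the same magnitude in kelvin and in degrees Celsius.
delta_k = tctl_by_power[180] - tctl_by_power[325]
print(f"Tctl drop from 180 W to 325 W: {delta_k:.0f} K")
```

With the article's figures, this yields a Tdie of 40°C at 180W and 24°C at 325W, and a 16 K drop in Tctl despite the higher power draw.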

So, we set out looking for clues. In order to exclude our systems as the source of error, we tested them extensively.

We started with a new, clean Windows image with old and new drivers. We switched between three motherboards from different manufacturers (Asus, Gigabyte, and ASRock) using the latest BIOS. None of this changed the readings.

But after we flashed back from BIOS 0503 to the old 0304 (used for our launch review) on Asus' X399 ROG Zenith motherboard, we saw the old temperature values once again, in addition to the already-documented stability problems. We therefore hypothesize that the cause of the error is AGESA code 1003 Patch 4, which reports the calculated temperatures incorrectly during overclocking, with the potential for reduced fan speeds as power consumption increases.

We tested further with a much weaker AIO cooler, and our overclocking led to significantly lower fan speeds when using the motherboard's PWM-controlled fans. The result is a thermal accident waiting to happen. An air cooler is therefore out of the question for now.
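One way to spot this kind of misreporting in your own logs is a simple plausibility check: with the same cooler and a steady load, higher package power should not produce a lower steady-state temperature. The Python sketch below is a hypothetical illustration, not the method used for the measurements above; the function name, the sample format, and the 2°C noise tolerance are all assumptions.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (package power in W, reported temperature in C)


def find_implausible_pairs(
    samples: List[Sample], tolerance_c: float = 2.0
) -> List[Tuple[Sample, Sample]]:
    """Return neighbouring samples (sorted by power) where more power reads cooler.

    Assumes every sample is a steady-state reading taken with the same cooler.
    tolerance_c is an arbitrary allowance for sensor noise.
    """
    ordered = sorted(samples)  # sort by power, then by temperature
    flagged = []
    for lower, higher in zip(ordered, ordered[1:]):
        if higher[0] > lower[0] and higher[1] < lower[1] - tolerance_c:
            flagged.append((lower, higher))
    return flagged


# Tctl readings quoted in this article.
readings = [(180.0, 67.0), (325.0, 51.0)]
for lower, higher in find_implausible_pairs(readings):
    print(
        f"Suspicious: {higher[1]:.0f} C at {higher[0]:.0f} W "
        f"is below {lower[1]:.0f} C at {lower[0]:.0f} W"
    )
```

The check only compares neighbouring power levels, which is enough to flag the 180W and 325W readings reported here.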

We have already informed AMD about these measurements, and we are awaiting a statement or a new BIOS, which we will re-test and report on in an update. For now, we recommend manually controlling the fans when using the current BIOS versions.

  • cia1413
    Why would you quote a K value? I would bet less than 1% of people know intuitively what that means in relation to C.
  • Clamyboy74
    20122130 said:
    Why would you quote a K value? I would bet less than 1% of people know intuitively what that means in relation to C.

    Its just math, subtract 273
  • Patrick_Bateman
    In case anyone's wondering, -1C=272K; 0C=273K; 1C=274K; you get the picture.
  • JamesSneed
    20122130 said:
    Why would you quote a K value? I would bet less than 1% of people know intuitively what that means in relation to C.

    Personally it bothered me more mixing Kelvin and Celsius temperatures throughout the article. Just pick a direction for god sakes.
  • TheTechGen
    Off topic but I want that mother board!
  • derekullo
    Might as well mix the rest in;
    Fahrenheit, Rankine, Delisle, Newton, Réaumur, Rømer and Electronvolts.

    Newton, at some archaic time, was a measure of temperature lol.
    https://en.wikipedia.org/wiki/Temperature#Units
  • InvalidError
    My guess about where the reading difference comes from? AMD didn't use a separate ground circuit for its temperature sensing element and the "temperature" reading ends up offset by the "voltage droop" on the ground as current goes up. Electrical resistance also goes up with temperature, which could compound the effect.
  • Gillerer
    20122173 said:
    20122130 said:
    Why would you quote a K value? I would bet less than 1% of people know intuitively what that means in relation to C.

    Its just math, subtract 273

    In delta temperatures, value in K is exactly the same as value in °C: ∆T of +1 K = +1°C.
  • willwill56
    lol at all of the comments confused by Kelvin.

    Specifying change in temperature in Kelvin is perfectly fine. 10 Kelvin colder is, by definition of the Kelvin itself, exactly the same as 10 Celsius colder.
  • davidchaoth
    "We have already informed AMD about these measurements, and we are awaiting a statement or a new BIOS, ..."

    AMD's statement: All our employees' paychecks have been correctly deposited; the auto-deposit function works well as expected : )