While testing a new water block for AMD's Threadripper platform, we stumbled across a bug in the company's firmware. Compared with the values from our launch coverage, which used a comparable water block, the temperatures we measured while overclocking (at 325W power consumption) were lower by up to 25 Kelvin for Tctl (the control temperature) and up to 35 Kelvin for Tdie (the chip temperature)!
This alone wouldn't be dramatic, since the values we saw operating the CPU at its factory clock rate (and about 180W of package power) now correspond to what we can expect from a good water block, a custom loop, and a soldered IHS. Recall that for the Threadripper launch, AMD communicated a 27°C offset that has to be subtracted from the Tctl values to arrive at the average core temperature. Thus, we were happy when our factory clock values looked realistic.
Then we found that as power consumption rose, the reported temperature went down! At 180W we saw approximately 67°C for Tctl. But this reading dropped to 51°C at about 325W (or 16 Kelvin less). This makes very little sense, of course, especially since these are the same values the sensor reports at idle with short, small load peaks.
We saw the same effect with Tdie. The 24°C value at 325W is nonsensical. Note that AMD's WattMan also uses this extremely low value, and motherboards rely on it for their temperature-based fan control. As you can imagine, this causes significant issues when overclocking.
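The relationship between the two readings is simple arithmetic, and a quick plausibility check makes the anomaly obvious. The following is a minimal sketch that uses only the figures quoted above. The function names and the check itself are our own illustration, not an AMD tool, and the check assumes that cooling and ambient conditions are the same for all samples.

```python
# Offset AMD communicated for Threadripper: Tdie = Tctl - 27 °C
TCTL_OFFSET_C = 27.0

def tdie_from_tctl(tctl_c: float) -> float:
    """Convert a reported Tctl value to Tdie by removing the fixed offset."""
    return tctl_c - TCTL_OFFSET_C

# (package power in W, reported Tctl in °C), taken from the measurements above
readings = [(180, 67.0), (325, 51.0)]

def implausible_pairs(samples):
    """Return consecutive samples where power rises but temperature falls.

    Assumes identical cooling and ambient conditions (hypothetical helper).
    """
    ordered = sorted(samples)  # sort by package power
    return [
        (low, high)
        for low, high in zip(ordered, ordered[1:])
        if high[0] > low[0] and high[1] < low[1]
    ]

if __name__ == "__main__":
    for power_w, tctl_c in readings:
        print(f"{power_w} W: Tctl {tctl_c} °C, Tdie {tdie_from_tctl(tctl_c)} °C")
    print("Implausible:", implausible_pairs(readings))
```

Here `tdie_from_tctl(51.0)` yields 24.0, which matches the Tdie value reported at 325W, and the pair (180W, 325W) is flagged because the higher power draw comes with the lower reading.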
So, we set out looking for clues. In order to exclude our systems as the source of error, we tested them extensively.
We started with a new, clean Windows image with old and new drivers. We switched between three motherboards from different manufacturers (Asus, Gigabyte, and ASRock) using the latest BIOS. None of this changed the readings.
But after we flashed back from BIOS 0503 to the old 0304 (used for our launch review) on Asus' X399 ROG Zenith motherboard, we saw the old temperature values once again, in addition to the already-documented stability problems. We therefore hypothesize that the cause of the error is AGESA code 1003 Patch 4, and that it reports the calculated temperatures incorrectly during overclocking, which can lead to reduced fan speeds just as power consumption increases.
We tested further with a much weaker AIO cooler, and our overclocking led to significantly lower fan speeds when using the motherboard's PWM-controlled fans. The result is a thermal accident waiting to happen. An air cooler is therefore out of the question for now.
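To illustrate why an under-reported temperature translates directly into slower fans, here is a minimal sketch of a temperature-based fan curve with linear interpolation. The curve points are hypothetical and not taken from any real motherboard, and the two Tctl inputs are simply the readings from our water block test used as example values.

```python
# Hypothetical fan curve: (Tctl in °C, PWM duty in %). Illustration only.
FAN_CURVE = [(40.0, 30.0), (60.0, 50.0), (80.0, 100.0)]

def pwm_duty(temp_c: float, curve=FAN_CURVE) -> float:
    """Return the PWM duty cycle for a temperature via linear interpolation."""
    if temp_c <= curve[0][0]:
        return curve[0][1]   # clamp below the first point
    if temp_c >= curve[-1][0]:
        return curve[-1][1]  # clamp above the last point
    for (t0, d0), (t1, d1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)

if __name__ == "__main__":
    # Tctl reported at about 180W versus about 325W
    for tctl_c in (67.0, 51.0):
        print(f"Tctl {tctl_c} °C -> {pwm_duty(tctl_c)} % duty")
```

With this made-up curve, the lower reading at the higher power draw results in a noticeably lower duty cycle, which is exactly the wrong direction for a fan controller to move.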
We have already informed AMD about these measurements, and we are awaiting a statement or a new BIOS, which we will re-test for an update. For now, we recommend manually controlling the fans when using the current BIOS versions.