13th and 14th Gen Intel CPU instability also hits servers — W680 boards with Core i9 K-series chips are crashing
These non-Z series motherboards don't push 13th and 14th Gen chips nearly as hard, so what's happening?
It's already known that Intel's 13th and 14th Generation processors based on its Raptor Lake CPU architecture are experiencing serious stability issues, particularly in games. However, YouTuber Level1Techs has discovered that this problem is also prevalent in the data center, not just in gaming PCs. Game servers using W680 chipset motherboards and high-performance Core i9-13900K and Core i9-14900K CPUs are crashing at an alarming rate.
The Linux enthusiast obtained crash data from thousands of servers running Intel's Raptor Lake Core i9 K-series processors. He discovered that roughly 50% of the Raptor Lake servers he obtained telemetry information from have stability issues, despite each of them running server-grade LGA1700 socket motherboards from Asus and Supermicro.
These servers' stability issues have become so prevalent that they affect how server providers conduct business with their customers. Level1Techs highlights one server provider that charges over $1,000 more for labor and onsite repair services alone on its Core i9-14900K-based servers than on its Ryzen 9 7950X-powered servers ($139 for the 7950X vs. $1,280 for the 14900K). This additional service charge gets tacked onto the server pricing itself.
This same server provider disclosed to Level1Techs that support incidents regarding system crashes and stability issues are unusually high with its Raptor Lake servers. To make matters worse, BIOS updates, E-core disablement, and even physical CPU swaps don't guarantee that these issues will not come back, making these Raptor Lake systems a nightmare for server providers to troubleshoot.
"...we had good luck with the 12900KS, and have always had good luck with Xeons ... something isn't right with the 13900K and 14900K. We already replaced a lot of customer's 13900K with 14900K [CPUs] and the issues don't seem to fully [get] resolved....been steering customers towards 7950X systems instead. They're almost always faster anyway."
This discovery confirms that Intel's stability problems with Raptor Lake and Raptor Lake Refresh are more complicated than ever. The server-based motherboards used by these 13900K and 14900K servers are focused entirely on stability and running the chips within specifications, with no way to overclock these chips. The fact that Intel's 13th and 14th Gen chips are still crashing suggests that the chips themselves have problems, whether an architectural problem, a clock speed issue with Intel over-tuning them from the factory, or something else.
So far, Intel has attempted to patch up Raptor Lake's problems with several "band-aids," including the introduction of a Baseline profile with safer power targets and new microcode updates to address an eTVB bug that caused specific Raptor Lake models to boost clock speeds too high beyond a particular temperature. However, none of these attempts has fully rectified the issue. As far as we know, Intel is still investigating the root cause of Raptor Lake's stability issues.
Aaron Klotz is a contributing writer for Tom’s Hardware, covering news related to computer hardware such as CPUs and graphics cards.
-
Metal Messiah.
The data from the Level1Techs YouTube channel also shows error logs from Oodle game telemetry data. Intel's 13th and 14th Gen CPUs represent a major portion of the error logs.
Intel accounted for 1,431 decompression errors (out of 1,584 over 90 days), while AMD had only four such errors, which is significantly lower than Intel.
Breakdown shows that more than 70% of Intel's CPUs were prone to errors compared to 30% of AMD.
https://i.imgur.com/EKJEqyg.png
-
TerryLaze
Go to 17:40 in the video... "Closer look at configurations"
The 14900K has a base clock for P-cores of 3.2 GHz and the servers were running them at 5.3 GHz. Well, actually he says that the most stable setting was a max multiplier of 53, for 5.3 GHz, so they had them clocked even higher than 5.3...
Just because people have servers doesn't make them any less prone to being dumb.
Excessive overclocking has never been stable. Just because more mature nodes can take more abuse without blowing up right away doesn't mean that they can remain stable as well.
Also, I don't remember the whole video, but I don't think he mentions whether they set the BIOS according to Intel baseline/default?
-
PCWarrior
Metal Messiah. said: The data from Level1Techs YT channel ALSO shows error logs at Oodle game telemetry data. Intel's 13th and 14th Gen CPUs represent a major portion of the error logs. Intel accounted for 1,431 decompression errors (out of 1584 over 90 days), while AMD only had four such errors, which is significantly lower than Intel. Breakdown shows that more than 70% of Intel's CPUs were prone to errors compared to 30% of AMD. https://i.imgur.com/EKJEqyg.png
That’s not what he said or what this chart shows. Go to 5:07 of the video and rewatch. This chart simply says that of the CPUs that reported errors, 70% are Intel and 30% are AMD. It definitely doesn’t say that 70% of Intel CPUs and 30% of AMD CPUs have problems or are prone to errors. And the 70%-30% distribution of reported errors can also be attributed to market/user share, as explicitly said just 8 seconds later.
-
hotaru251
logainofhades said: Intel is taking way too long to figure this out.
Or they know but can't publicly say, as it would lead to a class action lawsuit for a defective product.
-
TerryLaze
hotaru251 said: or they know but can't publically say as it would lead to class action lawsuit for defective product.
Doubtful. AMD released a whole generation of CPUs with thermal protection so faulty that the CPU would outright burn out and blow up like it's 1999. Those were the days when AMD didn't have any thermal protection at all.
If that didn't lead to a class action, then just having crashes when overclocking will definitely not lead to one, no matter what the cause turns out to be.
-
TJ Hooker
Admin said: The server-based motherboards used by these 13900K and 14900K servers are focused entirely on stability and running the chips within specifications, with no way to overclock these chips.
The W680 chipset does support overclocking. It may be odd to do so on a platform targeted at stable, professional use (and I would hope the motherboard vendors aren't juicing the settings by default like they were on their regular consumer boards), but there's nothing preventing it.
Edit: Unless motherboard OEMs are disabling OC on their W680 boards.
-
Eximo
This is one of those use cases where lower core count and high clock speeds were likely the point from the beginning for the people that bought it. Not much sense in using a consumer CPU if you can't clock it like one.
-
TJ Hooker
TerryLaze said: Go to 17:40 in the video... "Closer look at configurations" the 14900k has a base clock for p-cores of 3.2Ghz and the servers where running them at 5.3Ghz, well actually he says that the most stable settings was to set a max multiplier of 53, for 5.3Ghz , so they had them clocked even higher than 5.3... just because people have servers doesn't make them any less prone to being dumb. Excessive overclocking has never been stable, just because more mature nodes can take more abuse without blowing up right away doesn't mean that they can remain stable as well. Also I don't remember the whole video but I don't think he mentions if they set the bios according to intel baseline/default?
14900K is specified to boost to 5.6 GHz (up to 6 GHz, on select cores if temp is kept in check). Are you arguing that Intel's own Turbo Boost technology, running within specified boost limits, constitutes "excessive overclocking"?
-
thestryker
Asus W680 boards definitely run maximum TDP and have a decent VRM setup for a client workstation board. The primary difference over desktop being that no sort of multicore enhancement type thing exists nor do unlimited power profiles. W680 is also locked down similarly to the B series chipsets so while it has overclocking options they're not like the Z series (though they're the same chip).
Supermicro lists their own TDPs on their workstation boards and have very limited VRM so I imagine those TDP listings are maximum operation.
Given that these issues have cropped up on Supermicro boards the only thing that would really make sense immediately is VF curve on the chips themselves. If it was something endemic with the die/architecture it wouldn't be predominantly 13900K/14900K+ since every RPL SKU using RC die is the same B0 stepping.