13th and 14th Gen Intel CPU instability also hits servers — W680 boards with Core i9 K-series chips are crashing

Raptor Lake (Image credit: Intel)

It's already known that Intel's 13th and 14th Generation processors, based on the Raptor Lake architecture, are experiencing serious stability issues, particularly in games. However, YouTuber Level1Techs has discovered that the problem is also prevalent in the data center, not just in gaming PCs. Game servers using W680-chipset motherboards and high-performance Core i9-13900K and Core i9-14900K CPUs are crashing at an alarming rate.

The Linux enthusiast obtained crash data from thousands of servers running Intel's Raptor Lake Core i9 K-series processors. He found that roughly 50% of the Raptor Lake servers for which he obtained telemetry have stability issues, even though all of them run server-grade LGA1700 motherboards from Asus and Supermicro.

These servers' stability issues have become so prevalent that they affect how server providers do business with their customers. Level1Techs highlights one server provider that charges $1,280 for labor and onsite repair services on its Core i9-14900K-based servers, compared to $139 on its Ryzen 9 7950X-powered servers, a difference of more than $1,000. This additional service charge gets tacked onto the server pricing itself.

This same server provider disclosed to Level1Techs that support incidents involving system crashes and stability issues are unusually high with its Raptor Lake servers. To make matters worse, BIOS updates, E-core disablement, and even physical CPU swaps don't guarantee that the issues will stay away, making these Raptor Lake systems a nightmare for server providers to troubleshoot.

"...we had good luck with the 12900KS, and have always had good luck with Xeons ... something isn't right with the 13900K and 14900K. We already replaced a lot of customer's 13900K with 14900K [CPUs] and the issues don't seem to fully [get] resolved....been steering customers towards 7950X systems instead. They're almost always faster anyway."

This discovery confirms that Intel's stability problems with Raptor Lake and Raptor Lake Refresh are more complicated than ever. The server-based motherboards used by these 13900K and 14900K servers are focused entirely on stability and running the chips within specifications, with no way to overclock these chips. The fact that Intel's 13th and 14th Gen chips still crash on these boards suggests that the problem lies in the chips themselves, whether that is an architectural flaw, factory clock speeds that Intel tuned too aggressively, or something else.

So far, Intel has attempted to patch up Raptor Lake's problems with several "band-aids," including a Baseline profile with safer power targets and new microcode updates that address an eTVB bug, which caused specific Raptor Lake models to boost clock speeds too high beyond a particular temperature. However, none of these attempts has fully rectified the issue. As far as we know, Intel is still investigating the root cause of Raptor Lake's stability issues.

Aaron Klotz
Contributing Writer

Aaron Klotz is a contributing writer for Tom’s Hardware, covering news related to computer hardware such as CPUs and graphics cards.

  • logainofhades
    Intel is taking way too long to figure this out.
  • Metal Messiah.
    The data from the Level1Techs YouTube channel also shows error logs from Oodle game telemetry data. Intel's 13th and 14th Gen CPUs represent a major portion of the error logs.

    Intel accounted for 1,431 decompression errors (out of 1,584 over 90 days), while AMD had only four such errors, far fewer than Intel.

    Breakdown shows that more than 70% of Intel's CPUs were prone to errors compared to 30% of AMD.


    https://i.imgur.com/EKJEqyg.png
  • TerryLaze
    Go to 17:40 in the video... "Closer look at configurations"
    The 14900K has a P-core base clock of 3.2GHz and the servers were running them at 5.3GHz. Well, actually he says that the most stable setting was a max multiplier of 53, for 5.3GHz, so they had them clocked even higher than 5.3...
    just because people have servers doesn't make them any less prone to being dumb.
    Excessive overclocking has never been stable, just because more mature nodes can take more abuse without blowing up right away doesn't mean that they can remain stable as well.

    Also I don't remember the whole video but I don't think he mentions if they set the bios according to intel baseline/default?
  • PCWarrior
    Metal Messiah. said:
    The data from the Level1Techs YouTube channel also shows error logs from Oodle game telemetry data. Intel's 13th and 14th Gen CPUs represent a major portion of the error logs.

    Intel accounted for 1,431 decompression errors (out of 1,584 over 90 days), while AMD had only four such errors, far fewer than Intel.

    Breakdown shows that more than 70% of Intel's CPUs were prone to errors compared to 30% of AMD.


    https://i.imgur.com/EKJEqyg.png
    That’s not what he said or what this chart shows. Go to 5:07 of the video and rewatch. The chart simply says that, of the CPUs that reported errors, 70% are Intel and 30% are AMD. It definitely doesn’t say that 70% of Intel CPUs and 30% of AMD CPUs have problems or are prone to errors. And the 70%-30% distribution of reported errors can also be attributed to market/user share, as explicitly said just 8 seconds later.
  • hotaru251
    logainofhades said:
    Intel is taking way too long to figure this out.
    Or they know but can't publicly say, as it would lead to a class action lawsuit for a defective product.
  • TerryLaze
    hotaru251 said:
    Or they know but can't publicly say, as it would lead to a class action lawsuit for a defective product.
    Doubtful. AMD released a whole generation of CPUs with thermal protection so faulty that the CPU would outright burn out and blow up like it's 1999, back in the days when AMD didn't have any thermal protection at all.
    If that didn't lead to a class action then just having crashes when overclocking will definitely not lead to one, no matter what the cause turns out to be.
  • TJ Hooker
    Admin said:
    The server-based motherboards used by these 13900K and 14900K servers are focused entirely on stability and running the chips within specifications, with no way to overclock these chips.
    The W680 chipset does support overclocking. It may be odd to do so on a platform targeted at stable, professional use (and I would hope the motherboard vendors aren't juicing the settings by default like they were on their regular consumer boards), but there's nothing preventing it.

    Edit: Unless motherboard OEMs are disabling OC on their W680 boards.
  • Eximo
    This is one of those use cases where lower core count and high clock speeds were likely the point from the beginning for the people that bought it. Not much sense in using a consumer CPU if you can't clock it like one.
  • TJ Hooker
    TerryLaze said:
    Go to 17:40 in the video... "Closer look at configurations"
    The 14900K has a P-core base clock of 3.2GHz and the servers were running them at 5.3GHz. Well, actually he says that the most stable setting was a max multiplier of 53, for 5.3GHz, so they had them clocked even higher than 5.3...
    just because people have servers doesn't make them any less prone to being dumb.
    Excessive overclocking has never been stable, just because more mature nodes can take more abuse without blowing up right away doesn't mean that they can remain stable as well.

    Also I don't remember the whole video but I don't think he mentions if they set the bios according to intel baseline/default?
    14900K is specified to boost to 5.6 GHz (up to 6 GHz, on select cores if temp is kept in check). Are you arguing that Intel's own Turbo Boost technology, running within specified boost limits, constitutes "excessive overclocking"?
  • thestryker
    Asus W680 boards definitely run maximum TDP and have a decent VRM setup for a client workstation board. The primary difference over desktop being that no sort of multicore enhancement type thing exists nor do unlimited power profiles. W680 is also locked down similarly to the B series chipsets so while it has overclocking options they're not like the Z series (though they're the same chip).

    Supermicro lists their own TDPs on their workstation boards and have very limited VRM so I imagine those TDP listings are maximum operation.

    Given that these issues have cropped up on Supermicro boards the only thing that would really make sense immediately is VF curve on the chips themselves. If it was something endemic with the die/architecture it wouldn't be predominantly 13900K/14900K+ since every RPL SKU using RC die is the same B0 stepping.
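
The disagreement in the comments over the Oodle chart comes down to two different quantities: each vendor's share of the CPUs that reported errors, and the fraction of each vendor's own CPUs that reported errors. The Python sketch below is only an illustration. The fleet sizes and error counts are made-up numbers (the thread does not give the installed base), and the function names are hypothetical. It shows how a 70/30 split among erroring CPUs can coexist with identical per-vendor error rates when the installed base is itself split 70/30.

```python
def share_of_erroring(erroring_by_vendor):
    """Each vendor's share of all CPUs that reported an error."""
    total = sum(erroring_by_vendor.values())
    return {vendor: n / total for vendor, n in erroring_by_vendor.items()}


def error_rate(erroring_by_vendor, fleet_by_vendor):
    """Fraction of each vendor's own CPUs that reported an error."""
    return {
        vendor: erroring_by_vendor[vendor] / fleet_by_vendor[vendor]
        for vendor in erroring_by_vendor
    }


# Hypothetical numbers, chosen only to illustrate the base-rate effect.
fleet = {"Intel": 7000, "AMD": 3000}   # assumed installed base
erroring = {"Intel": 70, "AMD": 30}    # assumed CPUs reporting errors

shares = share_of_erroring(erroring)   # Intel 0.7, AMD 0.3
rates = error_rate(erroring, fleet)    # 0.01 for both vendors

print("share of erroring CPUs:", shares)
print("per-vendor error rate: ", rates)
```

Deciding which reading applies to the real telemetry would require the per-vendor installed base, which the comments above do not supply.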