Skylake CPU bug

Telekino

Reputable
Jan 2, 2016
48
0
4,540
I was reading about a bug in the new Skylake CPUs. Apparently, in very rare cases and in specific applications, the CPU can crash. Intel says the bug can be fixed with a BIOS update. Do you guys know anything about this bug?

http://www.pcworld.com/article/3021023/hardware/how-to-test-your-pc-for-the-skylake-bug.html

I'm building a new PC with a Skylake CPU. Should I be worried? And how long would it take the motherboard manufacturers to release a BIOS update for this?
 
Solution
The "fatal combo" is...

- Skylake AND
- HyperThreading AND
- AVX2 AND
- 768K FFT size

If you are going to test this and you are running Prime95 v28.7 you need to disable FMA3 by entering "CpuSupportsFMA3=0" into the local.txt file (you may need to create the file). Previous versions don't need to do this.
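A minimal sketch of that edit, for anyone who would rather script it than open a text editor. The helper name `disable_fma3` and the idea of passing in the Prime95 folder are assumptions for illustration; only the `CpuSupportsFMA3=0` line and the `local.txt` file name come from the instructions above.

```python
from pathlib import Path


def disable_fma3(prime95_dir):
    """Make sure local.txt in the given folder contains CpuSupportsFMA3=0.

    prime95_dir is a hypothetical argument: point it at wherever your
    Prime95 executable lives. The file is created if it does not exist.
    """
    local_txt = Path(prime95_dir) / "local.txt"
    setting = "CpuSupportsFMA3=0"

    # Keep whatever is already in the file, if it exists.
    lines = local_txt.read_text().splitlines() if local_txt.exists() else []

    # Drop any earlier CpuSupportsFMA3 line so the setting appears only once.
    lines = [ln for ln in lines if not ln.strip().startswith("CpuSupportsFMA3=")]
    lines.append(setting)

    local_txt.write_text("\n".join(lines) + "\n")
    return local_txt
```

Close Prime95 before running something like this, since the program may rewrite its own settings files on exit.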

So, if you are running a 6600K like me, or other Skylakes that don't have Hyper-Threading, you're safe.
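To make the "AND" logic explicit, here is a rough sketch of the combination as a predicate. The `Workload` type and its field names are made up for illustration; the four conditions are just the ones listed above, and this is not an official test for the bug.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All fields are illustrative assumptions, not a real API.
    is_skylake: bool
    hyperthreading_enabled: bool
    uses_avx2: bool
    fft_size_k: int  # FFT size in K, e.g. 768 for the 768K FFT


def hits_fatal_combo(w):
    """Return True only when every condition in the list holds at once."""
    return (
        w.is_skylake
        and w.hyperthreading_enabled
        and w.uses_avx2
        and w.fft_size_k == 768
    )


# A 6600K has no Hyper-Threading, so one condition fails and the combo is missed.
i5_6600k = Workload(is_skylake=True, hyperthreading_enabled=False,
                    uses_avx2=True, fft_size_k=768)
print(hits_fatal_combo(i5_6600k))
```

Because every condition must hold together, removing any single one (for example, disabling Hyper-Threading in the BIOS) takes you out of the known trigger case.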
Nearly all CPUs have bugs. It's hard to design a billion transistors without getting a few things wrong. Usually all they do is crash whatever program triggers it or cause a blue screen. Occasionally one is mild enough to cause computation errors.

Usually they're fixed with a BIOS update.
 


Or running any kind of stress test that uses prime numbers, like Prime95. Or running any number of scientific and financial applications. Or other applications that employ complex workloads.

But for most users, in everyday use, it's unlikely to be a problem. For gamers, probably not at all. I expect we'll see a microcode fix for this pretty quickly though.
Some bugs on the processor can be fixed by microcode.
A typical retail motherboard can have as many as eight
microcode files, covering the compatible CPU table
on the motherboard maker site, and those are stored
in the BIOS flash chip. Microcode is tiny, and variable
length. The last time I took apart that file, the segments
were in multiples of 2KB or so.

Microcode releases have a revision number, and
a patcher loading microcode is allowed to install
a patch only if it has a higher revision number
than the one currently in the processor.
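The revision rule is simple enough to sketch in a few lines. This is only a
model of the comparison described above, not how any real BIOS or OS loader
is written; the function names and the sample revision numbers are made up.

```python
def should_load(current_revision, candidate_revision):
    """A patch is eligible only if its revision is higher than the loaded one."""
    return candidate_revision > current_revision


def pick_update(current_revision, candidates):
    """Return the highest eligible revision, or None if nothing is newer."""
    newer = [rev for rev in candidates if should_load(current_revision, rev)]
    return max(newer) if newer else None


# Hypothetical revisions: the CPU currently reports 0x74 and the patcher
# has three files to choose from.
print(pick_update(0x74, [0x6A, 0x74, 0x76]))
```

With this rule, whichever patcher runs later (BIOS first, then the OS) can
only move the revision up, never back down.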

The BIOS has its own microcode patcher. The microcode
must be good enough to allow the system to boot into
the OS, so no storage bugs can exist in the microcode
shipped with the BIOS. All it has to do is get the
system booted.

Windows and Linux also have microcode patchers.
The Windows one does its job early after boot,
and then the service exits. So you don't really
see it.

The Windows one allows deployment of updates.
It's unclear whether a BIOS update would deploy
a new version any faster than Microsoft could
push a new file via Windows Update.

If you have a copy of the Intel Processor Identification
Utility (PIU), the field "revision" is actually the
release number of the microcode. There was one incident,
where no microcode was getting loaded, and the number
was zero. Most of the time, you will find a small finite
number for that field. In some cases, the utility
mistakenly masks the value read out, and some digits
may not belong there. (Maybe you see F07 instead
of 07.)
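If you are on Linux rather than Windows, the same number is visible without
the Intel utility: on x86, /proc/cpuinfo has a "microcode" line per logical
CPU. Below is a small sketch that pulls the value out of that text. The
function name is made up, and the sample text is abbreviated and illustrative,
not a dump from a real machine.

```python
def microcode_revisions(cpuinfo_text):
    """Collect the distinct microcode revisions found in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            # The kernel prints the value in hex, e.g. "0x76".
            revisions.add(int(value.strip(), 16))
    return revisions


# Abbreviated, illustrative sample with two logical CPUs.
sample = (
    "processor\t: 0\n"
    "microcode\t: 0x76\n"
    "\n"
    "processor\t: 1\n"
    "microcode\t: 0x76\n"
)
print(microcode_revisions(sample))
```

On a real system you would pass in `open("/proc/cpuinfo").read()`. A result
of `{0}` would correspond to the "no microcode loaded" case mentioned above.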

Some bugs in processors are fixed by actual code.
When AMD had a TLB bug in the 9500, they distributed
maybe a 15KB or so code module, to be added to the
BIOS. This code disabled the TLB, or a portion of
it, costing a small amount of performance. A
fixed version of the processor, for the same family,
had "50" added to the lower digits, so if you bought
a 9550 you knew it was fixed, whereas a 9500 wasn't.
So that fix wasn't microcode based, because it
wasn't an actual instruction problem. It was a
problem with virtual to physical address translation
of some sort.

The average processor has 100 errata. Some of the errata
are discovered a year or two after the first batch is
distributed for sale. Testing continues after release.
Many bugs are repaired via microcode updates. Some
are labeled "won't fix", meaning even if a new mask
revision was in the pipe, they had no plan to patch
out the problem. Some issues are innocuous enough they
don't need fixing.

In the case of the Prime95 issue above, the hand-coded FFTs
are perfect material for uncovering bugs. Frequently,
compilers produce "lame" code that doesn't give particularly
good fault coverage. So you don't see bugs, because the
instruction sequences aren't that challenging.

One AMD processor had an FPU bug caused by actual
electrical noise. It was discovered after release.
It took assembler code to do it. The assembler code
consisted of a nonsensical continuous sequence of
one FPU instruction after another. This drew enough
current to cause a noise problem in the substrate.
Errata like that receive a "will not fix" rating,
because it is not expected that anyone will be
coding with assembler, and using that stupid a sequence
of instructions. Real FPU code needs an occasional
bit test, branch condition, and so isn't solid 100% FPU
instructions one after another. And when a HLL is used,
the compiler/assembler wouldn't even get close to
the required FPU code density to break that processor.
(If I owned such a processor though, I'd be pissed.
For that not being caught in testing, or recognized
as a potential issue during design.)

When it comes to test benches for hardware design, you
run the important ones first (and try to finish them by
design close). The ridiculous tests are saved for later,
after production has begun. And that's when the AMD testers
carried out their artificial 100% density test and discovered
a problem. For our chip designs, some staff were running
simulation test cases a year after we had hardware in hand.
(And ours didn't have microcode to patch with, either.
We had another feeble mechanism for emergencies :) )

The level of bugs is rather constant. I don't recollect ever
looking at an errata sheet for a CPU and seeing zero bugs. It
just doesn't happen. I expect in some cases, staff already know
of multiple errata, even before design close, but the
boss says "ship it". I doubt they would hold up a mask
release, chasing every possible bug and making the CPU
two years late. That just isn't going to happen, especially
when the "good ole microcode" can pull your bacon out of the
fire.

So while it's sad that this "bendable" processor also has
an erratum, it probably has another 99 errata to keep that
one company. Most of those errata are invisible to end
users. The janitorial staff already cleaned up the mess.
 
Solution
Newer versions of Prime95 that use AVX instructions shouldn't be used for thermal or stability testing anyhow. The load isn't steady state, and the AVX instructions create unrealistic thermal conditions. Version 26.6 should be used for testing, and I can tell you for a fact that even without the AVX instructions, there is still a computational problem that results in fatal errors once Prime95 gets past a certain point. Even underclocked and undervolted, testing has duplicated the problem repeatedly somewhere around test 15 or 16 on my 6700K.
 

heliomphalodon

Distinguished
Jan 20, 2007
42
0
18,540
I was just about to buy a new system, but until I see benchmark results to indicate that no performance degradation results from the fix... I'll be waiting in a state of FUD. Surely Tom's expert readers will be posting their benchmarks before/after the upcoming fix, right?
 

heliomphalodon

Distinguished
Jan 20, 2007
42
0
18,540
Now that Microsoft has revealed that Windows 7 support on Skylake will be terminated on 18 July 2017, my FUD is cured!
https://blogs.windows.com/windowsexperience/2016/01/15/windows-10-embracing-silicon-innovation/
No Skylake for me - I'll proceed with previous-generation silicon, so that I can avoid for as long as possible the surveillance engine that is Windows 10. If I still care about computers in 2023 when Windows 8.1 support ends on pre-Skylake silicon, then I'll check my options at that time.
Thank you Microsoft for saving me some money on my new system! I wonder if Intel is equally grateful...
 

Amyrro

Reputable
Oct 23, 2015
52
0
4,640


I am planning to build my own rig for a coming project in CFD (Computational Fluid Dynamics), which is very CPU-intensive when involving gas flow.

I have had some experience with intensive FEA tasks before, when i7 processors (1st or 2nd Gen, I am not sure) took an hour or a little more to finish calculating the results.

In this CFD task, I am expecting heavier calculations. I might even end up running simultaneous FEA & CFD tests. I thought about 4th Gen i7-4790K, as I did not find a lot of talk about such bugs in it, plus it gives comparable (or slightly better) single threaded performance, and added to that, I could find some discounted pre-built rigs, as system builders are trying to clear out their stock for the new Skylake builds.

Before the bug news, I considered the i7-4790K for its price and performance, but leaned more towards Skylake for better "future-proofing" (as some like to describe it) and the expected better support for Skylake with future technologies in the long term.

My question is: after this news, is it worth taking the risk and relying on an i7-6700K in my case, especially since I cannot wait long for the fix (a month at most)? Add to that the fact that I do not quite trust BIOS fixes, as I did not have a pleasant experience with Samsung's and HP's BIOS fixes for hardware issues.

Regards

 
If you can, I'd wait a little while to see if a fix is released. Almost certainly the next-gen Skylake chips won't have this issue, since it's a bit different than the typical errata and will hopefully be addressed prior to manufacturing. Honestly, the 5820K on the X99 platform might be a better choice for what you plan to do. And it's less expensive than the 6700K anyhow.
 

Amyrro

Reputable
Oct 23, 2015
52
0
4,640


Thanks for your reply

Do you think it is going to be available in the market within a month from now? I think it is not just about Intel fixing the error. I am not sure suppliers will stock the latest CPUs from Intel right away, as they may have existing stock that they want to sell off before the corrected CPUs make it to the market.
 



It's going to be really hard to figure out whether the processor you order has the bug fixed or not. You'd have to know the Stepping to know and that's usually not advertised by the retailer. I'd not feel comfortable with a Skylake for a number of months - maybe 6 or more.
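Once the chip is actually in your machine, checking the stepping is easy. On Linux, /proc/cpuinfo lists it next to the family and model. Here is a rough sketch that picks those fields out of the text; the function name is made up and the sample values are illustrative. This does nothing for the pre-purchase problem, since you still can't see the stepping on a boxed CPU at the retailer.

```python
def cpu_identity(cpuinfo_text):
    """Return model name, family, model and stepping of the first CPU listed."""
    wanted = {"model name", "cpu family", "model", "stepping"}
    found = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Keep only the first occurrence; later logical CPUs repeat the same values.
        if sep and key in wanted and key not in found:
            found[key] = value.strip()
    return found


# Abbreviated, illustrative sample of /proc/cpuinfo text.
sample = (
    "processor\t: 0\n"
    "cpu family\t: 6\n"
    "model\t\t: 94\n"
    "model name\t: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz\n"
    "stepping\t: 3\n"
)
print(cpu_identity(sample)["stepping"])
```

On a real Linux system you would pass in `open("/proc/cpuinfo").read()` instead of the sample text.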
 

Telekino

Reputable
Jan 2, 2016
48
0
4,540
I just built my computer today with the Skylake CPU. So far no problems, but I've only loaded Windows and haven't done anything else yet. I'm only going to play games, type documents, do simple things in Excel, browse the web, and other ordinary everyday use. Hopefully I don't have any problems.
 

Telekino

Reputable
Jan 2, 2016
48
0
4,540


So what's the worst case scenario for me? I'm playing a game and the system crashes, I restart my computer and lose all progress after the last saved checkpoint?

If this is the worst that can happen, it's not a big deal, so long as it doesn't happen frequently. Even if it happened once a month it wouldn't be a big deal. But it sounds like the chances of it happening are like once in a million years.
 
No games that I know of process Mersenne prime numbers, and games are not really considered "complex" workloads, so it shouldn't affect gamers. Running Prime95 or other stress-testing software, or maybe a few select VERY high-end scientific and financial applications that use similar processing algorithms, "could" but not necessarily "would" be affected. For everything else, the indications are that you would not see it even once in a million years, and you would never have known about it if you hadn't read the article or seen reports on the forums.
 
