Intel In-Field Scan Feature in Sapphire Rapids Pinpoints Per-Core Errors

Sapphire Rapids
(Image credit: Los Alamos National Laboratory)

Intel posted a new Linux driver yesterday that reveals an interesting feature coming to Sapphire Rapids server processors. Dubbed In-Field Scan, the feature takes silicon integrity and reliability checks to the next level. Phoronix spotted the driver and highlighted the importance of discovering defects in server processors before deployment or before they work on critical tasks.

Servers already boast parity or ECC checks as standard features for RAM, and there are similar checks on data going to and from storage, across networks, and more. However, the new In-Field Scan feature specifically targets the processor. To get the new In-Field Scan system to work, Intel Sapphire Rapids Xeon CPUs will have integrated checking features, and the In-Field Scan kernel driver will provide a user interface for the checks.

After delving deeper into the software, Phoronix reports that the Intel In-Field Scan driver can initiate full CPU tests. More granular control is also available, allowing each CPU core to be tested separately. The scan results are then saved in log files.
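As a rough illustration of what per-core testing could look like from user space, here is a minimal Python sketch. It assumes the driver exposes a sysfs directory containing `run_test`, `status` and `details` files, and that writing a logical CPU number to `run_test` scans the core that owns it. The directory path and the helper names `scan_core` and `scan_cores` are assumptions for illustration, not details confirmed by Intel's announcement, so check the kernel documentation for your version before relying on them.

```python
from pathlib import Path

# Assumed sysfs location of the In-Field Scan interface. The real path
# depends on the kernel version and may differ on your system.
IFS_DIR = Path("/sys/devices/virtual/misc/intel_ifs_0")


def scan_core(cpu: int, ifs_dir: Path = IFS_DIR) -> dict:
    """Request a scan of the core owning logical CPU `cpu` and read back the result."""
    # Writing the CPU number is assumed to start the test and to return
    # once the scan of that core has finished.
    (ifs_dir / "run_test").write_text(f"{cpu}\n")
    return {
        "cpu": cpu,
        "status": (ifs_dir / "status").read_text().strip(),
        "details": (ifs_dir / "details").read_text().strip(),
    }


def scan_cores(cpus, ifs_dir: Path = IFS_DIR) -> list:
    """Scan several cores one after another and collect their results."""
    return [scan_core(cpu, ifs_dir) for cpu in cpus]
```

On a real system, something like this would need root privileges and the CPU-specific test files loaded. An administrator could call `scan_cores(range(4))` from a scheduled job and flag any core whose status is not a pass.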

When In-Field Scan becomes available, system builders or administrators might run it before commissioning a server, before specific critical tasks are run on the server, or simply on a given schedule. The Linux driver is out now, but the CPU-specific test files that reveal more about the checks and CPU characteristics aren't currently available.

Why Do We Need In-Field Scan Now?

With the growth of massive data centers thanks to the internet economy, streaming video/gaming and cloud computing, manageability is increasingly important to the smooth running of servers. For example, if specific CPUs or cores can self-report errors, they can be easily found and swapped.

Another movement in technology that might make In-Field Scan an important tool is the race to the Angstrom era. As chip features get smaller in the search for higher density, performance and efficiency, they become more susceptible to known errors and unexplained errors – sometimes called soft errors.

Soft errors might occur more often in our most state-of-the-art chips due to pushing physics and the physical size of transistors to their extremes.

Some think that the errors may not stem from the new, smaller chip processes themselves but simply from the susceptibility of these tiny structures to the nature of the universe (e.g., cosmic rays).
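To make the idea concrete, a soft error typically shows up as a stored bit flipping from 0 to 1 or vice versa. The short Python sketch below is purely illustrative and is not how In-Field Scan works. It shows how a single even-parity bit, of the kind used in the memory checks mentioned earlier, detects one flipped bit. The function names and the sample value are made up for this example.

```python
def even_parity_bit(word: int) -> int:
    """Return 1 if `word` has an odd number of set bits, otherwise 0."""
    return bin(word).count("1") % 2


def is_corrupted(word: int, stored_parity: int) -> bool:
    """Report whether `word` no longer matches the parity bit stored with it."""
    return even_parity_bit(word) != stored_parity


original = 0b1011_0010
parity = even_parity_bit(original)  # stored alongside the data
flipped = original ^ (1 << 5)       # simulate a single-bit upset

print(is_corrupted(original, parity))  # False: data is intact
print(is_corrupted(flipped, parity))   # True: the flip is detected
```

Parity can only detect an odd number of flipped bits and cannot say which bit changed, which is why servers usually rely on ECC memory that can also correct single-bit errors.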

A year ago, news hit the wires that NASA's cutting-edge Mars rover, Perseverance, was running the same single-core PowerPC CPU that powered the Bondi Blue iMac back in 1998. In brief, the reason for this apparent tech mismatch was that the RAD750 (based on the PowerPC 750) was hardened to withstand up to 1,000,000 rads and extreme temperatures. Moreover, the logic gates of its larger process were far less susceptible to cosmic ray interference than those of modern CPUs. Of course, our atmosphere reduces cosmic radiation on Earth, but it is still there.

Interestingly, Intel gauges how susceptible its chips and ICs are to cosmic ray interference using a particle accelerator at the Los Alamos Neutron Science Center.

Mark Tyson
Freelance News Writer

Mark Tyson is a Freelance News Writer at Tom's Hardware US. He enjoys covering the full breadth of PC tech, from business and semiconductor design to products approaching the edge of reason.