ECC memory causing random server reboots

jamesmarkoff

Reputable
Jun 15, 2015
I'm running Ubuntu Server 14.04 on a Supermicro X10SLM-F with a Xeon E3-1271 v3.
Memory: Super Talent 32GB DDR3 1600 ECC

About every four days, the Ubuntu logs show this:
Code:
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]:  Error 0, type: corrected
{1}[Hardware Error]:  fru_text: CorrectedErr
{1}[Hardware Error]:   section_type: memory error
[Firmware Warn]: error section length is too small
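
To track how often these records appear, the log lines above can be grouped into one entry per error. This is only a minimal Python sketch: it assumes the kernel log keeps the `{N}[Hardware Error]:` prefix shown above, and the function names `parse_hardware_errors` and `is_corrected_memory_error` are made up for illustration.

```python
import re
from typing import Dict, List

# Matches lines such as "{1}[Hardware Error]:  Error 0, type: corrected".
# search() is used below so a syslog timestamp prefix does not matter.
RECORD_LINE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s*(.*)")


def parse_hardware_errors(log_text: str) -> List[Dict[str, str]]:
    """Group "{N}[Hardware Error]" lines into one dict per error record."""
    records: List[Dict[str, str]] = []
    current_id = None
    for line in log_text.splitlines():
        match = RECORD_LINE.search(line)
        if not match:
            continue  # e.g. the "[Firmware Warn]" line
        record_id, body = match.groups()
        # A new record starts at "Hardware error from ..." or when the
        # sequence number in braces changes.
        if body.startswith("Hardware error from") or record_id != current_id:
            records.append({"id": record_id})
            current_id = record_id
        if ":" in body:
            key, _, value = body.partition(":")
            records[-1][key.strip()] = value.strip()
    return records


def is_corrected_memory_error(record: Dict[str, str]) -> bool:
    """True for records like the one in the log excerpt above."""
    return (
        record.get("event severity") == "corrected"
        and record.get("section_type") == "memory error"
    )
```

Feeding it the contents of the kernel log and counting the records where `is_corrected_memory_error` returns true gives a rough error rate to compare before and after swapping hardware.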

Immediately after this, the server reboots itself abruptly, as if it had been power-cycled.

When I look in the BIOS event log, I see this:
Code:
DATE        TIME        ERROR CODE     SEVERITY
06/13/15    13:13:38    Smbios 0x02    P1-DIMMB2

And the description of the error is:
Code:
Single Bit ECC Memory Error
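
The `P1-DIMMB2` field in the event log row looks like a slot label. A small sketch for splitting it up, assuming (this is not confirmed in the thread) that the label follows a `P<cpu>-DIMM<channel><slot>` pattern; `DimmSlot` and `parse_slot_label` are hypothetical names:

```python
import re
from typing import NamedTuple


class DimmSlot(NamedTuple):
    cpu: int      # assumed: processor number after "P"
    channel: str  # assumed: memory channel letter
    slot: int     # assumed: slot number within the channel


# Assumed label layout: P<cpu>-DIMM<channel><slot>, e.g. "P1-DIMMB2".
SLOT_LABEL = re.compile(r"^P(\d+)-DIMM([A-Z])(\d+)$")


def parse_slot_label(label: str) -> DimmSlot:
    """Split a label such as "P1-DIMMB2" into its parts."""
    match = SLOT_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised slot label: {label!r}")
    cpu, channel, slot = match.groups()
    return DimmSlot(int(cpu), channel, int(slot))
```

With the slot broken out this way, repeated event log entries can be grouped by slot to see whether the errors always land on the same one.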

A few questions:
1. If ECC memory is self-correcting, why does the machine reboot itself?
2. Am I perhaps missing a BIOS setting that would stop the box from rebooting itself?
3. Is this clearly a memory stick issue, or could it be a slot issue or a CPU issue?
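
One way to gather data for question 3 is to read the per-DIMM corrected-error counters that the Linux EDAC subsystem exposes in sysfs, then swap sticks between slots and see whether the count follows the stick or stays with the slot. The sketch below assumes an EDAC driver is loaded for this memory controller (which may not be the case on every kernel) and that the `dimm*/dimm_ce_count` and `dimm*/dimm_label` files are present; `read_dimm_ce_counts` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List, Tuple

# Default location of the EDAC memory controller entries in sysfs.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_dimm_ce_counts(edac_root: str = EDAC_ROOT) -> List[Tuple[str, str, int]]:
    """Return (location, label, corrected_error_count) for each DIMM found."""
    results: List[Tuple[str, str, int]] = []
    for dimm_dir in sorted(Path(edac_root).glob("mc*/dimm*")):
        count_file = dimm_dir / "dimm_ce_count"
        if not count_file.is_file():
            continue  # skip entries without a corrected-error counter
        label_file = dimm_dir / "dimm_label"
        label = label_file.read_text().strip() if label_file.is_file() else ""
        location = f"{dimm_dir.parent.name}/{dimm_dir.name}"
        results.append((location, label, int(count_file.read_text().strip())))
    return results


if __name__ == "__main__":
    for location, label, count in read_dimm_ce_counts():
        print(f"{location} {label or '(no label)'}: {count} corrected errors")
```

If the directory does not exist, the function simply returns an empty list, which would suggest that no EDAC driver is active on that machine.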

Thank you for any advice.
 

jamesmarkoff

Reputable
Jun 15, 2015
It did come in a single package. This also happens on some other machines on my network, which are also Supermicro. I am currently working on replacing the RAM with modules from the tested memory list for this specific board:
http://www.supermicro.com/products/motherboard/xeon/c220/x10slm-f.cfm
It seems the only supported brands are Hynix and Samsung. Super Talent is not on the list.

Because the problem shows up on several machines, I don't think replacing one specific stick will do the trick.