We have a workstation with an S4985 G3NR-SI board. It runs 48 Opteron cores with 96 GB of RAM. It reboots randomly under intensive workloads. I've isolated the problem to the M4985-SI add-on board, since the system runs fine without it. I've also tested the memory with memtest86+ 4.00 and it comes out fine.
What other tests can I run to be sure where the problem is? The M4985-SI is not cheap and I would like to be certain before buying a replacement board. How can I test for CPU problems and identify which CPU is failing?
I'm trying to be as certain as possible, since it's a pricey part. I wish it were the memory, but yesterday I swapped the memory between the main board and the add-on board. I then tested with the main board only, and memtest86 ran for 24 hours without errors. So more and more it looks like an issue with the add-on board.
Should I swap or test the CPUs? I believe CPU problems would be more obvious and easier to detect.
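One way to narrow a fault down to a single core is to pin a deterministic workload to each core in turn and check that the answer never changes. The following is only a rough sketch, assuming the machine runs Linux with Python 3 (`os.sched_setaffinity` is Linux-only). The names `workload`, `check_core` and the log file `core_test.log` are made up for illustration. Each core is written to the log before it is tested, so if the machine reboots, the last line of the log shows which core was running at the time.

```python
import hashlib
import os


def workload(rounds):
    """Deterministic CPU-bound work: hash a 32-byte buffer over and over."""
    data = b"\x00" * 32
    for _ in range(rounds):
        data = hashlib.sha256(data).digest()
    return data.hex()


def check_core(core, expected, rounds, repeats=3):
    """Pin this process to one core and confirm the result matches every time."""
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {core})  # Linux-only call
    try:
        return all(workload(rounds) == expected for _ in range(repeats))
    finally:
        # Restore the original affinity mask whatever happens.
        os.sched_setaffinity(0, original)


if __name__ == "__main__":
    rounds = 200_000  # assumption: raise this a lot for a real burn-in
    # Reference value; this assumes the first run itself was computed correctly.
    expected = workload(rounds)
    with open("core_test.log", "a") as log:  # hypothetical log file name
        for core in sorted(os.sched_getaffinity(0)):
            # Record the core before testing so the entry survives a reboot.
            log.write(f"testing core {core}\n")
            log.flush()
            os.fsync(log.fileno())
            ok = check_core(core, expected, rounds)
            log.write(f"core {core}: {'ok' if ok else 'MISMATCH'}\n")
            print(f"core {core}: {'ok' if ok else 'MISMATCH'}")
```

This only exercises integer hashing on one core at a time, so it puts little load on memory or the links between boards. A clean run does not rule out a fault that only shows up when all cores are busy.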
I downloaded a stress test program called y-cruncher. It can run memory stress routines by computing pi to a huge number of digits. I ran y-cruncher in stress test mode and was able to reproduce the problem: the system crashed and rebooted. This seems to confirm that the problem is faulty memory on the M4985-SI add-on board.
A few days ago I entered the BIOS setup, lowered the memory speed from 667 MHz to 533 MHz, and disabled the memory cache buffers. This lowered system performance somewhat, but made the system more stable. I was able to run the y-cruncher stress test and the system held up.