Tyan S4985 G3NR-SI + M4985-SI


We have a workstation with an S4985 G3NR-SI. It runs 48 Opteron cores with 96 GB of RAM. It reboots randomly under intensive workloads. I've isolated the problem to the M4985-SI add-on board, since the system runs fine without it. I've also tested the memory with memtest86+ 4.00 and it comes out fine.

What other tests can I run to be sure where the problem is? The M4985-SI is not cheap and I would like to be certain before buying a replacement board. How can I test for CPU problems and identify which CPU is failing?

The system is running openSUSE 11.4.

Thanks for the help!
  1. So you are talking about the CPU add-on board for the 8-socket solution? Have you considered that it might be a power issue when it is installed?
  2. Yes, it's the 8-socket solution. The workstation has four redundant 1400 W power supplies.
  3. So we can rule out a power issue; then the card looks to be the problem.
  4. I'm trying to be certain if possible, since it's a pricey part. I wish it were memory, but yesterday I swapped memory between the main board and the add-on board. I tested with the main board only and memtest86 ran for 24 hours without errors. So more and more it seems to be an add-on board issue.

    Should I swap/test the CPUs? I believe CPU problems would be more obvious and detectable.
  5. It is the only test you have not tried, so I would say so.
  6. I downloaded a stress test program called y-cruncher. It can run memory stress routines by computing pi to a huge number of digits. I ran y-cruncher in stress test mode and was able to reproduce the problem: the system crashed and rebooted. This seems to confirm that the problem is faulty memory on the M4985-SI add-on board.

    A few days ago I entered the BIOS setup, lowered the memory speed from 667 MHz to 533 MHz, and also disabled the memory cache buffers. This lowered system performance somewhat but made it more stable. I was able to run the y-cruncher stress test and the system held up.

    So I will be replacing the faulty memory soon.
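
For anyone who lands here with the original question of how to identify which CPU is failing: one rough approach on Linux is to pin a deterministic workload to one core at a time and compare the results. The sketch below is only an illustration and was not part of the troubleshooting above. It assumes Python 3.6+ on Linux (os.sched_setaffinity is Linux-only), and the names workload and check_cpus are made up for this example. It takes the first result as the reference, so if the first core tested is the bad one, every other core will be reported as mismatching instead. A hard reset will not be reported at all, which is why the core number is printed before each run.

```python
import hashlib
import os
import sys


def workload(rounds=200_000):
    """Deterministic CPU-bound work: chain SHA-256 over a fixed seed."""
    digest = b"cpu-check-seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def check_cpus(cpus=None, rounds=200_000, passes=3, log=sys.stderr):
    """Run the workload pinned to each CPU; return {cpu: mismatch_count}."""
    original = os.sched_getaffinity(0)  # CPUs this process may run on
    if cpus is None:
        cpus = sorted(original)
    reference = None  # first result seen; assumed to be correct
    mismatches = {}
    try:
        for cpu in cpus:
            # Log before the run, so the last line printed names the
            # core that was active if the machine resets.
            print(f"testing cpu {cpu}", file=log, flush=True)
            os.sched_setaffinity(0, {cpu})
            bad = 0
            for _ in range(passes):
                result = workload(rounds)
                if reference is None:
                    reference = result
                elif result != reference:
                    bad += 1
            mismatches[cpu] = bad
    finally:
        # Restore the original affinity even if a run raises.
        os.sched_setaffinity(0, original)
    return mismatches
```

Calling check_cpus() with no arguments tests every core the process is allowed to run on, and any core with a nonzero count is worth a closer look. A short single-threaded run like this puts far less load on the system than y-cruncher, so a clean result does not prove the CPUs are good.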
