Overheating problems on Skulltrail System

timmiroon

Distinguished
Nov 25, 2009
7
0
18,510
Hi all,

I wonder if anyone can suggest some things to check and investigate regarding the following problem:

We have a Skulltrail system with two Xeon 5420s installed. We're not overclocking or doing anything adventurous like that, but for whatever reason one of the CPUs (CPU1) is running hot. CPU0 runs at around 36-40C when idle, while CPU1 sits in the 60-70C range.

60-70C is well within the operating range of the chip, but something's clearly wrong: currently the BIOS and the OSes show only a single CPU (four cores instead of eight).
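(For reference, a quick way to confirm how many logical CPUs the OS actually enumerates, assuming a Linux box, which isn't stated above, is a small sketch like this:)

// Sketch only: count the logical CPUs the kernel has brought online,
// to confirm whether the second socket is visible to the OS.
// Assumes Linux; /proc/cpuinfo has one "processor : N" entry per logical CPU.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    int logical = 0;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("processor", 0) == 0)  // line starts with "processor"
            ++logical;
    }
    std::cout << "Logical CPUs visible to the OS: " << logical << "\n";
    // Two quad-core Xeon 5420s (no Hyper-Threading) should report 8 here.
    return 0;
}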

Furthermore, sometimes the system will not boot at all; when it does, only one CPU shows up in the OS.

We had the same problem a week ago and put it down to a poor heatsink/fan install, so we reseated the cooler and everything seemed fine again, but after stressing the system overnight the symptoms have returned.

Prior to last week we had not noticed anything wrong, although that's not to say there wasn't. The system has been used fairly intensively for about six months. All components are pretty much top-of-the-range and, to the best of our knowledge, everything has been installed correctly.

Help!! Anybody have ideas for things to check, diagnostics to run, or perhaps experience with dodgy Skulltrail boards or Xeon chips??


Full(ish) Spec:
-Skulltrail D5400XS
-2 x Xeon 5420 2.5GHz
-1200W PSU
-16GB RAM
-2 x NVidia Tesla
-1 x NVidia GTX 295
 
Sounds to me like it could be a bad chip or motherboard. To test it, try switching the chips around, i.e. put the hot one in the socket of the good one and vice versa. If the same chip is still running hot and the other symptoms persist, you know it isn't the motherboard. Intel has a 3-year warranty on their chips, I believe, so if worst comes to worst you should be able to get a replacement for free.
 

endorphines

Distinguished
Mar 11, 2008
68
0
18,640


I agree, that should narrow it down to the chip or the board... personally I'd hope it's a chip; RMAing the board will take a while and you won't have a working system in the meantime.
 

MRFS

Distinguished
Dec 13, 2008
1,333
0
19,360
... and, call Intel for their recommendation(s): can't hurt.


BTW: is this your HSF?

http://www.intel.com/support/processors/xeon5k/sb/CS-022301.htm


Another thing to check is the CPU core voltage setting in the BIOS:
it may be that a spurious error bumped up the core voltage
on the hot CPU; this is easy enough to check.

I mention the latter only because we saw a random error in our
BIOS recently: the memory latency settings changed spontaneously
and in conflict with the JEDEC settings in the SPD chip:
as soon as we isolated that change, we set the DRAM back to AUTO
and stability returned.


MRFS
 

timmiroon

Distinguished
Nov 25, 2009
7
0
18,510
Thanks for all your replies!

@obsidian86 -Cooler Master Cosmos

@MRFS - Yup! Checked those, all good...



Yesterday after posting I switched the chips as some of you suggested. To do this I had to completely strip the system, because the heatsink/fans are fastened on the back of the mobo, which isn't very practical. So after switching the chips, but before a full rebuild, I ran an OpenMP app I have that is known to push the system. CPU1 ran slightly hotter but didn't go over 53C (recall that CPU1 was the suspect socket). CPU0 didn't get over 51C. Not bad.
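(For anyone wanting to reproduce this kind of load, the burn loop below is just an illustration, not the actual app used here; any OpenMP loop that keeps every core doing floating-point work will do. Compile with something like g++ -fopenmp burn.cpp -o burn.)

// Minimal OpenMP burn loop (illustration only, not the app referred to above).
#include <cmath>
#include <cstdio>
#include <omp.h>

int main() {
    const long iters = 2000000000L;  // a few minutes of FP work per thread; adjust to taste
    double sink = 0.0;
    #pragma omp parallel reduction(+:sink)
    {
        // Every thread hammers the FPU so each core reaches full load.
        for (long i = 0; i < iters; ++i)
            sink += std::sin(i * 0.001) * std::cos(i * 0.002);
    }
    std::printf("checksum %f, threads %d\n", sink, omp_get_max_threads());
    return 0;
}

Watching the per-core temperatures while something like that runs (e.g. with lm-sensors on Linux) makes it obvious whether one socket climbs faster than the other.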

I decided to rebuild. After rebuilding I noticed that the chipset and CPU1 were noticeably hotter. To cut a long story short, the GPUs are the problem. Even when idle they get way too hot (60C). Not only that, but they obstruct the fan on the chipset, which they almost touch, and all of this is right below CPU1.

Today I took out the middle GPU to make some breathing space and ran with a desk fan blowing into the open side of the machine. It's much better like that, and I'll install a fan in the side panel tomorrow.
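(A sketch for keeping an eye on the GPU temperatures while experimenting with fan placement; it assumes a driver recent enough to ship the NVML library, so link with -lnvidia-ml. Running nvidia-smi -q in a loop gives much the same information without any code.)

// NVML polling sketch: prints the core temperature of every NVIDIA GPU.
// Assumes nvml.h and libnvidia-ml from the driver package are available.
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        std::fprintf(stderr, "NVML init failed\n");
        return 1;
    }
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        unsigned int tempC = 0;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
        std::printf("GPU %u: %u C\n", i, tempC);
    }
    nvmlShutdown();
    return 0;
}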

Photo-0024b.jpg

Not much space, or air..
 

MRFS

Distinguished
Dec 13, 2008
1,333
0
19,360
Good for you! Just a few comments, because I can't confirm
these points from your photo (forgive me for my lack
of complete information):


(1) PSU should intake cooler air from the fan grill in the bottom panel,
and exhaust out the rear; so, its intake fan should be pointing DOWN,
which is how you have it:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817202018&Tpk=chieftec


(2) I'm not totally familiar with the Teslas' cooling: they should be
exhausting out the rear panel, like these:

http://www.nvidia.com/object/product_tesla_C2050_C2070_us.html

And, that's how you appear to have them installed.

So far, so good. Your experiment tells us a LOT!! :)


(3) I would recommend adding 2 strong fans in the left-side panel:
one to force air into the gaps between your GPUs, and
one to force air into the space between the CPUs and upper GPU;
the ideal is to feed each heat source with cooler air, and
to exhaust that warmer air from each heat source without
warming any other interior components with that exhaust air;


(4) I'm also not totally familiar with that chassis: the photos at Newegg
show 2 exhaust fan grills in the top panel: you should have at least
one large fan installed there also, ideally 2 high CFM fans:

http://www.newegg.com/Product/Product.aspx?Item=N82E16811119138&Tpk=N82E16811119138


(5) if you have an empty 5.25" drive bay, I would add an intake fan there,
or at least remove one or more bay covers, in order to allow more cooler
air to enter at the front panel;


(6) all of the new fans that you add should have a variable speed
switch, to permit you to "tune" the speed of those fans;
a single-speed fan may not be fast enough at its single speed:
so, pay attention to CFM when you select these supplemental fans;


(7) make an effort to balance INTAKE CFM with EXHAUST CFM
(see the quick tally sketch after this list): a wide variance
will be less than ideal: you want constant, even air flow
around all interior heat sources, and the exhaust from any
given component(s) should not be heating any other interior component(s);


(8) the photos at Newegg also show an optional intake fan
in the bottom panel: I don't see it in your photo, however:
http://www.newegg.com/Product/Product.aspx?Item=N82E16811119138&Tpk=N82E16811119138
that's an excellent place to add an intake fan, because hot air rises
and cold air falls: thus, the air below the bottom panel should be
THE COLDEST AIR of any in contact with all 6 chassis sides.
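(Point 7 is just arithmetic, but a quick tally like the sketch below keeps you honest when mixing fans; the CFM figures are placeholders, not readings from this build.)

// Quick intake-vs-exhaust CFM tally for point (7). Placeholder ratings only.
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> intakeCFM  = {70.0, 70.0, 40.0};  // e.g. two side fans + front bay fan
    std::vector<double> exhaustCFM = {70.0, 70.0, 30.0};  // e.g. two top fans + slot fan
    double in  = std::accumulate(intakeCFM.begin(),  intakeCFM.end(),  0.0);
    double out = std::accumulate(exhaustCFM.begin(), exhaustCFM.end(), 0.0);
    std::printf("intake %.0f CFM vs exhaust %.0f CFM (net %+.0f CFM)\n", in, out, in - out);
    // A large mismatch means air gets pulled in or pushed out through unplanned gaps
    // instead of flowing past the components you care about.
    return 0;
}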


I hope this helps.


MRFS
 

MRFS

Distinguished
Dec 13, 2008
1,333
0
19,360
p.s. Here's one more idea: a baffle between the
top-most GPU and your CPUs might effectively "insulate"
these two interior regions. One way to do that
is to add a standard slot fan to your top (empty) PCI slot
e.g.:

http://www.newegg.com/Product/Product.aspx?Item=N82E16835129025&Tpk=N82E16835129025
http://www.newegg.com/Product/Product.aspx?Item=N82E16811999704&Tpk=N82E16811999704


Here's another one, but I can't remember if it
intakes air from the rear panel: if so, this
one would NOT work because you already
have lots of hot air rising up along the rear panel:
as such, this one would suck that warm air back in
(NOT GOOD):

http://www.newegg.com/Product/Product.aspx?Item=N82E16835114024&Tpk=N82E16835114024
http://www.azenx.com/BT_SC70BBL_Flyer.pdf


UPDATE: I now remember installing it in our aging storage server (3.2GHz "PressHot"):
I just booted it up for you, and I can confirm that it does EXHAUST OUT the rear panel,
so it would be ideal for your top-most empty PCI slot, particularly with its variable potentiometer
(if I'm reading your photo correctly :)


GOOD LUCK!


MRFS
 

timmiroon

Distinguished
Nov 25, 2009
7
0
18,510
Thanks for the response, my comments inline....




So, to summarize, I've removed one GPU and added three fans. The result: the GPUs
are cooler when idle, hanging around the 50C mark, which is pretty reasonable.
I just ran a pretty heavy burn-in with all three GPUs (CUDA treats the GTX 295 as
two separate devices) running flat out for fifteen minutes, and the highest they got
was 72C, which is pretty good compared with the 80C+ of a few days ago.
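(Since the GTX 295 counting as two devices surprises people: the runtime-API sketch below, which is not taken from the burn-in app, shows how the cards enumerate; with one Tesla plus the GTX 295 it should list three devices.)

// CUDA runtime device-enumeration sketch (compile with nvcc, or as plain C++
// linked against -lcudart). A GTX 295 appears as two devices, so one Tesla
// plus one GTX 295 should list three entries here.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::fprintf(stderr, "No CUDA devices/driver found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("Device %d: %s, %d multiprocessors, %.0f MB\n",
                    i, prop.name, prop.multiProcessorCount,
                    prop.totalGlobalMem / (1024.0 * 1024.0));
    }
    return 0;
}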

The side fan seems to help a lot. The DIMMs were giving readings of about 70C before
and are now more like 55-65C, idle and under load respectively.

On a more general note, I'd add the following. This system is based on NVidia's
recommendations for a "roll-your-own" Tesla personal supercomputer. They recommend
the Skulltrail for a three-Tesla system with a cheapish Quadro card for graphics
(Skulltrail has no on-board graphics). As you can see, we went for 2x Tesla and the
GTX 295. From personal experience and from talking to others over the past few days,
I'd say that building a multi-Tesla system is no easy thing.

The problem is space and cooling, which is really one and the same problem. When I
designed this system, the mobo question was whether to go with the single-socket
Foxconn Destroyer or the dual-socket Skulltrail. The Destroyer has space for four
Teslas, or other large form-factor GPUs. In either case you end up stacking them
very, very close together. This is bad.

We didn't have any problems for a long time because it's a test machine and, until
recently, our tests have been small. We're now doing long, intensive runs utilising
all GPUs, and they get really hot.

If people can learn anything from this, it's that if you build a multi-GPU system,
cooling is going to be your problem. Despite NVidia's recommendations, don't stack
your GPUs so close together, because they won't be able to breathe and will generate
so much heat that you might end up with problems elsewhere in the system, like I did.
MRFS's suggestion to use a PCI-mounted cooling solution is a good one. Stacking GPU,
then slot fan, then GPU, and so on would be a smart thing to do, but then where are
you gonna find a mobo with enough slots?
 

zehpavora

Distinguished
Apr 1, 2009
91
0
18,630
Have you ever thought about water-cooling? Normally it's low-profile (at least for the CPU blocks), relatively quiet, and dissipates a lot more heat than air. Since you have three GPUs, I'd suggest you look into it. Maybe you only need to water-cool the GPUs, with one or two loops, since from what you're saying they generate a lot of heat...
 

timmiroon

Distinguished
Nov 25, 2009
7
0
18,510
@ zehpavora

Not really. Water cooling is more of a CPU overclocker's thing, right? Excuse my ignorance, but is there a water-cooling solution for GPUs?

To add some context to this whole thread: I'm not really a hardcore hardware kinda guy, I just know how to put a machine together, hence my underestimation of the cooling requirements for this kind of rig.
 

zehpavora

Distinguished
Apr 1, 2009
91
0
18,630
Well, WC IS aimed at overclockers. However, you don't have to be one to use it. You can water-cool everything from CPUs to PSUs (I've seen it...).

It gets a little expensive depending on the size of the build, but it would be worth water-cooling your cards so that they stay cool even when running flat out.

Your computer may be having difficulty booting because of heat, so try to prove the point by removing the cards, doing some cable management, providing better airflow, or even changing the cooling setup, as you've already been told.

Do what you think is best. I hope we were helpful.