Overheating problems on Skulltrail System

November 25, 2009 11:28:43 AM

Hi all,

I wonder if anyone can suggest some things to check and investigate regarding the following problem:

We have a Skulltrail system with two Xeon 5420s installed. We're not overclocking or anything adventurous like that, but for whatever reason one of the CPUs (CPU1) is running hot. CPU0 runs at around 36-40C when idle and CPU1 in the 60-70C range.

60-70C is well within the operating range of the chip, but something's wrong! Currently the BIOS and OSes show only a single CPU (or four cores as opposed to eight).

Furthermore, sometimes the system will not boot at all; when it does, only one CPU shows in the OS.
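
For reference, a quick way to double-check what the OS itself sees (just a minimal sketch, assuming a Linux/POSIX install; on Windows, Task Manager shows the same thing):

/* Minimal sketch (POSIX/Linux assumed): print how many logical CPUs the
 * kernel has online, to confirm whether the OS really sees only 4 of the
 * expected 8 cores. Build with: gcc -O2 cpucount.c -o cpucount */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);   /* CPUs currently online */
    long known  = sysconf(_SC_NPROCESSORS_CONF);   /* CPUs the kernel knows about */

    printf("CPUs online: %ld\n", online);
    printf("CPUs known:  %ld\n", known);
    return 0;
}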

We had the same problem a week ago and put it down to a poor heatsink/fan install, so we did a reinstall and everything seemed to be fine again, but after stressing the system overnight the symptoms have returned.

Prior to last week we had not noticed anything wrong, although that's not to say there wasn't anything wrong. The system has been used relatively intensively for about six months. All components are pretty much top-of-the-range and, to the best of our knowledge, everything has been installed correctly.

Help!! Does anybody have any ideas for things to check, diagnostics to run, or perhaps experience with dodgy Skulltrail boards or Xeon chips?


Full(ish) Spec:
-Skulltrail D5400XS
-2 x Xeon 5420 2.5GHz
-1200W PSU
-16GB RAM
-2 x NVidia Tesla
-1 x NVidia GTX 295
November 26, 2009 9:57:55 AM

Sounds to me like it could be a bad chip or motherboard. To test it, try switching the chips around, i.e. put the hot one in the slot of the good one and vice versa. If it is still running hot and the other symptoms persist, you know it isn't the motherboard. Intel has a 3-year warranty on their chips, I believe, so if worse comes to worst you should be able to get another one for free.
November 26, 2009 1:45:36 PM

buwish said:
Sounds to me like it could be a bad chip or motherboard. To test it, try switching the chips around, i.e. put the hot one in the slot of the good one and vice versa. If it is still running hot and the other symptoms persist, you know it isn't the motherboard. Intel has a 3-year warranty on their chips, I believe, so if worse comes to worst you should be able to get another one for free.


I agree, that should narrow it down to a chip or the board... personally I'd hope it's a chip; RMAing the board will take a while and you won't have a working system in the meantime.
November 26, 2009 4:31:52 PM

what case?
November 26, 2009 4:51:56 PM

The one-chip thing eliminates outside issues like case size and air flow, so it's gotta be HS, CPU or MoBo related. Mixing and matching as indicated by endo seems like the best bet.
November 26, 2009 4:59:00 PM

... and, call Intel for their recommendation(s): can't hurt.


BTW: is this your HSF?

http://www.intel.com/support/processors/xeon5k/sb/CS-02...


Another thing to check is the CPU core voltage setting in the BIOS:
it may be that a spurious error resulted in upping the core voltage
on the hot CPU: this is easy enough to check.

I mention the latter only because we saw a random error in our
BIOS recently: the memory latency settings changed spontaneously
and in conflict with the JEDEC settings in the SPD chip:
as soon as we isolated that change, we set the DRAM back to AUTO
and stability returned.


MRFS
November 26, 2009 7:29:31 PM

Thanks for all your replies!

@obsidian86 -Cooler Master Cosmos

@MRFS - Yup! Checked those, all good...



Yesterday, after posting, I switched the chips as some of you suggested. To do this I had to completely strip the system because the heatsink/fans are fastened on the back of the mobo, which isn't very practical. So after switching the chips, but without a full rebuild, I ran an OpenMP app I have which is known to push the system. CPU1 ran slightly hotter but didn't go over 53C (recall that CPU1 was the suspect socket). CPU0 didn't get over 51C. Not bad.
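
(In case it's useful to anyone else: the stress app is nothing exotic. A rough sketch of that sort of OpenMP burn loop, not our exact code, would look something like this, built with gcc -O2 -fopenmp burn.c -o burn:)

/* Rough sketch of an OpenMP CPU burner (not the actual app): spin one
 * thread per logical CPU on floating-point work so both sockets heat up. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    double total = 0.0;

    /* One thread per logical CPU, each grinding through a long FP loop. */
    #pragma omp parallel reduction(+:total)
    {
        double x = 1.0001 + omp_get_thread_num();
        for (long i = 0; i < 2000000000L; ++i)
            x = x * 1.0000001 + 0.0000001;   /* keep the FP units busy */
        total += x;
    }

    printf("done (%f)\n", total);   /* print the result so the loop isn't optimized away */
    return 0;
}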

I decided to rebuild. After rebuilding I noted that the chipset and CPU1 were noticeably hotter. To cut a long story short, the GPUs are the problem. Even when idle they get way too hot (60C). Not only that, but they obstruct the fan on the chipset, which they almost touch, and all of this is right below CPU1.

Today I took out the middle GPU, making some breathing space, and ran with a desk fan blowing into the open side of the machine. It's much better like that, and I'll install a fan in the side panel tomorrow.


Not much space, or air..
November 26, 2009 8:08:21 PM

Good for you! Just a few comments, because I can't confirm
from your photo these points (and forgive me for my lack
of complete information):


(1) PSU should intake cooler air from the fan grill in the bottom panel,
and exhaust out the rear; so, its intake fan should be pointing DOWN,
which is how you have it:

http://www.newegg.com/Product/Product.aspx?Item=N82E168...


(2) I'm not totally familiar with the Tesla's cooling: they should be
exhausting out the rear panel, like these:

http://www.nvidia.com/object/product_tesla_C2050_C2070_...

And, that's how you appear to have them installed.

So far, so good. Your experiment tells us a LOT!! :) 


(3) I would recommend adding 2 strong fans in the left-side panel:
one to force air into the gaps between your GPUs, and
one to force air into the space between the CPUs and upper GPU;
the ideal is to feed each heat source with cooler air, and
to exhaust that warmer air from each heat source without
warming any other interior components with that exhaust air;


(4) I'm also not totally familiar with that chassis: the photos at Newegg
show 2 exhaust fan grills in the top panel: you should have at least
one large fan installed there also, ideally 2 high CFM fans:

http://www.newegg.com/Product/Product.aspx?Item=N82E168...


(5) if you have an empty 5.25" drive bay, I would add an intake fan there,
or at least remove one or more bay covers, in order to allow more cooler
air to enter at the front panel;


(6) all of the new fans that you add should have a variable speed
switch, to permit you to "tune" the speed of those fans;
a single-speed fan may not be fast enough at its single speed:
so, pay attention to CFM when you select these supplemental fans;


(7) make an effort to balance INTAKE CFM with EXHAUST CFM:
a wide variance will be less than ideal: you want constant,
even air flow around all interior heat sources, and the
exhaust from any given component(s) should not be
heating any other interior component(s);


(8) the photos at Newegg also show an optional intake fan
in the bottom panel: I don't see it in your photo, however:
http://www.newegg.com/Product/Product.aspx?Item=N82E168...
that's an excellent place to add an intake fan, because hot air rises
and cold air falls: thus, the air below the bottom panel should be
THE COLDEST AIR of any in contact with all 6 chassis sides.


I hope this helps.


MRFS
November 26, 2009 8:23:47 PM

p.s. Here's one more idea: a baffle between the
top-most GPU and your CPUs might effectively "insulate"
these two interior regions. One way to do that
is to add a standard slot fan to your top (empty) PCI slot
e.g.:

http://www.newegg.com/Product/Product.aspx?Item=N82E168...
http://www.newegg.com/Product/Product.aspx?Item=N82E168...


Here's another one, but I can't remember if it
intakes air from the rear panel: if so, this
one would NOT work because you already
have lots of hot air rising up along the rear panel:
as such, this one would suck that warm air back in
(NOT GOOD):

http://www.newegg.com/Product/Product.aspx?Item=N82E168...
http://www.azenx.com/BT_SC70BBL_Flyer.pdf


UPDATE: I now remember installing it in our aging storage server (3.2GHz "PressHot").
I just booted it up for you, and I can confirm that it does EXHAUST OUT the rear panel:
so, it would be ideal for your top-most empty PCI slot, particularly with its variable potentiometer
(if I'm reading your photo correctly). :)


GOOD LUCK!


MRFS
November 27, 2009 5:13:50 PM

Thanks for the response; my comments are inline...


MRFS said:
Good for you! Just a few comments, because I can't confirm
from your photo these points (and forgive me for my lack
of complete information):


(1) PSU should intake cooler air from the fan grill in the bottom panel,
and exhaust out the rear; so, its intake fan should be pointing DOWN,
which is how you have it:

http://www.newegg.com/Product/Product.aspx?Item=N82E168...

>>>Yes, it's sucking in from beneath and blowing out the back.



(2) I'm not totally familiar with the Tesla's cooling: they should be
exhausting out the rear panel, like these:

http://www.nvidia.com/object/product_tesla_C2050_C2070_...

And, that's how you appear to have them installed.

So far, so good. Your experiment tells us a LOT!! :) 

>>> That's right. The Teslas have an internal intake fan which I assume blows
across the chips and out the back. I've actually removed the middle GPU you
see in the pic to improve air flow to the chipset fan. This also improves air
flow to the remaining Tesla and the GTX.


(3) I would recommend adding 2 strong fans in the left-side panel:
one to force air into the gaps between your GPUs, and
one to force air into the space between the CPUs and upper GPU;
the ideal is to feed each heat source with cooler air, and
to exhaust that warmer air from each heat source without
warming any other interior components with that exhaust air;

>>>The case is actually the Cosmos "S", which is similar to the Cosmos but instead of
solid metal it's like a grille all over. The left side panel actually has a big 8-9
inch fan built in. However, I removed it when we originally put the system
together because the CPU heatsinks were too big and obstructed the fan.

Last night I took a saw to the fan and hacked a few bits off so that I could
reinstall it today. I also added a fan in the base, as you suggested, which is
sucking air in from the floor, and another in the front filling the lowermost three
drive bays, also sucking in.


(4) I'm also not totally familiar with that chassis: the photos at Newegg
show 2 exhaust fan grills in the top panel: you should have at least
one large fan installed there also, ideally 2 high CFM fans:

http://www.newegg.com/Product/Product.aspx?Item=N82E168...

>>>It's the Cosmos S and it has one fan in the top blowing out.


(5) if you have an empty 5.25" drive bay, I would add an intake fan there,
or at least remove one or more bay covers, in order to allow more cooler
air to enter at the front panel;

>>>See above :) 


(6) all of the new fans that you add should have a variable speed
switch, to permit you to "tune" the speed of those fans;
a single-speed fan may not be fast enough at its single speed:
so, pay attention to CFM when you select these supplemental fans;

>>> Hmmm... well, all the fans are the same, so I hope that's going to be OK.


(7) make an effort to balance INTAKE CFM with EXHAUST CFM:
a wide variance will be less than ideal: you want constant,
even air flow around all interior heat sources, and the
exhaust from any given component(s) should not be
heating any other interior component(s);

>>> That's a point! I have a front panel fan blowing on CPU1, whose exhaust blows over CPU0,
which in turn blows towards a rear panel fan exhausting out. Potentially
sub-optimal? Maybe. But CPU0 was never a problem, so I'll ignore that for now.


(8) the photos at Newegg also show an optional intake fan
in the bottom panel: I don't see it in your photo, however:
http://www.newegg.com/Product/Product.aspx?Item=N82E168...
that's an excellent place to add an intake fan, because hot air rises
and cold air falls: thus, the air below the bottom panel should be
THE COLDEST AIR of any in contact with all 6 chassis sides.

>>> Yup! Added one today


I hope this helps.

>>> It all helps!


MRFS


So, to summarize, I've removed one GPU and added three fans. The result: the GPUs
are cooler when idle, hanging around the 50C mark, which is pretty
reasonable. I just ran a pretty heavy burn-in with all three (CUDA treats the GTX 295 as
two separate devices) running flat out for fifteen minutes, and the highest they got
was 72C, which is pretty good compared with 80C+ a few days ago.
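
(As an aside, confirming that all three devices are actually visible before a burn-in is a one-minute job with the CUDA runtime API. This is just a sketch of that enumeration check, not our burn-in code; built with nvcc devices.cu -o devices:)

/* Sketch: list what CUDA sees, e.g. to confirm the GTX 295 really shows up
 * as two devices alongside the Tesla. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices: %d\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  %d: %s (%d multiprocessors)\n",
               i, prop.name, prop.multiProcessorCount);
    }
    return 0;
}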

The side fan seems to help a lot. The DIMMs were giving readings of about 70C before
and are now more like 55-65C, idle and working respectively.

On a more general note I'd add the following. This system is based on NVidia's
recommendations for a "roll-your-own" Tesla personal supercomputer. They recommend
the Skulltrail for a three-Tesla system with a cheapish Quadro card for graphics
(Skulltrail has no on-board graphics). As you can see, we went for 2x Tesla and the
GTX 295. From personal experience and from talking to others in the past few days,
I'd say that building a multi-Tesla system is no easy thing.

The problem is space and cooling, which is actually one problem. When I designed
this system, the mobo question was whether to go with the single-socket Foxconn
Destroyer or the dual-socket Skulltrail. The Destroyer has space for four Teslas, or
other large form-factor GPUs. In either case you end up stacking them very, very
close together. This is bad.

We didn't have any problems for a long time because it's a test machine and until
recently our tests have been small. We're now doing long, intensive runs utilising
all GPUs, and they get really hot.

If people can learn anything from this, it is this: if you build a multi-GPU system,
cooling is going to be your problem. Despite NVidia's recommendations, don't stack
your GPUs so close together, because they won't be able to breathe and will generate
so much heat that you might end up with problems elsewhere in the system, like I did.
MRFS's suggestion to use a PCI-mounted cooling solution is a good one. Stacking GPU,
then slot fan, then GPU and so on would be a smart thing to do, but then where are you
gonna find a mobo with enough slots?
November 27, 2009 5:49:56 PM

Have you ever thought about water-cooling? Normally it is low-profile (at least for the CPUs), relatively quiet, and dissipates a lot more heat than air. Since you have three GPUs, I suggest you do that. Maybe you only need to WC your GPUs, doing one or two loops, since they generate a lot of heat, from what you're saying...
November 27, 2009 5:55:26 PM

@ zehpavora

Not really. Water cooling is more of a CPU overclockers' thing, right? Excuse my ignorance, but is there a water-cooling solution for GPUs?

To add some context to this whole thread: I'm not really a hardcore hardware kind of guy, I just know how to put a machine together, hence my underestimation of the cooling requirements for this kind of rig.
November 28, 2009 12:17:49 AM

Well, WC IS aimed at overclockers. However, you don't have to be one to use it. You can use water on everything from CPUs to PSUs (I've seen it...).

It gets a little expensive depending on the size of the build, but it would be better if you water-cooled your cards, so that they stay quite cool when running.

Your computer may have difficulty booting because of heat, so try to prove this point by removing the cards, doing some cable management, providing better airflow, or even changing the cooling setup, as you've already been told.

Do what you think is best. I hope we were helpful!