Yesterday, I had some guys in cabling some new cubes here at work. When they were working in the ceiling, they accidentally shut off the Air Conditioning for my server room.
This room holds 20 x86 servers (some of them dual processor), 1 IBM AS/400, the telephone switching equipment, switches, routers, csu/dsu's and hubs.
The only reason I found out there was even a problem was that I went to get some cable from the server room for a co-worker. As I approached the locked door, I heard an alarm beeping inside. I punched in the code and was blasted by a wall of 100F air coming out of the room.
Only one of my servers was sounding an alarm and all my LCD displays were darkened by the added temperature in the room. I proceeded to turn off the server that was close to crashing and waited until the Air Conditioning repair man fixed the problem before turning it back on.
After turning it on, I checked the BIOS to see how hot the processor had to get before the alarm came on and how hot it had to get before the system shut down completely.
At 65C it sounds an alarm, and at 70C it shuts down the computer. This particular computer runs the payroll database software for my entire company and if it had crashed I would have spent many hours on the phone with their tech support.
I don't know how the computer was even working because it had to be running at 65C-70C for at least 6 hours but no one even noticed that anything was wrong as it was chugging away just as it always does.
Then I saw my good old dual P-Pro server with an NT4.0 blue screen and I almost [-peep-] a brick. It looked to me to have suffered irreparable damage.
I guess I will have to check my server room every morning instead of only checking by accident like I did this morning. I usually do my work with PCAnywhere if I need to work on the servers, but what saved me was that I needed something physically from the room.
Might be a good idea to invest in a temperature alarm for the room and hook it into your company's fire alarm system. While it wouldn't have to trigger a real fire alarm, it would certainly make someone notice. I don't think this would be particularly expensive, especially compared to the costs of the tech support nightmare you might have if this happened again.
One piece of software I saw recently sends an SMS mobile phone text message via the Internet when the computer gets too hot. Here in the UK, mobile coverage is 99%, so it could reach me (or a coworker) wherever we may be. I can't remember the name, but if you're interested, I could find it...
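For anyone who wants to roll their own in the meantime, here is a rough sketch of the alerting logic in Python. It is not the software mentioned above. The 65C and 70C values are the BIOS thresholds quoted earlier in the thread; the names `classify`, `check_and_notify`, and the `notify` callback are made up for illustration, and how you actually read the temperature and deliver the message (SMS gateway, pager, email) is left out entirely.

```python
from typing import Callable

ALARM_C = 65.0     # BIOS alarm threshold quoted earlier in the thread
SHUTDOWN_C = 70.0  # BIOS shutdown threshold quoted earlier in the thread


def classify(temp_c: float) -> str:
    """Map a temperature reading to 'ok', 'alarm', or 'shutdown'."""
    if temp_c >= SHUTDOWN_C:
        return "shutdown"
    if temp_c >= ALARM_C:
        return "alarm"
    return "ok"


def check_and_notify(temp_c: float, last_state: str,
                     notify: Callable[[str], None]) -> str:
    """Call notify() only when the state changes to a non-ok state.

    notify is a hypothetical hook: plug in whatever sends the SMS or page.
    Returns the new state so the caller can pass it back in next time.
    """
    state = classify(temp_c)
    if state != "ok" and state != last_state:
        notify(f"Server room temperature {temp_c:.1f}C: {state}")
    return state


if __name__ == "__main__":
    # Made-up readings standing in for a real sensor.
    state = "ok"
    for reading in [40, 66, 67, 71, 50]:
        state = check_and_notify(reading, state, print)
```

Notifying only on a state change keeps you from getting a text every polling interval while the room stays hot.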
Advice I often hear is to keep all your backup servers (I presume you have these!) in a separate physical location, several rooms away with a separate air conditioning system. This way you'll be OK even if someone explodes a grenade in your primary server room.
It's available in almost all Intel server platforms as PEP - Platform Event Paging - it can page or send an SMS msg to the service engineer in such events.
Actually, I guess any board could be configured to do this by adding such code to the BIOS and keeping one COM port and a modem always available.
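To give an idea of what "such code" would have to send down the COM port, here is a small Python sketch. It assumes a GSM modem that accepts the standard text-mode SMS AT commands (`AT+CMGF`, `AT+CMGS`); the thread doesn't say what kind of modem is attached, and this is not how PEP itself works. The function name and the phone number are placeholders, and the sketch only builds the strings; it does not open a serial port.

```python
from typing import List

CTRL_Z = "\x1a"  # terminates the message body in text-mode SMS


def sms_command_sequence(number: str, text: str) -> List[str]:
    """Return the strings to write to the modem, in order.

    Assumes a GSM modem in text mode; a real sender would also wait for
    the modem's response after each step before writing the next one.
    """
    return [
        "AT+CMGF=1\r",             # select SMS text mode
        f'AT+CMGS="{number}"\r',   # start a message to this number
        text + CTRL_Z,             # message body, ended with Ctrl-Z
    ]


if __name__ == "__main__":
    # Placeholder number and message, for illustration only.
    for chunk in sms_command_sequence("+440000000000", "CPU temp alarm"):
        print(repr(chunk))
```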
die-hard fans don't have heat-sinks!
With the new P4 systems, if this had happened, the systems would not crash, but would merely slow down to lower the operating temperature, and give you a warning. A complete system halt is better than frying the CPU, but it still doesn't help when your database is corrupted and your data is lost.
-Raystonn
= The views stated herein are my personal views, and not necessarily the views of my employer. =