
Server shutting down abruptly every day, power supplies seem to be damaged

Tags:
  • Power Supplies
  • Business Computing
  • Servers
June 30, 2014 6:08:56 AM

I have an aggravating issue with our terminal server (an HP ProLiant ML350 G6 running Windows Server 2008 R2). A few weeks ago it started to shut down suddenly about once a day. The only error in the Event Viewer is the critical Kernel-Power entry stating that the system rebooted without cleanly shutting down first, plus the pop-up at log-in stating that the "system shut down unexpectedly".

Our first thought was the power supply. We have an identical server, with a known good power supply, that we use for experimentation, so we swapped the two power supplies. At that point the experimental server started going down once a day, while the terminal server stopped having problems, so we thought the power supply was the problem.

Except now the terminal server has started to have the same problem again, so we have two servers that are shutting down every day. All three of our production servers are plugged into independent UPSs and are in the same location; the other two production servers have had no issues. Our experimental server is located off-site. We really don't have many ideas at this point, other than replacing the UPS the terminal server is using. We are reluctant to purchase a new power supply without knowing what caused the problem in the first place, because we don't want it to get damaged as the other two seem to have been. We had an electrician come and check the wall sockets, but he couldn't find any problems. Does anyone have any ideas?


June 30, 2014 6:15:24 AM

Do they shut down at about the same time each day? Sounds like the old case of the cleaner unplugging it for the vacuum...
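
One rough way to answer the "same time each day" question is to note the timestamps of the Kernel-Power entries and count how many fall in the same part of the day. The sketch below is only an illustration: the timestamps are made up, the function names are hypothetical, and the 60% threshold is an arbitrary assumption, not a rule.

```python
from collections import Counter
from datetime import datetime


def shutdowns_by_hour(timestamps):
    """Count unexpected shutdowns per hour of day (0-23)."""
    return Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for ts in timestamps
    )


def looks_scheduled(timestamps, share=0.6):
    """True if at least `share` of the events fall inside one two-hour window.

    The window spans an hour and the following hour (wrapping past midnight),
    so events at 17:58 and 18:02 are still counted together.
    `share` is an assumed threshold, chosen only for illustration.
    """
    hours = shutdowns_by_hour(timestamps)
    total = sum(hours.values())
    if total == 0:
        return False
    busiest = max(hours[h] + hours[(h + 1) % 24] for h in range(24))
    return busiest / total >= share


# Hypothetical timestamps, copied by hand from the event log.
sample = [
    "2014-06-23 18:02:11",
    "2014-06-24 17:58:40",
    "2014-06-25 18:05:03",
    "2014-06-26 03:14:00",
]
print(shutdowns_by_hour(sample).most_common(3))
print(looks_scheduled(sample))
```

If most events land in the same window, something on a schedule (a cleaner, a timer, a scheduled job) becomes a more likely suspect than a failing part.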
June 30, 2014 5:22:58 PM

That is very hard to say. Diagnostically speaking, a random power-off with only a Kernel-Power entry in the event log can be caused by a huge list of things, including both hardware and software issues. My first thought would also be the power supply. Standard ATX power supplies can be tested with relatively good accuracy using a cheap PSU tester, which catches the majority of common PSU failures. A server PSU, however, can't be connected to one of these as easily. About the only option here would be to purchase another PSU and try that.

However, as you say, that may very well not be the issue. It could also be a motherboard problem, which is much more difficult to determine. I also know of software issues that can cause this, including firmware or driver versions that need updating. If you contact HP about it, one of the main things they will ask for is the iLO system status or health status reports, and whether you have all the latest firmware for your hardware devices.

I do believe the ML350 G6 still has iLO capabilities, so have you tried logging in there to see whether any warnings or hardware-related errors are listed there or in the associated health reports? I recently had a newer ML310e G8 Version 2 server doing the same thing, except that it powered off at random intervals (sometimes twice a day, sometimes once a week). I checked the hardware logs using iLO, and while they did show some errors, none pointed to hardware actually causing the problem; they only recorded reactions to the failure (sudden power loss). However, iLO did flag one detected error: my iLO firmware was behind the most current recommended version. Sure enough, when I contacted HP for confirmation, they reported that certain older iLO firmware versions had been causing this exact issue and suggested upgrading the firmware.
July 2, 2014 3:34:56 AM

The Kernel-Power entry in the event log is a general one. It does not actually say that you have a problem in any electrical system; it only lets you know that the server was shut down or rebooted unexpectedly.
Since you tried another server and got the same issue, that just shows the problem is unlikely to be related to the server hardware. You do have a server hardware checkup utility, and an event log in the server BIOS.
I suggest you look at / do the following:
1. Check the server hardware log.
2. Run the server hardware checkup.
3. Check the UPS, and uninstall the UPS agent from the server (for testing purposes).
4. Check for recent software, driver, or OS upgrades (usually the cause of such issues).
5. Run antivirus / anti-malware scans.
6. Try to investigate what your users are running.
7. Try disabling any automatically running jobs.
8. Upgrade drivers.
9. Try isolating the server for 24 hours without a network connection, and/or leave it at the BIOS screen (OS not loaded).
10. Is iLO connected? Try disconnecting it.

Now run these tests one by one, not together, for better diagnostics.

You might want to contact HP support if you have a valid support contract.

Good luck.