Mandriva Server Issues - Possible dead drive?

NightHawk21

Distinguished
Aug 11, 2011
15
0
18,510
We have a small server for our lab (Mandriva 2009 spring), which we use to store our workbooks and access them in different labs. For about the last 1-2 weeks there has been an occasional clicking coming from inside the server (typically 2 clicks then a break).

Sometime around 9am on Tuesday morning, the server could no longer be mapped in Windows and could no longer be connected as a Windows share on our Ubuntu computers. The server could still be reached using ssh in the terminal, or through the "Connect to Server" menu if the type was set to ssh.

The server itself is connected to a screen and keyboard so it can be logged into directly if necessary. I did this, but instead of being presented with a desktop, I saw a blue and white checkerboard background with large red X's on the top of the screen. The red X's weren't clickable, but lit up when you scrolled over them.

I pulled down the majority of the more important files, and attempted to reboot the system in a virtual terminal (opened using ctrl+alt+f1), using the reboot command. The server started shutting down, but hung on a step that said: "INIT: no more processes left in this runlevel." I left it here for a while, tried rebooting from an ssh terminal I had on my personal computer (no change), but ultimately I powered it down using the power button after it was clearly hung.

On reboot the system appears to boot normally until after the Mandriva loading screen, where it crashes on a black screen with white text (last few lines transcribed below):

waiting for root device /dev/sda1 to appear (timeout 1min)
Creating root device
Mounting root filesystem
mount: error mounting /dev/root on /sysroot as ext4: Invalid argument
Setting up other filesystems.
setuproot: moving /dev failed: No such file or directory
setuproot: error mounting /sys: No such file or directory
Switching to new root and running init.
switchroot: /dev does not exist in new root
Booting has failed.
Kernel panic - not syncing: Attempted to kill init!

The server itself has 2 internal drives, which are copies of each other. It is also connected to a backup. I entered the BIOS and tried booting off the second drive, but it crashed on the same screen, except that the errors had to do with UUIDs (transcribed below):

waiting for root device UUID=1772619c6-2a21-4428-a29f-93f607cc28db to appear (timeout 1 min)
Could not find resume device (UUID=e27b085d-ef7c-42cf-94e6-6c76b73f15bd)
Could not resolve resume device (UUID=e27b085d-ef7c-42cf-94e6-6c76b73f15bd)
Creating root device
Mounting root filesystem
mount: could not find filesystem '/dev/root'
Setting up other filesystems.
setuproot: moving /dev failed: No such file or directory
setuproot: moving /proc failed: No such file or dir
setuproot: error mounting /sys: No such file or directory
Switching to new root and running init.
switchroot: /dev does not exist in new root
Booting has failed.
Kernel panic - not syncing: Attempted to kill init!

I've also tried running the system on the original drive with all the kernel options available (default, safe settings, no acpi, no local apic) to no avail.

I then booted the server using a live disc and tried to access the drives through the UI. Clicking on either drive, however, results in an error that reads: "An error occurred while accessing 'Volume (ext4)', the system responded: org.freedesktop.Hal.Device.Volume.PermissionDenied: Close Device /dev/sda1/ is listed in /etc/fstab. Refusing to mount." Both drives give the same error (except sda is changed to sdb for the other drive). The mount command also did not work for me.

I've managed to look at both drives using fdisk and it shows that both are exactly the same as far as partitions are concerned.

At this point I'm a little lost and could honestly use any help or suggestions on how to proceed.

P.S. I should also add that on reboot (and on every subsequent boot) the server displays a message saying that a problem has been detected with the hard drive and that details can be found in the event log.
 
Solution
Boot off a LiveCD, switch to a console (Ctrl-Alt-F1), and from there execute "fdisk -l". Note the devices and partitions of your data drives, e.g. /dev/sda2 or /dev/hdc4.
Once you see that, try to mount the volume with "mount /dev/sda2 /mnt".
The system might also complain that the file systems are "dirty"; try "fsck -r /dev/sda2" to scan and fix the file system.
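
The steps above could be strung together roughly as follows. This is only a sketch under a few assumptions: the helper names run and rescue_steps are made up for illustration, /dev/sda2 and /mnt are placeholders to be replaced with whatever "fdisk -l" actually reports, and the read-only mount (-o ro) and the no-changes check (fsck -n) are cautious substitutions for the plain mount and "fsck -r" suggested above, not part of the original advice. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch of the rescue steps; nothing is executed unless DRY_RUN=0 is set.
# "run" and "rescue_steps" are hypothetical helper names.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

rescue_steps() {
    device="$1"               # e.g. /dev/sda2, taken from the fdisk output
    mountpoint="${2:-/mnt}"   # where to attach the partition

    # 1. List all disks and partitions to identify the data partition.
    run fdisk -l || return 1

    # 2. Try a read-only mount first so nothing is written to a suspect disk.
    if ! run mount -o ro "$device" "$mountpoint"; then
        # 3. If mounting fails, check the file system without modifying it.
        run fsck -n "$device"
    fi
}
```

Calling "rescue_steps /dev/sda2" prints the planned commands; running it as root with DRY_RUN=0 in the environment executes them. Once the data has been copied off, the repairing fsck from the answer can be run on the unmounted partition.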

NightHawk21

I checked the partitions on Friday, and each of the two drives has 3 (cylinder numbers might be a bit off, going off of memory here):
sda1: ~2600 cylinders (boot)
sda2: ~1000 cylinders (swap)
sda3: The rest (drives are 1.5TB).

Both drives are identical except that sda1 is marked as boot/active.

I'll try mounting each individual partition to see what works and running fsck and update tomorrow.
 

NightHawk21

First off let me just start off by saying thank you! (and a little sorry for the late reply - explanation below)

On Monday we were rearranging some furniture in the lab office to accommodate some more students who were joining the group. Following the move, our server went from booting and hanging (like in the original post) to not booting at all. I spent most of Monday and part of Tuesday (I needed to get a spare PSU) cleaning out the inside of the machine and confirming that the PSU had died. After getting a replacement PSU we managed to get back to the previous state (hence the late reply).

Once the system was booting again, I booted off the live disc and tried mounting each of the non-swap partitions. In both cases they returned the same error: "mount: unknown filesystem type 'ext4'." I figured I'd run fsck on the data and boot partitions of the primary drive to scan for problems before using the repair flag. The data partition returned a clean result, but the boot partition logged a few errors on the scan. On a whim I decided to reboot at this point, and got a different error on boot, which dropped me into a root shell and prompted me to run fsck again. I did this and, after answering yes to all the prompts, rebooted into a working system. I then logged in and rebooted a few times to make sure this wasn't a one-off, fully installed the PSU, and checked that the server could be mounted as a Windows share on the Windows machines.
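
For anyone who hits the same "unknown filesystem type 'ext4'" message: it typically means the kernel on the live disc has no ext4 driver registered, not that the partition itself is damaged. A rough way to check, and to do the scan-only fsck pass described above, might look like this. The helper fs_supported is a made-up name, and whether "modprobe ext4" works depends on whether the live disc ships that module at all (an assumption here).

```shell
#!/bin/sh
# Sketch: tell "kernel lacks the driver" apart from "partition is damaged".
# fs_supported is a hypothetical helper, not a standard tool.

# Succeeds if the given filesystem type appears in the kernel's list.
# The optional second argument lets the check read a copy of that list;
# it defaults to /proc/filesystems.
fs_supported() {
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

# Example use on the live disc (as root); /dev/sda1 is the boot partition
# mentioned in the thread:
#   fs_supported ext4 || modprobe ext4   # try loading the driver, if available
#   fsck -n /dev/sda1                    # report problems only, change nothing
```

If fs_supported still fails after the modprobe attempt, a newer live disc with ext4 support is probably the easier route than fighting the old one.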

Seriously, thanks again!