Interpreting smartctl output on a dying drive

Cassey

Honorable
Oct 23, 2013
Hi all. Had a disk drop out of my Linux RAID 6 group with an unrecoverable read error. No problem, did a "dd if=/dev/zero of=/dev/sde bs=512" to rewrite all the sectors and allow the disk to remap as needed. The command finished without problems.

"smartctl -a /dev/sde" shows the drive passing self-test.

"smartctl -A /dev/sde" output below:

smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.10.7-gentoo-r1] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   199   051    Pre-fail  Always       -       3727
  3 Spin_Up_Time            0x0003   150   144   021    Pre-fail  Always       -       7475
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       64
  5 Reallocated_Sector_Ct   0x0033   189   189   140    Pre-fail  Always       -       88
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   038   038   000    Old_age   Always       -       45789
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       52
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       781527
194 Temperature_Celsius     0x0022   103   094   000    Old_age   Always       -       47
196 Reallocated_Event_Count 0x0032   112   112   000    Old_age   Always       -       88
197 Current_Pending_Sector  0x0012   191   191   000    Old_age   Always       -       1100
198 Offline_Uncorrectable   0x0010   191   191   000    Old_age   Offline      -       1106
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   198   198   051    Old_age   Offline      -       206

As expected, the normalized Reallocated_Sector_Ct value (189) has started dropping toward its threshold (140).
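
For anyone who wants to check that margin mechanically, here is a minimal sketch that parses attribute rows in the layout shown above and reports how far VALUE sits above THRESH. The function names `parse_smart_attributes` and `headroom` are made up for illustration, and the parser assumes the ten-column format of this particular smartctl build; other drives or versions may need adjustments.

```python
def parse_smart_attributes(text):
    """Parse `smartctl -A` attribute rows into a dict keyed by attribute ID."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # banner and header lines are skipped by this check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],
        }
    return attrs


def headroom(attr):
    """Normalized points left before VALUE reaches THRESH."""
    return attr["value"] - attr["thresh"]


# Two rows copied from the /dev/sde output in this post.
sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 189 189 140 Pre-fail Always - 88
197 Current_Pending_Sector 0x0012 191 191 000 Old_age Always - 1100
"""

attrs = parse_smart_attributes(sample)
for attr_id, attr in sorted(attrs.items()):
    print(attr_id, attr["name"], "raw =", attr["raw"], "headroom =", headroom(attr))
```

Note that the normalized value can still be well above the threshold while the raw counters are already large, so the headroom figure is only one of the things worth looking at.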

After putting a new partition table on it and rebooting, I went to add it back into my RAID 6 group. That went fine, but rebuild times were VERY slow (around 200K/sec, i.e. 30+ days).
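
A rough sketch of the arithmetic behind that estimate is below. The drive capacity is not stated in this post, so the 1 TB figure is purely hypothetical, and the code assumes "200K/sec" means 200 KiB/s (1024-byte units) held constant for the whole rebuild.

```python
SECONDS_PER_DAY = 86_400


def rebuild_days(capacity_bytes, rate_bytes_per_sec):
    """Days needed to write capacity_bytes at a constant rate."""
    return capacity_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


def bytes_rebuilt(days, rate_bytes_per_sec):
    """Bytes written after the given number of days at a constant rate."""
    return days * SECONDS_PER_DAY * rate_bytes_per_sec


rate = 200 * 1024   # assumption: "200K/sec" is 200 KiB/s
capacity = 10**12   # hypothetical 1 TB member; the real size is not given

print(f"{rebuild_days(capacity, rate):.1f} days to rebuild the hypothetical 1 TB member")
print(f"{bytes_rebuilt(30, rate) / 1e9:.0f} GB rebuilt after 30 days")
```

At this rate, 30 days covers only a bit over half a terabyte, which is consistent with the "30+ days" figure for any drive of that size or larger.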

Ran "sar" to see what was going on and service times were well over 1.5 seconds, with utilization at 100%.

The drive is old, as are all of its brothers (1906 days to be exact - over 5 years). It has served me well, despite being in a case that isn't great for 8 drives (thus the temp being a bit high at 47C).

For comparison, here is the "smartctl -A" output for its twin, /dev/sdd, which is working just fine with normal service times:

smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.10.7-gentoo-r1] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   152   147   021    Pre-fail  Always       -       7358
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       55
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   038   038   000    Old_age   Always       -       45802
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       52
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       32
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       806692
194 Temperature_Celsius     0x0022   105   095   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

I'm just curious whether anything in the smartctl output might explain why it is performing so slowly. It's useless right now and has been failed out of the RAID group. Of course, if there are any suggestions on how to revitalize it, I'd sure love to hear them, since I don't like being down to one parity drive.
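
To make the side-by-side comparison of the two drives easier, here is a small sketch that lists only the error-type counters whose raw values differ. The helper names and the choice of attribute IDs in `ERROR_COUNTER_IDS` are assumptions made for this example, not an official list, and the parser again assumes the ten-column layout shown in this post with plain integer raw values.

```python
# Assumption: these IDs are treated as error counters for this comparison.
ERROR_COUNTER_IDS = (1, 5, 196, 197, 198, 199, 200)


def raw_values(text):
    """Map attribute ID -> (name, raw value string) from `smartctl -A` rows."""
    raws = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            raws[int(fields[0])] = (fields[1], fields[9])
    return raws


def compare_error_counters(suspect_text, healthy_text):
    """Return (id, name, suspect_raw, healthy_raw) where the raw counts differ."""
    suspect = raw_values(suspect_text)
    healthy = raw_values(healthy_text)
    report = []
    for attr_id in ERROR_COUNTER_IDS:
        if attr_id in suspect and attr_id in healthy:
            name, s_raw = suspect[attr_id]
            _, h_raw = healthy[attr_id]
            if int(s_raw) != int(h_raw):
                report.append((attr_id, name, int(s_raw), int(h_raw)))
    return report


# A few rows copied from the /dev/sde and /dev/sdd outputs in this post.
sde = """\
1 Raw_Read_Error_Rate 0x000f 200 199 051 Pre-fail Always - 3727
5 Reallocated_Sector_Ct 0x0033 189 189 140 Pre-fail Always - 88
197 Current_Pending_Sector 0x0012 191 191 000 Old_age Always - 1100
198 Offline_Uncorrectable 0x0010 191 191 000 Old_age Offline - 1106
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""
sdd = """\
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

report = compare_error_counters(sde, sdd)
for attr_id, name, s_raw, h_raw in report:
    print(f"{attr_id:>3} {name}: sde={s_raw} sdd={h_raw}")
```

Counters that match on both drives, such as UDMA_CRC_Error_Count here, are left out so that only the differences stand out.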

Also curious what I should be looking for to judge whether the other 6 drives in the system are likely to make it a few more months or years. Rather concerned about that, since 6 of the 7 drives in the RAID 6 group are all of the same vintage.

Thanks in advance!

Cassey
 
Solution
Reallocated sectors are moved to another area of the drive, so reads that should be sequential perform like random access instead. I suspect that is part of the problem. 206 multi-zone errors, 1106 offline uncorrectable sectors, 1100 pending sectors: this disk is in sad shape.

I wonder why you are reusing a failing drive. It's just going to fail again, and if another 2 drives fail during the long rebuild, say goodbye to your RAID and everything on it.

popatim

Titan
Moderator