QNAP NAS Hard Drive SMART errors

nmp

Honorable
Oct 11, 2013
8
0
10,510
Hi,

I have a QNAP TS-639. Recently I noticed some 'warnings' from storage manager. I've dumped out the SMART errors from the disks and notice that I'm now seeing two separate errors across three disks.

Disks 2 and 4 are marked as 'Warning' in QNAP storage manager for the below errors:

Reallocated_Sector_Ct with raw values of 1 and 10.

Disks 4 and 5 are both marked with a status of 'Good', but show the following values:

Raw_Read_Error_Rate of 19738350 and 13047954.

See attached SMART report, and screenshot of the QNAP Storage Manager.

Should I assume that these disks are facing imminent failure and swap them out, or can these errors be safely ignored?

Thanks all

SMART Values

QNAP Storage Manager Overview:
https://www.dropbox.com/s/kmw91gwrl7r63oo/overview.jpg?dl=0

Disk 2 Overview
https://www.dropbox.com/s/a1da35g0qw2xz2q/error_disk2.jpg?dl=0

Disk 2 Detail
https://www.dropbox.com/s/ddoowmsvz61vemo/errordetail_disk2.jpg?dl=0

Disk 4 Overview
https://www.dropbox.com/s/ybx9augrov178ca/error_disk4.jpg?dl=0

Disk 4 Detail
https://www.dropbox.com/s/34isvesm922hl27/errordetail_disk4.jpg?dl=0

Disk 5 Detail
https://www.dropbox.com/s/fuvg1rjse6dmwka/errordetail_disk5.jpg?dl=0
 
rkzhao

Respectable
Mar 8, 2016
183
1
1,860
I'm at work so I can't open those links but from what you listed, it doesn't sound too bad.

The reallocated sector count is pretty much the number of bad sectors that have been detected and then replaced with a spare. Any non-zero number will likely show up as a warning because it means that an error has occurred. It is a good idea to monitor this value on the drives. If it continues to increase, then the drive is certainly failing. If it doesn't increase, then the drive isn't generating any new errors.

Something else to go along with the Reallocated Sector count is the Pending Sector count. That's the number of bad sectors that have been detected but not yet reallocated. This number should go back down to zero once all bad sectors are reallocated. I'm guessing yours is at zero. A failing drive will typically have pending sectors as well as a few reallocated sectors.
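To make the "monitor it and see whether it grows" advice concrete, here is a minimal Python sketch. It assumes you have saved two SMART snapshots as plain dicts of attribute name to raw value; the function name `assess_drive` and the three verdict strings are made up for illustration and are not something QNAP or smartmontools provides.

```python
def assess_drive(previous, current):
    """Compare two SMART snapshots (dicts of attribute name -> raw value).

    Hypothetical helper: the verdict strings are illustrative only.
    """
    realloc_before = previous.get("Reallocated_Sector_Ct", 0)
    realloc_now = current.get("Reallocated_Sector_Ct", 0)
    pending_now = current.get("Current_Pending_Sector", 0)

    # New reallocations or sectors still waiting to be reallocated
    # mean the drive is generating fresh errors.
    if realloc_now > realloc_before or pending_now > 0:
        return "getting worse"
    # Old reallocations that are not growing: keep an eye on it.
    if realloc_now > 0:
        return "stable, keep monitoring"
    return "clean"


# Example with the Bay 4 numbers from this thread, assuming the
# count stays at 10 between two readings.
earlier = {"Reallocated_Sector_Ct": 10, "Current_Pending_Sector": 0}
later = {"Reallocated_Sector_Ct": 10, "Current_Pending_Sector": 0}
print(assess_drive(earlier, later))
```

The point of comparing snapshots rather than looking at a single reading is that a non-zero count by itself only says an error happened at some point, not that errors are still happening.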

For things like Read Error Rate, Seek Error Rate, and a few others like temperature, you have to decode the bytes in the raw value. I believe the error rates are vendor-specific, so there is no standard way to decode them. Don't worry about it.
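As a rough illustration of what "decode the bytes" means, the Python sketch below just splits a raw value into its individual bytes. It assumes the raw value is a 6-byte (48-bit) field, and the helper names `raw_to_bytes` and `hex_bytes` are made up for this example. What each byte actually means is up to the vendor, so treat any interpretation as a guess.

```python
def raw_to_bytes(raw, width=6):
    """Return the raw value as a list of `width` bytes, most significant first."""
    return list(raw.to_bytes(width, byteorder="big"))


def hex_bytes(raw, width=6):
    """Format the raw value as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in raw_to_bytes(raw, width))


# The Raw_Read_Error_Rate reported for one of the disks in the first post.
print(hex_bytes(19738350))

# A packed temperature-style value: only the lowest byte is looked at here,
# and reading it as degrees Celsius is an assumption.
packed = 655884325
print(hex_bytes(packed), "-> lowest byte:", raw_to_bytes(packed)[-1])
```

Seen this way, a huge decimal number is often just several small counters sitting next to each other.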
 

nmp

Thank you for your detailed and helpful reply. I completely understand the point about 'Reallocated_Sector_Ct' and am happy just to monitor that. However, I'm a bit more concerned about the 'Raw_Read_Error_Rate', which does seem to be increasing. I have been monitoring the disks over the past fortnight, and from the first reading to the second, the error count has increased. I also took a reading today, one day after posting this thread. See the results below.

Disk #   Reading #1   Reading #2   Reading #3
3        112172837    13047954     13985454
4        182934429    19738350     20297045

It seems to me that the errors are increasing. I have pasted the raw SMART output for the two disks below, FYI. I'm going out on a limb here, but should I start looking to replace these disks with some urgency?

==========[ BAY 4, Seagate ST31000333AS 953869, 6TE0CZVM ]
001 Raw_Read_Error_Rate 139857675 117 099 006 OK
003 Spin_Up_Time 0 100 094 000 OK
004 Start_Stop_Count 441 100 100 020 OK
005 Reallocated_Sector_Ct 10 100 100 036 ABNORMAL
007 Seek_Error_Rate 113306482 080 060 030 OK
009 Power_On_Hours 4821 095 095 000 OK
010 Spin_Retry_Count 0 100 100 097 OK
012 Power_Cycle_Count 281 100 100 020 OK
184 End_To_End_Error 0 100 100 099 OK
187 Uncorrectable_Errors 0 100 100 000 OK
188 Unknown_Attribute 0 100 100 000 OK
189 Unknown_Attribute 26 074 074 000 OK
190 Temperature_Celsius 655884325 063 058 045 OK
194 Temperature_Celsius 37 037 042 000 OK
195 Hardware_ECC_Recovered 139857675 045 028 000 OK
197 Current_Pending_Sector 0 100 100 000 OK
198 Offline_Uncorrectable 0 100 100 000 OK
199 UDMA_CRC_Error_Count 0 200 200 000 OK
240 Head_Flying_Hours 124605591196027 100 253 000 OK
241 Lifetime_Writes_From_Host 3893997598 100 253 000 OK
242 Lifetime_Reads_To_Host 3323790904 100 253 000 OK

==========[ BAY 5, Seagate ST31000333AS 953869, 5TE0A29L ]
001 Raw_Read_Error_Rate 202972507 119 099 006 OK
003 Spin_Up_Time 0 100 094 000 OK
004 Start_Stop_Count 404 100 100 020 OK
005 Reallocated_Sector_Ct 0 100 100 036 OK
007 Seek_Error_Rate 111970215 080 060 030 OK
009 Power_On_Hours 4303 096 096 000 OK
010 Spin_Retry_Count 0 100 100 097 OK
012 Power_Cycle_Count 274 100 100 020 OK
184 End_To_End_Error 0 100 100 099 OK
187 Uncorrectable_Errors 0 100 100 000 OK
188 Unknown_Attribute 0 100 100 000 OK
189 Unknown_Attribute 472 001 001 000 OK
190 Temperature_Celsius 655884325 063 058 045 OK
194 Temperature_Celsius 37 037 042 000 OK
195 Hardware_ECC_Recovered 202972507 047 030 000 OK
197 Current_Pending_Sector 0 100 100 000 OK
198 Offline_Uncorrectable 0 100 100 000 OK
199 UDMA_CRC_Error_Count 0 200 200 000 OK
240 Head_Flying_Hours 156229435396083 100 253 000 OK
241 Lifetime_Writes_From_Host 720411425 100 253 000 OK
242 Lifetime_Reads_To_Host 4154607535 100 253 000 OK



 

rkzhao

What I mean about Read Error Rate is that there are multiple pieces of information encoded into the byte string. The decimal representation of that number is not really meaningful. Also, the number of reads is likely part of the value, so the raw value should continue to grow as the drive is used. It's NOT indicative of a problem.

To go into a bit more detail, for your drives:
139857675 = 00 00 08 56 0F 0B
202972507 = 00 00 0C 19 1D 5B

I don't remember if the exact decoding is documented anywhere, but I think I remember reading that, generally, the upper 2 bytes are a count of errors and the lower 4 bytes are a count of reads. The actual error rate can then be computed by dividing the two numbers. I believe this is more or less the standard way to report things like Read or Seek Error Rates in SMART, so for the sake of argument, let's assume that this is how the value is encoded.

That means for the Bay 4 drive, 139857675 = 00 00 08 56 0F 0B, you have 0x0000 errors, out of 0x08560F0B = 139857675 reads (the unit is probably some multiple of 512 byte sectors). Similarly, Bay 5 would also have zero errors.

Now again, I'm not really sure if that's the exact way to decode the raw value for a Seagate drive. It's just an educated guess, but it seems to line up with your drive usage, since Bay 5 has a higher number of Lifetime Reads To Host and therefore a higher raw value for the Read Error Rate.
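If you want to try the same arithmetic yourself, here is a small Python sketch of the split described above. It assumes a 48-bit raw value with the error count in the upper 16 bits and the read count in the lower 32 bits. That layout is the educated guess from this post, not a documented Seagate format, and `decode_error_rate` is just a name made up for the example.

```python
def decode_error_rate(raw):
    """Split a 48-bit raw value into (errors, reads, errors per read).

    Assumed layout (a guess, not a documented format):
    upper 16 bits = error count, lower 32 bits = read count.
    """
    errors = (raw >> 32) & 0xFFFF
    reads = raw & 0xFFFFFFFF
    rate = errors / reads if reads else 0.0
    return errors, reads, rate


# Raw_Read_Error_Rate values from the SMART output posted above.
for bay, raw in (("Bay 4", 139857675), ("Bay 5", 202972507)):
    errors, reads, rate = decode_error_rate(raw)
    print(f"{bay}: raw=0x{raw:012X} errors={errors} reads={reads} rate={rate}")
```

Under this assumption both drives decode to zero errors, and the large decimal number is entirely the read counter.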
 
Solution

nmp

Thank you so much for the detailed reply, and apologies for the slow response. You've explained that really well and completely put my mind at ease. I took the drives out of the NAS today and installed the Seagate Tools on my PC. I'm now running through the Seagate diagnostics, which so far have come back as 'Passed', so it all does indeed seem to correlate with what you are saying.

Once again, many thanks for your help.