
S.M.A.R.T. and false positives

November 6, 2009 10:07:55 AM

Hi there,

Last January I bought a 1.5 TB Seagate Barracuda (ST31500341AS). In the last few weeks I've been having trouble with it and I'm rather confused because I get conflicting information.

I started getting "delayed write failed" errors in Windows XP, so I tried the steps suggested in the MS support knowledge base. I did all of the following:

- disabled write caching on the disk

- set out to increase the number of page table entries in the registry, but realised that the number was already high enough

- replaced the SATA cable

- tried to connect the drive to a different SATA channel

The only thing that *reduces* the frequency of errors is disabling write caching on the disk, but they still happen from time to time. I never got any S.M.A.R.T. warning at boot.

I tried Seagate's test suite, SeaTools for Windows, and the disk passed the Long Generic test without reporting any issues (the report is rather laconic though: it simply says the disk passed the test and offers very few details, basically claiming that everything is OK).

Then last week I upgraded to Ubuntu 9.10 (I have a dual boot machine obviously) and the new disk monitoring tool that ships with it, Palimpsest, told me that the "Disk has many bad sectors" (466 reallocated sectors, normalized 89, worst 89, threshold 36).
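
For anyone puzzled by the numbers in parentheses, here is a minimal Python sketch of how the normalized value relates to the threshold. It assumes the usual SMART convention that an attribute only counts as failed once its normalized value drops to or below the threshold; the function name is made up for illustration.

```python
def attribute_failed(normalized: int, threshold: int) -> bool:
    """Usual SMART convention: failed when the normalized value <= threshold.

    A threshold of 0 means the attribute can never trip.
    """
    return threshold > 0 and normalized <= threshold


# Reallocated sector count as reported by Palimpsest in this post
raw, normalized, worst, threshold = 466, 89, 89, 36

print(attribute_failed(normalized, threshold))  # False
print(normalized - threshold)                   # 53 points of headroom left
```

So a drive can have hundreds of reallocated sectors and still "pass", which is probably why the overall self-assessment stays quiet while Palimpsest warns on the raw count.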

I ran more tests:

- I repeated the Seagate test, it still didn't report any errors

- Smartctl 5.38 for Linux and HDDScan 3.2 for Windows report the exact same information as Palimpsest (466 reallocated sectors)

- HDD regenerator 1.71 for Windows says everything is OK (no errors at all)

- Spinrite 6 dies with an unrecoverable error after a couple of hours (but apparently this is a known issue of Spinrite, nothing to do with errors on disk)

In all this, I keep getting disk write errors (but only in Windows, and even there the only program that triggers them seems to be eMule) and the reallocated sectors count reported by Palimpsest has risen to 471.

Perhaps I should mention that my computer has 1 other disk that works perfectly (it's an old 300 GB Western Digital where both OSs reside).

Does anyone know which program I should trust on this, or can suggest anything else I could try?

I considered returning the disk, but I'm not sure they'll accept it since SeaTools doesn't report any problems.

Thanks for reading this very long post, any help will be very much appreciated.

Cheers,
DE


a c 127 G Storage
November 6, 2009 11:03:24 AM

Download HDTune and post a screenshot of the SMART tab, if you will.
a c 127 G Storage
November 6, 2009 11:04:25 AM

In Linux/Ubuntu, you can also post the text output of:

sudo smartctl -a /dev/sda
November 6, 2009 11:18:27 AM

Thanks sub mesa, here it is:

======================================

smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: ST31500341AS
Serial Number: 9VS0S1Q1
Firmware Version: LC1A
User Capacity: 1,500,301,910,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Nov 6 14:13:50 2009 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 617) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103b) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   099   006    Pre-fail  Always       -       53037308
  3 Spin_Up_Time            0x0003   090   089   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       211
  5 Reallocated_Sector_Ct   0x0033   089   089   036    Pre-fail  Always       -       471
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       98010881
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5545
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       216
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       4295032836
189 High_Fly_Writes         0x003a   059   059   000    Old_age   Always       -       41
190 Airflow_Temperature_Cel 0x0022   054   039   045    Old_age   Always   In_the_past 46 (35 174 47 40)
194 Temperature_Celsius     0x0022   046   061   000    Old_age   Always       -       46 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   044   031   000    Old_age   Always       -       53037308
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5480         -
# 2  Extended offline    Completed without error       00%      5464         -
# 3  Short offline       Completed without error       00%      5460         -
# 4  Short offline       Completed without error       00%      5354         -
# 5  Short offline       Completed without error       00%      5169         -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
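
If you want to pick the attribute rows out of a dump like this one rather than read it by eye, something along these lines should work. It is only a sketch: it assumes the ten-column layout shown above (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw), and `SmartAttr` and `parse_attributes` are illustrative names, not part of smartmontools.

```python
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw: str


def parse_attributes(text: str) -> List[SmartAttr]:
    """Extract attribute rows from pasted `smartctl -a` output."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, carry a hex flag in the
        # third column and have at least ten whitespace-separated columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        attrs.append(SmartAttr(
            attr_id=int(fields[0]),
            name=fields[1],
            value=int(fields[3]),
            worst=int(fields[4]),
            thresh=int(fields[5]),
            when_failed=fields[8],
            raw=" ".join(fields[9:]),  # raw value may contain spaces
        ))
    return attrs


# Three rows copied from the output in this post
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 089 089 036 Pre-fail Always - 471
190 Airflow_Temperature_Cel 0x0022 054 039 045 Old_age Always In_the_past 46 (35 174 47 40)
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

attrs = parse_attributes(SAMPLE)
for attr in attrs:
    if attr.when_failed != "-":
        print(attr.name, attr.when_failed)
```

On the output above, the only row with a WHEN_FAILED entry is 190 Airflow_Temperature_Cel (worst 039 against a threshold of 045), which is presumably what the "marginal Attributes" hint near the top refers to.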
a c 127 G Storage
November 6, 2009 11:41:24 AM

No obvious imminent failures. But the Reallocated Sectors count (471) is pretty high, and it's generally a bad sign if it rises. This can cause delayed write errors. The number of bad sectors on your drive appears to be increasing.

If you can, use Spinrite on another PC with a different BIOS to test the drive. Spinrite is pretty good at these things.

So I do think your drive has some issues, but it's not totally failing. The cable appears to be fine, or else you would have seen UDMA CRC errors in the SMART logs. It appears to be a hard drive surface issue. Make sure anything on this drive is properly backed up!
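
The reasoning above can be written down as a rough rule of thumb. The sketch below is just that, a heuristic and not a diagnosis: the function name and the wording of the hints are invented for illustration, and the numbers are the raw values reported earlier in this thread.

```python
def triage(previous_raw: dict, current_raw: dict) -> list:
    """Compare two snapshots of SMART raw values and return plain-text hints."""
    hints = []
    before = previous_raw.get("Reallocated_Sector_Ct", 0)
    after = current_raw.get("Reallocated_Sector_Ct", 0)
    if after > before:
        hints.append(
            f"reallocated sectors rising ({before} -> {after}): suspect the disk surface"
        )
    if current_raw.get("Current_Pending_Sector", 0) > 0:
        hints.append("pending sectors present: more reallocations may follow")
    if current_raw.get("UDMA_CRC_Error_Count", 0) > 0:
        hints.append("CRC errors logged: suspect the cable or connector")
    return hints


# First Palimpsest reading, then the smartctl output posted above
earlier = {"Reallocated_Sector_Ct": 466}
latest = {
    "Reallocated_Sector_Ct": 471,
    "Current_Pending_Sector": 0,
    "UDMA_CRC_Error_Count": 0,
}

for hint in triage(earlier, latest):
    print(hint)
```

A clean result for the cable only means no CRC errors were counted; it says nothing about the controller or the power supply.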
November 6, 2009 12:33:58 PM

OK, I'll try putting the drive in an external enclosure, connect it via USB to my laptop and run Spinrite from there. Let's see what happens.

Thanks a lot for the feedback!
a c 415 G Storage
November 6, 2009 8:08:37 PM

From those symptoms and the SMART data, I'd say your drive is failing. There probably isn't anything you can do about it. In these circumstances, my reaction would be to move as much data as I could from that drive onto a new one and then trash it.
November 6, 2009 10:38:13 PM

You're probably right sminlal, I think I'll call the store and see if I can return the disk (it's just 10 months old and the warranty is 5 years).

Thanks for the feedback.
November 9, 2009 7:26:08 AM

A funny thing just happened (well, not that funny actually): I replaced the failing disk with a slightly older 500 GB WD. It seemed to be working fine until 5 minutes ago, when I got another delayed write fail. I had verified the disk with Palimpsest after installation and everything was in order.

It looks like maybe the errors don't depend on the disk after all, I have 2 suspects at this point:

- bad controller

- I read somewhere that insufficient power can cause this kind of problem, so maybe that's it

Any suggestions?
a c 127 G Storage
November 9, 2009 7:28:52 AM

Hm... Could you replace the power supply and see if the problem persists?
November 9, 2009 9:57:27 AM

Nope, I don't have a spare.

But I've been thinking that if the power supply was undersized, the problems would have started as soon as I added the second hard disk 10 months ago and not now. Besides, I'm not really sure how to calculate the power requirements for a computer.

I just finished checking the "new" disk using HDDScan's Extended test (I wanted to see if this last delayed write fail was caused by a bad sector) but it didn't report any problems.

Anything else I could try?
a c 127 G Storage
November 9, 2009 9:59:44 AM

Maybe it's just that voltages aren't stable, or that the power connector to the HDD is low quality. Only tests can confirm this.

You can always ask a friend to switch your power supply with his.
November 9, 2009 10:13:51 AM

I'll figure something out, thanks a lot again sub mesa!
a c 415 G Storage
November 9, 2009 3:17:37 PM

Please keep us posted with what you find. I can imagine a marginal power supply might result in voltage drops as the drive draws extra power during seeks, so I'd be very interested to hear if that's really the cause.
November 9, 2009 4:14:36 PM

I'll keep you posted. First of all I'll try to disconnect the second drive and see if errors stop occurring.

Your comment about the extra power drawn during seeks got me thinking: maybe the reason the errors are happening only now is that the drive is almost full and slightly fragmented, therefore there's bound to be an increased number of seeks to write large files.
a c 415 G Storage
November 9, 2009 6:47:06 PM

To be honest, in the overall scheme of things the power draw of a seek is pretty small - but if a drive that's writing data is the thing that's most sensitive to low voltage then it might be the straw that broke the camel's back, so to speak.
November 9, 2009 11:22:11 PM

Mmmm, I guess that could happen but I don't think that's the case here: I wanted to move everything off the failing disk so I tried connecting one additional disk to the system (for a grand total of 3 internal disks): I was able to move everything without getting even a single error (but it should be noted that I was just reading from the failing disk).

So I suppose this means the power supply is adequate. And the computer is connected to an APC UPS which should rule out power surges too.

Maybe the single write error I got on the other disk was just a coincidence (but who am I kidding, what are the odds?)

Anyway, I'll see if it happens again.
November 19, 2009 8:30:44 PM

I decided to return the disk to Seagate and I just got it back: they sent me a refurbished disk which appears to be in perfect condition.

I plugged it back in, rebooted into Windows (I've been using Linux almost exclusively these last few days) and voilà: after 10 minutes I got another error (on the primary disk, not the one I just got back).

Quote:

Event Type: Error
Event Source: Ntfs
Event Category: Disk
Event ID: 55
Date: 19/11/2009
Time: 22.51.42
User: N/A
Computer: MYCOMPUTER
Description:
The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume E:.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 09 00 00 00 02 00 4e 00 ......N.
0008: 02 00 00 00 37 00 04 c0 ....7..
0010: 00 00 00 00 32 00 00 c0 ....2..
0018: 68 00 00 00 00 00 00 00 h.......
0020: 00 00 00 00 00 00 00 00 ........
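
The Data block can be decoded a little further. This sketch assumes the dump is a sequence of little-endian 32-bit values, which is how Windows event data is normally laid out; the helper name is made up, and the interpretation of the individual values is a best guess rather than documented fact.

```python
import struct

# Hex columns copied from the event above (ASCII column left out)
DUMP = """\
0000: 09 00 00 00 02 00 4e 00
0008: 02 00 00 00 37 00 04 c0
0010: 00 00 00 00 32 00 00 c0
0018: 68 00 00 00 00 00 00 00
0020: 00 00 00 00 00 00 00 00
"""


def dump_to_bytes(dump: str) -> bytes:
    """Strip the offset prefix from each line and join the hex bytes."""
    data = bytearray()
    for line in dump.splitlines():
        _, _, hex_part = line.partition(":")
        data.extend(bytes.fromhex(hex_part.strip()))
    return bytes(data)


data = dump_to_bytes(DUMP)
dwords = struct.unpack(f"<{len(data) // 4}I", data)

for offset, dword in zip(range(0, len(data), 4), dwords):
    if dword:
        print(f"{offset:04x}: 0x{dword:08x}")

# The low 16 bits of the value at offset 0x0c match the Event ID
print(dwords[3] & 0xFFFF)  # 55
```

The value at offset 0x14, 0xc0000032, looks like the NTSTATUS code STATUS_DISK_CORRUPT_ERROR, which would be consistent with the description text but doesn't say anything about the underlying cause.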


I ran chkdsk and booted back into Linux just to be safe, since it never reported any errors, at least as far as I can tell. Where would write errors be logged? I tried examining the logs like this:

grep disk syslog
grep sata syslog
grep error syslog
gunzip -c syslog*.gz | grep disk
gunzip -c syslog*.gz | grep sata
gunzip -c syslog*.gz | grep error

but I didn't see anything suspicious.
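
The greps above are case-sensitive and look for the words "disk", "sata" and "error", which may not match how the kernel words its messages. Below is a rough Python equivalent that scans the current and rotated logs in one go. The search pattern is an assumption: kernel ATA messages are usually prefixed with something like `ata3.00:` and failed requests usually mention `I/O error`, but the exact wording depends on the kernel version. `scan_logs` and `open_log` are illustrative names.

```python
import gzip
import re
from pathlib import Path

# Assumed pattern: libata port prefixes ("ata1:", "ata3.00:") or "I/O error"
PATTERN = re.compile(r"\bata\d+(?:\.\d+)?: |I/O error", re.IGNORECASE)


def open_log(path: Path):
    """Open a plain or gzip-compressed log file as text."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", errors="replace")
    return open(path, "rt", errors="replace")


def scan_logs(log_dir: str, stem: str = "syslog"):
    """Return (file name, line) pairs for every matching log line."""
    hits = []
    for path in sorted(Path(log_dir).glob(stem + "*")):
        with open_log(path) as handle:
            for line in handle:
                if PATTERN.search(line):
                    hits.append((path.name, line.rstrip("\n")))
    return hits


# Typical use on Ubuntu (may need root to read the logs):
# for name, line in scan_logs("/var/log"):
#     print(name, line)
```

If this comes back empty too, the kernel most likely never saw a failed request, which would fit with the errors only showing up under Windows.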

Disk access *is* pretty slow but it might be related to this bug:

https://bugs.launchpad.net/ubuntu/+source/hdparm/+bug/2...

Honestly I don't know what else to try, I'm seriously considering buying a new computer.
December 15, 2009 12:32:34 PM

One last note to close this thread: I bought a new motherboard, CPU, VGA, PSU and RAM (so basically a new computer except for the disks). I've been using it for a week and so far I've had no disk errors, so I guess it was either the old mobo (a 10-month-old Asrock 939N68PV in case you're wondering) or the PSU.

Thanks again to sub mesa and sminlal for the feedback.