HDD Data Diagnostics/chkdsk conflict

stevti

Mar 26, 2012
Hi everyone, this is my first post here so hello to you all,

I have a bit of an issue that has me completely stumped. It is regarding the results of all variations of chkdsk, /r, /f, volume label: /v /f etc and the results of the Western Digital Data Lifeguard disc utility. I apologise upfront if this first post is a bit lengthy, I'm unsure which parts of my saga might be relevant or not relevant so I'll start from the beginning....

Firstly, my PC is set up as follows...

Asus P5N-D motherboard,
Intel(R) Core 2 Quad CPU
Q9300 @ 2.50GHz
2.50 GHz, 3.25 GB of RAM
ATI HD Radeon 4350 Graphics Card

PC is connected via HDMI to my Onkyo TX-SR506 receiver and then on to my Toshiba 40inch LCD.

I have recently had to replace my old SATA Western Digital 320GB HDD as it was a few years old and had started causing problems on boot-up. I bought a used 400GB Western Digital unit off eBay (budget is tight so second-hand was the only option). My local PC man installed Windows XP Pro SP3 on my new drive for me, and recovered the old drive so that I could copy my old data, reformat it and use it as a secondary drive for storage. This setup worked fine for a couple of days but then started having problems on startup again: after the Windows logo screen it just stayed black...forever! After much searching of the Internet on my laptop for solutions and similar cases, I was torn between a display driver error and another HDD failure. I was also unable to boot into any of the 3 variations of safe mode; it would just hang on the list at mup.sys. The BIOS boot order was all correct: CD first, new primary drive second, and the older drive third.

Using my own Windows XP recovery disc, I ran fixmbr, which successfully wrote the new MBR. I got an error message (that I can't actually remember) from fixboot, and I got the message 'there is one or more unrecoverable errors' from chkdsk. Chkdsk /r also said 'there is one or more unrecoverable errors'. This led me to believe that there were in fact major issues with my 'new' drive. Using my laptop I downloaded the Western Digital Data Lifeguard utility and ran it on boot-up. The quick test threw back error code 0007, and the extended test threw back error code 0225, which says 'too many errors'. Please see here for the error codes and their meanings....

http://support.wdc.com/techinfo/general/errorcodes.asp

....the error code 0007 refers to a problem with the SMART monitoring. I get this message whether SMART is enabled OR disabled in the BIOS.

So there I was, 2am and sat staring at a seemingly completely knackered PC, again!! After only getting it back two days previous! I then remembered a thread I had read somewhere, where someone had said to try swapping the SATA cable with a new one, and I had a new one that my local PC man had installed to connect the old drive as a secondary unit. So I disconnected the old unit, used that new cable to connect the primary drive to the motherboard and boom! Booted up first time faultlessly, and noticeably quicker than before! I couldn't believe it! So my theory at that point was that it was a dodgy cable! Couldn't quite understand how I could get such serious errors from a HDD diagnostics utility because of a dodgy cable but hey, it worked so I was happy. So I went back to the PC shop and bought another new SATA cable to reconnect my older HDD, but as soon as I did, it wouldn't boot up again. Same characteristics as before: after the Windows logo screen it just hung on a black screen. So the problem wasn't the SATA cable, it was something to do with the old drive being connected. I ran the WD diagnostics on the old HDD after it had been reformatted and it came back completely clean, no errors or bad sectors found!

So, the system seems to work flawlessly without the old drive plugged in, but just out of interest I had to see whether, now it's all working fine without the old HDD, the results of chkdsk and its variations would be any different for the 'new' HDD. Yes they were: I ran all chkdsk variations and all came back clear, not one bad sector anywhere. I read on a Microsoft site that sometimes chkdsk will not find any errors on an NTFS drive, healthy or not, and to run chkdsk volume_label: /v /f from boot, which I did. This also returned no errors. So I then ran the WD Diagnostics utility again to make sure the drive is OK, but no, the error messages still appear: the quick test throws back the 0007 error code and the extended test throws me the 0225 code, meaning too many errors to recover from!!!

So chkdsk says I'm totally fine with no errors at all and WD Diagnostics says my drive is fit only for the bin, and yet the system seems to run perfectly with no issues at all now the old drive isn't plugged in!!

I am again very sorry for the long, rambling post. Can anyone enlighten me as to any possible reasons these messages are so conflicting?

Thanks in advance,

Steve
 

Paperdoc

Chkdsk and Data Lifeguard do different tests, and do them in different ways, so getting apparently inconsistent results is entirely possible.

I'd start with the Data Lifeguard testing system. Those tests are done using direct access to the hardware and do not really depend on the nature of the data on the HDD, nor even on the OS installed and running. The Quick tests check several easy things, and one of those is to check the SMART data stored on the controller board of the HDD unit itself. That board runs tests on the HDD all the time in the background, and writes any error messages to its own non-volatile memory for later retrieval. (By the way, enabling SMART Reporting in the BIOS Setup merely means that the BIOS will ask the HDD for this data and forward any error messages to your screen at boot time so you can see them. It does NOT wipe them out or make them unavailable to other error checkers. That is why Data Lifeguard can find and display them, no matter how the BIOS is set.) Your tests say that there are SMART error messages stored. The tool probably also shows you more detail about what the SMART errors are, but you have not reported them here. Usually this condition is a good reason to replace the HDD before more serious errors emerge, but not always. Sorry to say this, but maybe that's why this used HDD was being sold on eBay: the previous owner replaced one with SMART errors while it was still functional.

Running the Data Lifeguard Long diagnostics does a much more extensive job, and probably also generated a longer detailed list of specific errors on the drive. There could be lots of things. One of the simpler types is just a large number of "Bad Sectors" - ones that return read errors when accessed. If that's the case, if there are enough of these, the test ultimately will end with the code that it gave up because the number of such errors exceeded the pre-set limit. That may well be an indication that this disk is progressing from questionable reliability to not reliable at all, but maybe not. To some degree, that judgment is up to you - how much risk can you tolerate?

So, if your HDD has "Bad Sectors" according to Data Lifeguard, why does Chkdsk not report the same? Well, when your service guy installed the used HDD for you, one of the steps VERY likely was to do a Full Format on the unit before installing anything on it. That process runs a read/write test on EVERY sector of the HDD (well, every sector the disk allows it to see - see later). Any Sectors found to be defective are marked in Windows' own fault table so they are NEVER used by Windows. From then on, subsequent uses of the HDD by Windows do NOT ever try to use those Sectors, and that includes running Chkdsk. It will NOT re-find those Bad Sectors - it will only find any new ones not already marked as Bad.
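To make that difference concrete, here is a toy sketch in Python. It is not how NTFS, Format or Chkdsk are really implemented; the function names and cluster numbers are made up purely to illustrate the idea that a scan which trusts the filesystem's bad-cluster table reports only new faults, while a drive-level diagnostic sees everything.

```python
# Toy model only: NOT the real NTFS/chkdsk logic. All names are hypothetical.

def full_format(physically_bad):
    """Pretend Full Format: test every cluster and record the bad ones in a table."""
    return set(physically_bad)

def chkdsk_scan(physically_bad, marked_bad):
    """Pretend Chkdsk scan: skip clusters already marked bad, report only new ones."""
    return sorted(c for c in physically_bad if c not in marked_bad)

def drive_level_scan(physically_bad):
    """Pretend vendor diagnostic: reads every sector and ignores the filesystem table."""
    return sorted(physically_bad)

bad_at_format = {120, 121, 9000}          # made-up faulty clusters present at format time
marked = full_format(bad_at_format)

chkdsk_report = chkdsk_scan(bad_at_format, marked)   # nothing new to report
vendor_report = drive_level_scan(bad_at_format)      # still sees all three

# A fault that develops after the format does show up in the Chkdsk-style scan.
later_report = chkdsk_scan(bad_at_format | {42}, marked)
```

In this sketch `chkdsk_report` comes back empty even though `vendor_report` lists three faulty clusters, which is the same kind of disagreement described in the first post.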

Now, suppose that the problem you are dealing with on the WD used drive is primarily a high count of "Bad Sectors". There is a way you can try to "fix that" so the drive is good again. Actually, you can NOT repair those sectors, but you may be able to permanently make them invisible, and hope that no more develop. Here's how that works.

Background: on modern HDDs the PC board in the unit contains a small microprocessor, a small BIOS with pre-programmed tasks, and some memory of a few types. This system runs its own checks on the unit in the background as it operates, and keeps track of things in its own storage facilities. None of this is exposed to the outside world, and Windows knows nothing of this. One of these tasks is allocation and testing of sectors on the platters.

At the time that the HDD is first manufactured and the Low-Level Formatting is done (that is, writing the magnetic tracks and defining them into individual Sectors), many more Sectors are created than are needed for the specified capacity. Then they are all tested, and any defective Sectors marked off. Then the number of sectors needed for the specified capacity is assigned for use, and the remaining good ones are recorded in a table of Good Spare sectors for later use. Afterwards, during normal use, whenever a sector is read, the quality of the signals coming back is assessed by the HDD's own mini system. If it decides the signals are erroneous or even just plain weak, it will re-read the sector to get the data correctly. Then it will replace that Sector with one of the Good Spares, writing the recovered data to that, and mark off the questionable sector as Bad and never to be used again. Then it will record the fact in the SMART tables that one more "Bad Sector" had to be replaced. So far, to the world outside the HDD, the unit has no problems at all. This process simply goes on all the time. BUT at some point (a pre-set limit), the SMART system sends out an error message that too many faulty sectors have been replaced. This means two things. One is that the stock of Good Spares is getting low, so continuing without remedial action risks getting to the point that there are no more Good Spares to use, and the automatic self-repair process will fail. The other is that as more failures develop, the rate of failure may be increasing, and that's a BIG warning that you should replace this HDD while the data on it are still good, and before a BIG failure means you can't actually get all your data back. For both reasons, this is a clear warning that you should replace this HDD soon while you can get a good clone copy made and before data is lost. (I had to do this recently because of this type of SMART error message. The cloning process took a longer time than normal because it kept finding faulty sectors, but the final clone copy on the new drive apparently had NO errors in it, and the operation was a success!)
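For anyone who finds it easier to follow as code, the self-repair cycle above can be sketched as a small Python simulation. This is only an illustration under made-up assumptions: the `ToyDrive` class, its spare count and its warning threshold are hypothetical and do not reflect any real drive firmware.

```python
class ToyDrive:
    """Hypothetical model of sector remapping; names and numbers are invented."""

    def __init__(self, spares, warn_threshold):
        self.spares = spares                  # Good Spares still available
        self.reallocated = 0                  # running count kept in the SMART table
        self.warn_threshold = warn_threshold  # pre-set limit that triggers a warning
        self.remap = {}                       # weak sector -> index of the spare used

    def read(self, sector, signal_ok):
        """Read a sector; if its signal is weak, move its data to a Good Spare."""
        if signal_ok or sector in self.remap:
            return "ok"                       # healthy, or already served from a spare
        if self.spares == 0:
            return "unrecoverable"            # self-repair fails: no spares left
        self.spares -= 1
        self.remap[sector] = self.reallocated
        self.reallocated += 1
        return "remapped"

    def smart_warning(self):
        """True once the replacement count reaches the pre-set limit."""
        return self.reallocated >= self.warn_threshold


drive = ToyDrive(spares=3, warn_threshold=2)
results = [drive.read(s, signal_ok=False) for s in (10, 11, 12, 13)]
```

With three spares and four weak sectors, the first three reads are quietly remapped and the fourth cannot be repaired, while `drive.smart_warning()` has already turned true after the second replacement. That is the window in which the drive still works but is telling you to clone it.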

So, that's the process of self-diagnosis and -repair on modern drives. Now there is a way to use this to "fix" (well, HIDE, really) a drive with this trouble. In Data Lifeguard one tool you will find is a Zero Fill of the HDD. BE AWARE: this process COMPLETELY DESTROYS all data on that HDD, so you will have to be able to restore everything on it later. A backup or clone copy on another unit is necessary. (Maybe you feel you already have such a copy: your old HDD.) If you run a Zero Fill on your WD used drive, the process will write all zeros to EVERY sector in use on the unit. As part of the background work, though, that also means that every write will be followed by a read and assessment of the signal quality and data integrity of that sector; any deficient sector will be replaced by a Good Spare automatically. If all goes well, by the time the Zero Fill is finished, there will be NO "Bad Sectors" in use at all. When it is done and you Partition the HDD and then run a Full Format on it under Windows, there will be no "Bad Sectors" discovered by Windows, and its own updated internal table will be empty.

The risk / problem with this process is threefold. If the HDD already has too many Bad Sectors, it is possible that the automatic repair process during the Zero Fill operation might actually run out of Good Spares to use, and it will fail. Secondly, this process temporarily ignores the potential future failure of the HDD. It still has a proven history of failing sectors and a depleted stock of Good Spares, so you are increasing your risk of actual HDD failure and data loss. Thirdly, this process will NOT reset the SMART tables, so the SMART error message of too many Bad Sector replacements will not go away.
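The first and third of those risks can be shown with a rough Python sketch of a Zero Fill pass. The `zero_fill` function and every number in it are hypothetical; a real drive does this inside its own firmware and none of it is visible like this.

```python
def zero_fill(total_sectors, weak_sectors, spares, reallocated_before):
    """Toy Zero Fill: write zeros to every sector, remapping any that fail verification."""
    reallocated = reallocated_before      # SMART history carries over; it is NOT reset
    for sector in range(total_sectors):
        if sector in weak_sectors:
            if spares == 0:
                # Risk one: the pass runs out of Good Spares and fails part-way.
                return {"completed": False, "failed_at": sector,
                        "spares_left": 0, "reallocated": reallocated}
            spares -= 1
            reallocated += 1
    return {"completed": True, "failed_at": None,
            "spares_left": spares, "reallocated": reallocated}


# Made-up drive with 40 earlier replacements and three more weak sectors.
good_run = zero_fill(1000, {5, 70, 900}, spares=10, reallocated_before=40)
bad_run = zero_fill(1000, {5, 70, 900}, spares=2, reallocated_before=40)
```

In `good_run` the pass completes, but the replacement count has gone up to 43 rather than back to zero, so a SMART warning based on that count would still be there. In `bad_run` only two spares were left, so the pass gives up at sector 900.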

Armed with this knowledge, you can make some judgments. Is the source of your HDD problems just this matter of high numbers of self-repairs already, or is there some other problem showing in the error messages? And, if that is the only problem, are you willing to do the Zero Fill and data restoration to "fix" it, knowing that the HDD's reliability is questionable?

Hah! And you apologized for YOUR "long rambling post"!
 

stevti

Mar 26, 2012
Hi paperdoc, many thanks for your informative reply :)

I now have a much better understanding of how HDDs work thanks to your response, hugely appreciated dude! I've gotta say even though I was cursing the PC when it failed again, I've enjoyed doing what I've done and the long hours of testing this, testing that etc as I've learned so much, so it's been good in that sense. But based on what you've said, and the results I got from the WD Diagnostics utility, I think it would be the wisest option to save up a bit longer and treat myself to a nice spanking new HDD, possibly solid state if my wallet will stretch that far. Yes, the used one I bought from eBay seems to have errors, but at the end of the day you get what you pay for and, well, I didn't pay much, I'll admit. The system works now, so the sooner I can afford to get a new one and get my data safely onto it the better.

Many thanks again for your response paperdoc,

Best Regards

Steve
 

Synchead

Sep 15, 2012
Hi

SMART data is still okay but the quick test in Data Lifeguard fails. When creating a system image it crashes and requests chkdsk. Every time I run chkdsk /r at the beginning of boot it claims that different bad sectors and files were fixed. Already got an RMA from WD.

What is going on? Do you think continuing the process will give me a system image that completes? Maybe it's better to install from scratch and just back up the data.

 
Data Lifeguard sometimes has trouble accessing your HDD via certain drivers or SATA controllers. In such cases a tool such as HD Sentinel or HDDScan might be a better option. In fact these tools produce a much more comprehensive SMART report than DLG. Look for reallocated, pending, or uncorrectable sectors.
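As a rough illustration of what to look for in such a report, here is a minimal Python sketch. It assumes you have copied the raw values from the SMART report by hand into a dictionary; the `sector_warnings` helper and the sample numbers are made up. The attribute IDs 5, 197 and 198 are the ones commonly used for these three counters, though names and availability can vary by vendor.

```python
# Commonly used SMART attribute IDs for the three counters named above.
WATCHED = {
    5: "Reallocated Sectors Count",
    197: "Current Pending Sector Count",
    198: "Offline Uncorrectable Sector Count",
}

def sector_warnings(raw_values):
    """Return one line per watched attribute whose raw value is above zero.

    raw_values: hypothetical dict of {attribute_id: raw_value} typed in
    from a SMART report; attributes missing from the dict count as zero.
    """
    return [f"{WATCHED[i]} (ID {i}): {raw_values[i]}"
            for i in sorted(WATCHED) if raw_values.get(i, 0) > 0]


report = {5: 305, 9: 12000, 197: 0, 198: 2}   # made-up example numbers
warnings = sector_warnings(report)
```

Any non-zero raw value in those three rows is worth watching, and a value that keeps climbing between checks is a stronger sign of trouble than a single static number.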

You might also like to check whether the drive is configured in RAID mode rather than AHCI mode in your BIOS setup. Sometimes this is the reason for problems with DLG.

An alternative way to test your drive with DLG is to use the bootable CD ISO version. If DLG is unable to detect your drive, then temporarily reconfigure your SATA controller in your BIOS setup for legacy or IDE compatibility mode. This will make your SATA drive look like PATA. Restore the original settings after the test.

 

rezoloot

Aug 26, 2013
I'm fairly sure I am suffering from answer number two syndrome. I got over 305 errors in one day. The disk had not been shaken or dropped; I simply ran HD Sentinel and on day two it said 305 errors. I've resorted to Data Lifeguard, and it failed the quick and extended tests. I ran write zeros to the drive, total success (no errors at all, very odd), then ran the quick test again, no errors. I'm running the extended test now; it's flying through that and I expect it to say all clear. My problem is that I did have the drive inside my machine and have since found that the SATA cables were, shall we say, old, dilapidated, knackered, you pick, so I changed them out.
OK, so on to the crux of the issue. Obviously, if HD Sentinel says voilà, all's well, then nothing else is needed. But if all the errors are still registered, does that mean that Sentinel needs to be refreshed and maybe that's the end of the error reporting? Or, as you said, even if the errors have now literally been wiped clear and the drive is actually good, could the SMART BIOS be set in stone and that's the end of it, or could the HDD's SMART BIOS be rewritten by itself?

God, I've been up for two days now doing this, not even sure I had an actual question in there.