[solved] silent data corruption on PATA USB enclosure

thetrivialstuff

Distinguished
Jun 4, 2011
22
0
18,520
I was going to ask this here today after Googling in vain for an answer, but I happen to have just solved it. I'll post it anyway, though, on the slim chance that it happens to anyone else:

I have an external USB HDD enclosure for PATA (IDE) drives. I use it with lots of different drives, and often bring it around with me to job sites in case there's a drive that needs data recovered off of it or something.

I also have an unreliable 160 GB IDE hard drive that I had to pull out of a server when it started getting bad sectors. SeaTools claimed to have fixed it, which I confirmed with a badblocks scan, so I relegated it to a backup / random unimportant crap drive.

A few months ago, I suddenly couldn't mount it any more. The drive was recognized by the OS, but the partition table was gone -- completely mangled. The drive returned no read errors; it acted as if it was reading just fine and I could read sectors off of it with dd. No symptoms of a dying drive.

This was alarming, because normally if data goes bad, the drive fails out with a read error. As far as I know, the probability of bad data quietly passing the ECC check is so low that it's effectively nil. Finally, in the hex view of the sectors, I saw what was going on: certain bytes looked like they were translated into others, as if someone had run the tr command on the whole drive. Every occurrence of the letter "n" had become a "p", for instance (or something like that). No read errors, but the data had quietly changed all across the whole drive.

Now I doubt many people would go and look at their drive with a hex editor, so I bet if this kind of thing happens elsewhere it's normally interpreted as "oh crap the drive is dead / the data is corrupted / my computer or a virus did something to it / I held it too close to a really big magnet".

Here are some hex dumps off the drive. The first one is what's actually on there:

00000000 6e 74 6c 64 72 20 69 73 20 63 6f 6d 70 72 65 73 |ntldr is compres|
00000010 73 65 64 01 00 01 00 01 00 01 00 01 00 01 00 01 |sed.............|

Now here is the exact same piece of data, when the drive is malfunctioning:

00000000 6e 70 6c 60 72 20 69 70 20 60 6f 68 70 70 65 70 |npl`r ip `ohppep|
00000010 73 60 64 00 00 00 00 00 00 00 00 00 00 00 00 00 |s`d.............|

See? In places, 't', 'r', and 's' have become 'p', and 'c' and 'd' have become '`'; the 01 bytes have all become 00, so some of their 1 bits have been switched off.
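For anyone who wants to check a pair of dumps like these, here is a minimal Python sketch that compares them byte by byte. It is not part of the original troubleshooting. The function name `bit_changes_by_lane` is made up for illustration, and treating even and odd offsets as the two halves of the 16-bit PATA data bus is an assumption about how this enclosure's bridge chip orders bytes.

```python
# Bytes copied from the two hex dumps above (offsets 0x00 to 0x1f).
GOOD = bytes.fromhex(
    "6e 74 6c 64 72 20 69 73 20 63 6f 6d 70 72 65 73 "
    "73 65 64 01 00 01 00 01 00 01 00 01 00 01 00 01"
)
BAD = bytes.fromhex(
    "6e 70 6c 60 72 20 69 70 20 60 6f 68 70 70 65 70 "
    "73 60 64 00 00 00 00 00 00 00 00 00 00 00 00 00"
)


def bit_changes_by_lane(good, bad):
    """Collect which bits were cleared or set, split by even/odd byte offset.

    Returns (cleared, turned_on); each is a two-item list of bit masks,
    index 0 for even offsets and index 1 for odd offsets.
    ASSUMPTION: even/odd offsets correspond to the two byte lanes of the bus.
    """
    cleared = [0, 0]
    turned_on = [0, 0]
    for offset, (g, b) in enumerate(zip(good, bad)):
        lane = offset % 2
        cleared[lane] |= g & ~b & 0xFF    # 1 in the good copy, 0 in the bad copy
        turned_on[lane] |= ~g & b & 0xFF  # 0 in the good copy, 1 in the bad copy
    return cleared, turned_on


cleared, turned_on = bit_changes_by_lane(GOOD, BAD)
print("bits cleared   (even, odd):", [f"{m:#04x}" for m in cleared])
print("bits turned on (even, odd):", [f"{m:#04x}" for m in turned_on])
```

For these 32 bytes, every even-offset byte is unchanged, every odd-offset byte has lost its low three bits, and no bit has been turned on anywhere. That is what you would expect if three data lines on one half of the bus were not making contact and read as 0.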

The drive was working fine this morning and I couldn't get this to happen (it had been happening about every other time I plugged the drive in before, so naturally it would work fine when I finally got around to writing about it...). So, I set about deliberately not plugging the USB cable in all the way, resetting the power unexpectedly, running it with the enclosure open... everything I could think of. My leading theory before was the USB cable, but I could never be sure which cable it was (I have a few identical ones, so I thought one of them was bad and I kept forgetting to mark it).

I was close; it was indeed a cable -- apparently, there is no form of error correction when data is transmitted across a PATA cable. When I pulled it out enough to make a couple of the pins lose contact, that's when I got the random, silent character translation, and a drive that acted as if everything was normal.

So, the drive is fine. If this ever happens to you, it's your PATA cable that's bad, or just loose. I think the connector on this one has gotten worn a bit from being plugged and unplugged so many times.

Can anyone with intimate knowledge of PATA confirm that there really is *no* ECC checking of any kind between the drive and the PATA controller? I find that hard to believe, and I suspect that maybe my USB enclosure just has a really cheap controller chip in it.

Anyway... there you have it. Mystery solved. Anyone know where I can get a replacement IDE cable that's about 2 cm long? :p

~Felix.

PS: Fortunately, as strange as this was, it's not harmful to data if you're running Linux. It'll look like your partition table is gone, but as long as you don't try to write to the drive or reformat it, your data is OK after you wiggle the PATA connector. Windows, however, managed to completely destroy a filesystem on the drive when it was exhibiting symptoms -- apparently if Windows sees something that looks like an NTFS filesystem, it'll try to write a bunch of data (maybe trying to fix it) before showing the "oops, this drive isn't formatted; would you like me to format it now?" dialogue box.
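If you want a quick sanity check before letting any OS write to a drive in this state, one option is to read the first sector read-only and look at the boot signature. This is a minimal Python sketch, assuming the drive uses a classic MBR (bytes 0x55 0xAA at offsets 510 and 511). The function names are made up for illustration and the device path is whatever your system assigns.

```python
SECTOR_SIZE = 512


def has_mbr_signature(sector: bytes) -> bool:
    """True if the buffer is a full sector ending in the MBR boot signature."""
    return len(sector) >= SECTOR_SIZE and sector[510:512] == b"\x55\xaa"


def read_first_sector(device_path: str) -> bytes:
    """Read sector 0 of a block device or image file.

    The file is opened read-only ("rb"), so nothing is written to the drive.
    """
    with open(device_path, "rb") as dev:
        return dev.read(SECTOR_SIZE)


def check_device(device_path: str) -> str:
    """Return a short verdict for the device (hypothetical helper)."""
    if has_mbr_signature(read_first_sector(device_path)):
        return "MBR signature present"
    return "MBR signature missing or corrupted; check the cable before writing anything"
```

On Linux you would call something like `check_device("/dev/sdX")` as root, with the real device name filled in. If the enclosure is dropping the low three bits of odd-offset bytes, as in the sample dumps, the 0xAA at offset 511 would read back as 0xA8 and the check would fail even though the data on the platters is untouched.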
 

John_VanKirk

Distinguished
Hi Felix

Boy, you are really getting down to the nuts and bits, so to speak. Never thought about error correction over those old cables.

Found 2 references both talking briefly about CRC added to Ultra ATA-3 because of corruption possibility in the control packets.

Ultra ATA is not a formal standard but rather a term that refers to the use of the higher-speed DMA-33 transfer mode (multiword DMA mode 3), running at 33.3 MB/s. Special error detection and correction logic (CRC) is used to support the use of this high-speed mode over a standard IDE/ATA ribbon cable (which has not changed since transfer rates were below 5 MB/s and can now be a problem in terms of corruption when used at very high speeds).

Ultra DMA introduced CRC-based error detection in data packets as part of the ATA-3 standard. However, no parallel ATA standard offers error detection in command or status packets. Even though the size and frequency of occurrence of command and status packets is small, the probability of errors occurring in them cannot be dismissed.
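To illustrate why a CRC on the data would catch this kind of fault, here is a rough Python sketch. It uses the CRC-CCITT routine from the standard library (`binascii.crc_hqx`) as a stand-in; the seed value and the way bytes are fed in are assumptions and are not claimed to match what Ultra DMA hardware does. The function names are made up for illustration.

```python
import binascii


def crc16(data: bytes, seed: int = 0xFFFF) -> int:
    """16-bit CRC-CCITT over the data (illustrative seed, not the ATA one)."""
    return binascii.crc_hqx(data, seed)


def transfer_ok(sent: bytes, received: bytes) -> bool:
    """Each end computes a CRC over what it saw; a mismatch flags the transfer."""
    return crc16(sent) == crc16(received)


# First two bytes of the sample dumps: "nt" on the disk, "np" after the cable.
print(transfer_ok(b"nt", b"nt"))  # same bytes at both ends
print(transfer_ok(b"nt", b"np"))  # one bit dropped in transit
```

A single dropped bit always changes a CRC like this one, so a bridge chip that actually checked the result would report an error instead of passing the bad data along quietly.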

With such a short cable, could some of the ground lines become ineffective in preventing crosstalk at the higher starting voltages?

Nice job in pinning down a real problem.


 

thetrivialstuff

Of course -- I should have remembered that, because Ultra-ATA CRC errors are how I first learned about the need for 80-conductor cable, when that was first coming out :)

I think the controller in this USB enclosure is cheap, because it transfers at UDMA-33 speeds, but doesn't appear to be using CRC.



That's quite possible, actually -- the problem didn't become really frequent until I was using this PATA enclosure less and less, which motivated me to screw the drive in properly. Before that I was constantly swapping drives in and out of it, so I never bothered with the screws -- which let the drive sit farther away and stretched the cable out longer. When it's screwed in, the cable is folded over and the two connectors are practically touching -- and they're offset a few millimetres horizontally, so crosstalk might be an issue.

I think that small cable or its connectors are definitely bad now, though -- I turned the cable around, in case the more worn end could get a better fit on the controller side than the drive side, and the drive started eating writes (I send it some data and it acts normal from the computer side of things, but the write LED stays off and then the drive offlines itself). Maybe I should just get a better enclosure :p

~Felix.