Silent file corruption due to file system metadata... Why?

xviruz

Distinguished
Feb 24, 2015
2
0
18,510
I have a peculiar silent file corruption problem, where file system metadata gets inserted into the middle of files. The problem occurs rarely and is really only detectable by checksumming the files beforehand and running a checksum verification afterwards. I'm fairly certain now that it's either a SATA controller issue or a faulty SATA connection, but I'm all ears for other potential causes. That said, I would mainly like some help working out why it is always the same ~16KB block of data that ends up corrupting the files.

Some background... I'm running Windows 7 x64, with an Asus P8Z68-V Pro motherboard, 2x4GB Corsair CMZ4GX3M1A1600C9 RAM, and a Corsair TX750 PSU. I've had this setup for 4 years now and only started encountering this issue in late 2014. The corruptions have happened on a WD20EARS, a WD30EZRX, and an ST4000DM000, all of which were connected to the motherboard SATA ports driven by the Marvell PCIe SATA controller (NOT the Intel Z68 chipset). I also have an SSD, 2x 1TB drives, and another set of WD20EARS and WD30EZRX drives on the Intel Z68 SATA ports, and these are all fine. All the drives are single-volume GPT disks connected directly to the ports via SATA cable.

I keep checksums of all the files on each of the drives as a "last known good state". The corruption tends to occur after writing some amount of data to the drive (usually several GB) and then performing a "verify checksum" over the entire drive: the corrupt file shows up with a checksum mismatch. I have not tested without writing data. As an example, I had the WD30EZRX drive sitting on my desk for the past month or two (it's a backup drive), and all files had matching checksums when it was last plugged in. The day before, I connected it to the Marvell port (shut down the computer, plugged it in, powered up) to copy some data over (~30GB) and then ran a full checksum (I did not defrag the drive or manually open any of the files, etc.). There was a mismatch on an old ~8GB file due to a ~16KB chunk somewhere in the middle of it. However, at the same time, I also connected a WD20EARS in the same fashion, copied over ~5GB, did a full verify, and it had no errors---hence the problem being "rare".
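For anyone wanting to reproduce the "last known good state" approach, here is a minimal Python sketch of it. The function names (`file_sha256`, `build_manifest`, `verify_manifest`) and the choice of SHA-256 are assumptions made for illustration; the post does not say which tool or hash was actually used.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def verify_manifest(root, manifest):
    """Return relative paths that are missing or whose digest has changed."""
    mismatched = []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_sha256(full) != expected:
            mismatched.append(rel)
    return mismatched
```

The manifest would be built once while the drive is known to be good, stored somewhere off that drive, and checked with `verify_manifest` after each batch of writes.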

Now, on to the corruption. The corruption always overwrites 16476 bytes of a file. Its location seems to be random---it can show up in a very old file or a new one, so I suspect some weird addressing error. The corrupt data consistently appears as 16400 bytes that are always the same, followed by 76 more bytes that differ slightly (I assume some parts are unique IDs, hence the difference; they are structurally similar). You can see the hex for the first 16400 bytes here, as well as two samples of the remaining 76 bytes here and here. For the first 16400 bytes, the string represented by the hex is:
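If a known-good copy of a corrupted file is still available, the offset and length of the overwritten region can be measured directly. The sketch below is one way to do that; `diff_span` is a hypothetical helper, not the tool used to obtain the numbers above. Note that it reports the span from the first differing byte to the last, so it will under-report if the overwriting data happens to match the original at either edge.

```python
def diff_span(good_path, bad_path, chunk_size=1 << 20):
    """Return (first_offset, length) spanning all differing bytes, or None."""
    first = last = None
    offset = 0
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        while True:
            a = good.read(chunk_size)
            b = bad.read(chunk_size)
            if not a and not b:
                break
            if a != b:
                # Only walk byte by byte through chunks that actually differ.
                for i in range(max(len(a), len(b))):
                    if a[i:i + 1] != b[i:i + 1]:
                        if first is None:
                            first = offset + i
                        last = offset + i
            offset += max(len(a), len(b))
    if first is None:
        return None
    return first, last - first + 1
```

Checking whether `first_offset` is a multiple of 512 or 4096 might also hint at whether the stray write lands on a sector or cluster boundary.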

.ãÉã\.¸M.}ù-ð..®./Ü–\.æC¯ô}.é8îª".......!...............M.i.c.r.o.s.o.f.t. .r.e.s.e.r.v.e.d. .p.a.r.t.i.t.i.o.n.................¢ Ðëå¹3D‡Àh¶·&™ÇéM8×]©»F›Ø*.Î<ˆP........ÿ·ÀÑ............B.a.s.i.c. .d.a.t.a. .p.a.r.t.i.t.i.o.n.......[...].......EFI PART....\...

I've used "[...]" to denote the omitted 0x00s (or "."s). As you can see, it's something related to partition/volume metadata.
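One possible reading of the numbers, offered as an interpretation rather than a confirmed diagnosis: in the GPT layout described by the UEFI specification, the header is 92 bytes and starts with the signature "EFI PART", and the partition entry array is commonly 128 entries of 128 bytes (16384 bytes). In the backup copy at the end of a disk, the entry array comes immediately before the header, and 16384 + 92 = 16476. Likewise 16384 + 16 = 16400 would cover the array plus the signature, revision, and header-size fields, with the remaining 76 bytes holding CRCs, LBAs, and the disk GUID. The Python sketch below decodes such a fragment; `describe_gpt_fragment` and the other names are hypothetical, and it assumes the fragment has been extracted into a bytes object.

```python
import struct

# Field layout of the 92-byte GPT header (little-endian).
GPT_HEADER = struct.Struct("<8sIIIIQQQQ16sQIII")
# One 128-byte partition entry: type GUID, unique GUID, first LBA,
# last LBA, attributes, then a 72-byte UTF-16LE name.
GPT_ENTRY = struct.Struct("<16s16sQQQ72s")


def parse_gpt_header(buf):
    """Decode a GPT header, or return None if buf does not look like one."""
    if len(buf) < GPT_HEADER.size or buf[:8] != b"EFI PART":
        return None
    (_sig, revision, header_size, header_crc, _reserved, my_lba, alt_lba,
     first_usable, last_usable, _disk_guid, entries_lba, num_entries,
     entry_size, entries_crc) = GPT_HEADER.unpack_from(buf)
    return {
        "revision": revision, "header_size": header_size,
        "header_crc": header_crc, "my_lba": my_lba, "alt_lba": alt_lba,
        "first_usable": first_usable, "last_usable": last_usable,
        "entries_lba": entries_lba, "num_entries": num_entries,
        "entry_size": entry_size, "entries_crc": entries_crc,
    }


def describe_gpt_fragment(blob):
    """Find 'EFI PART' in blob and list partition names stored just before it."""
    pos = blob.find(b"EFI PART")
    if pos < 0:
        return None
    header = parse_gpt_header(blob[pos:pos + GPT_HEADER.size])
    if header is None:
        return None
    entry_size = header["entry_size"]
    start = pos - header["num_entries"] * entry_size
    names = []
    if start >= 0 and entry_size >= GPT_ENTRY.size:
        for i in range(header["num_entries"]):
            raw = blob[start + i * entry_size:][:GPT_ENTRY.size]
            type_guid, _uniq, _first, _last, _attrs, name = GPT_ENTRY.unpack(raw)
            if type_guid == bytes(16):
                continue  # an all-zero type GUID marks an unused entry
            text = name.decode("utf-16-le", errors="replace")
            names.append(text.split("\x00", 1)[0])
    return {"array_offset": start, "header_offset": pos, "names": names, **header}
```

If `my_lba` in a decoded fragment turns out to be larger than `alt_lba`, that would suggest the bytes came from a backup header near the end of some disk rather than the primary one at LBA 1.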

My knowledge about NTFS and GPTs is largely non-existent, so I haven't the faintest clue why it's always this data. Any ideas or experiences? (And thanks for reading!)
 
FYI, Teracopy supports verify on copy. That is, it reads a source file and calculates a checksum, writes the file to the destination, then reads the destination file back and confirms that its checksum matches the source. It might be an easier way to checksum files than doing the entire drive at once like you're doing. It even gives you the option to re-copy files which failed the checksum.
https://codesector.com/teracopy
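The verify-on-copy idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Teracopy implements it; `copy_with_verify` and its `retries` parameter are made up for the example. One caveat: a read-back immediately after writing may be served from the operating system's cache rather than from the disk, so a match here is weaker evidence than a verify done after reconnecting the drive.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verify(src, dst, retries=1):
    """Copy src to dst, re-copying up to `retries` times on a checksum mismatch."""
    expected = sha256_of(src)
    for _attempt in range(retries + 1):
        shutil.copyfile(src, dst)
        if sha256_of(dst) == expected:
            return True
    return False
```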

Can't help explain how the corruption is happening. I'd just pay for a new SATA card rather than waste time trying to diagnose the problem.
 

xviruz

Thanks for the link. I'm already using Teracopy to make sure the new files don't get messed up. However, the corruption doesn't always affect new files---I could be copying 30 new files over, have matching checksums for all of them, and then have an existing old file become corrupt. That's why I've been using the rather tedious full-drive checksum method as well.