Looking For A Fix
Not surprisingly, it can sometimes be difficult to tell which component in a drive has failed. Here, an SRS technician examines a read-write head apparatus for damage or defects under a microscope. In some cases, data can be rescued simply by replacing such faulty components with spares from the SRS stockpile, reassembling the drive, and firing it up while connected to a recovery system. If this doesn’t sound like a recipe for a 100% healthy drive, understand that it doesn’t have to be. During recovery, SRS only needs to spin up most drives one or two times. That’s enough for technicians and their custom software to pull all of the data that can be salvaged off the drive. The original drive usually doesn’t go back to the customer. All in all, SRS claims a greater than 90% success rate in returning data to clients.
200 Heads Are Better Than One
“Fixing drives gets harder over the months and years, not easier,” says Peter Oswald. “Just look at the evolution of head technology from magnetoresistive to perpendicular to heat assisted, on and on. A while back, a lot of shops were doing data recovery, and the vast majority of that is software-related or real easy logicals. Maybe you’re taking a board off of doing software corruption on the drive itself. But the ones that are really difficult are the mechanical, where you have to tear down deep into the drive and recover data on the platter surfaces or having to replace a head.”
When Files Go Afoul
Seagate deals with a wide range of logical damage recoveries, including my own. Users might have a NAS or a SAN in which someone accidentally deletes a folder. The technician must then, say, find millions of file fragments to piece together complex databases that could be spread across dozens of hard drives. Logical faults revolve around file structures rather than drive mechanics. Seagate says, though, that most RAID recovery cases involve both physical and logical damage.
In some instances, company policy won’t allow for even failed drives to be sent off-premises. Obviously, the cost for on-site recovery would be significantly higher than normal, but it can be done.
I asked Seagate to show what my own RAID recovery looked like, and this was the result.
“I know you want eye candy,” says Peter Oswald, “but most of what we do is going to look really normal and ordinary to people who don’t know what we’re doing. This controller you see here looks like a plain card, but it was designed specifically for us, just like the software. It’s controlling the power…just many, many components of these drives. It’s controlling very small details that go on in the background.”
In my case, my drives weren’t responding in a timely manner to the NAS system’s requests, and so the request would time out. But whatever in the NAS “environment” was causing that issue was likely also impacting the other drives. So eventually, the enclosure system tagged one drive as bad, continued to operate in degraded RAID 5, and when the second drive failed, the RAID collapsed. Seagate had to construct a new environment for the four drives in which the drives were instructed to believe they were healthy. With this done, technicians could begin copying my data out to other storage. Interestingly, this doesn’t mean that the RAID was back in working order. The highest priority was simply to extract my bits from the drives and copy them in a “de-striped” state onto known good media. Then the recovery crew could begin trying to piece together my file structures and reconstruct the original four-disk volume’s architecture.
All told, Seagate spent roughly 28 man-hours recovering my data. Much of this involved identifying the critical data structures across all four drives, determining the correct stripe sizes, and finding out the ages of the data to see which came before and after the loss of the first drive.
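For readers curious why a RAID 5 array can survive one lost drive but not two, the underlying arithmetic is simple XOR parity. The sketch below is only an illustration of that general principle, not Seagate's tooling: the block contents, the four-block stripe, and the function names are all made up for the example. Working out which blocks belong to the same stripe on real drives (the stripe size and parity rotation mentioned above) is the hard part, and it is not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe, missing_index):
    """Rebuild one block of a stripe by XORing all of the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


# Hypothetical stripe across four drives: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# Pretend the second drive is gone and recover its block from the other three.
rebuilt = rebuild_missing(stripe, 1)
```

With one block missing, the XOR of the rest returns it exactly. With two missing, there is no longer enough information, which is why the array collapsed when the second drive dropped out.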
As part of the drive evaluation and repair steps, Seagate SRS runs an app called Drive Repair and Unlock Tool. Seagate wrote and maintains the title, and many of the company’s engineers use this software to work with and remediate drives. The app tells engineers what is going on with the drive, particularly what is going wrong. “But it’s not always a science,” says Peter Oswald. “Half of this is art and feel.”
Copy First, Questions Later
The number one thing SRS seeks to achieve is copying data out from the damaged drive onto reliable media. Everything else, including trying to figure out what went wrong in the original hardware, is secondary. Fortunately, the company uses some very sophisticated, proprietary tools designed to read data in ways that regular operating system and drive commands cannot.
“This software was written for Seagate SRS and serves to extract a complete copy of the target drive’s data,” says Peter Oswald. “Essentially, the application lets us handle situations with drives that a normal computer can’t deal with. Green here is good. In what you see on this screen, we’ve physically repaired the drive and reattached it to the system. It’s responding to us in a manner that we expect. And we’re duplicating the data from the source, the client’s drive, to one of our known working drives so we can move forward in the process.”
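Seagate's imaging software is proprietary, so the following is only a rough sketch of the general copy-first idea: read the source block by block, keep going when a block fails, and record what could not be read. The function names, the 512-byte block size, and the simulated bad block are assumptions for illustration, not details from SRS.

```python
import io


def image_source(read_block, total_blocks, dest, block_size=512):
    """Copy every readable block to dest; zero-fill and record unreadable ones."""
    bad_blocks = []
    for index in range(total_blocks):
        try:
            block = read_block(index)
        except OSError:
            # Keep the image aligned by writing a placeholder of the same size.
            block = bytes(block_size)
            bad_blocks.append(index)
        dest.write(block)
    return bad_blocks


def flaky_read(index, block_size=512):
    """Simulated source drive (hypothetical): block 2 always fails to read."""
    if index == 2:
        raise OSError("simulated read error")
    return bytes([index]) * block_size


dest = io.BytesIO()
bad = image_source(flaky_read, 4, dest)
image = dest.getvalue()
```

The list of bad blocks lets a later pass retry only the trouble spots instead of stressing the whole drive again, which fits the goal of spinning a repaired drive up as few times as possible.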
Deep In The Data
This is a view of data at the lowest level possible on a hard drive. Specifically, the application is a hex editor, which allows engineers to view and manipulate user data at the binary level. Peter Oswald notes, “When we look for specifics about your logical recovery, we start looking for certain things, depending on the criteria of the recovery...different structures and data points. This is how we do it, here in hex.” Not only can technicians see the underlying form of the data, they can sometimes repair broken file elements and turn corrupted files back into functional ones.
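To give a feel for what a hex view shows, here is a minimal sketch of a hex dump, assuming nothing about the editor SRS actually uses. The `hex_dump` function and the sample bytes are made up for the example. The sample starts with the standard eight-byte PNG file signature, the kind of known marker someone might look for when hunting for file structures in raw data.

```python
def hex_dump(data, width=16):
    """Return lines showing offset, hex bytes, and printable ASCII for data."""
    pad = width * 3 - 1  # width of a full row of "xx " groups, minus the last space
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return lines


# PNG signature followed by four zero bytes (illustrative sample only).
sample = b"\x89PNG\r\n\x1a\n" + b"\x00" * 4
lines = hex_dump(sample)
```

Each line pairs the raw byte values with their text equivalents, so a recognizable string such as "PNG" stands out even when the surrounding bytes are unprintable.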
Scope This Out
SRS accepts practically any sort of storage device for recovery, including SSDs and flash drives. The first challenge with recovering a dead flash drive is to figure out which components on the device are failing. Here, the technician is looking for a signal to pass through from the processor to the flash component under test. The idea is to confirm that commands are flowing and being processed. In this way, poring over flash chips, transistors, and so on, engineers can get to the root cause of the failure.
A Hot Fix
While the storage medium may change, the recovery process for flash and magnetic disk devices is fairly similar. When faulty components are identified, they must be replaced. While SRS staff didn’t have the backstory of this drive on hand during our photo shoot, it was clear that the little unit had taken quite a beating. Several components needed replacing. With some careful soldering work, skilled hands can get a busted drive working again in fairly short order.
Chucking The Chip
“The most common thing we’ll see with flash drives in particular is people will have it in their computer, working with files on it, and then their arm will come down on it or whatever and break off the drive,” says Peter Oswald. “But we see drives stepped on, chewed on, dropped in water, run through the wash; everything. The portability is what creates more opportunity for damage.”
In this image, a desoldered flash memory module is being removed from the busted thumb drive.