I’ve had a four-drive NAS in my closet for years. One day, one of the four drives failed. Because I had the drives configured in RAID 5, the array continued to work just fine. I mistook the somewhat slower performance of degraded mode for a consequence of approaching the array’s physical capacity. The enclosure never alerted me to a problem. So when a second drive failed, all of my data (family photos and videos, music collection, two decades of work, everything) was gone. Poof. Instantly. And, because of circumstances too embarrassing for a technology professional to relate, that NAS contained my only copy of all of that data. You could hear my screams from blocks away.
Misery loves company and, of course, I am far from alone. Back in 2011, Tom’s Hardware showed that hard drive failure rates within three years can range up to 20%. SSD rates are better, but go ask Linus Torvalds if that was any consolation for his dead workstation.
In a blind panic, I called the biggest name in disk disaster recovery, Seagate Recovery Services, and in the process stumbled into a fascinating photo story. What follows is not meant to be a commercial for Seagate. The company did not pay for this coverage. The Tom’s Hardware editors and I simply recognized that a lot of people need recovery help, and a glimpse behind the curtain at how those operations get done might be enlightening for consumers and business users alike.
This page's image source: Wikipedia. Nearly all other artwork in this article is by Peter Panayiotou of Panayiotou Photography, Inc. (www.panayiotou.com).
SRS received 18,000 recovery cases last year. With the rollout of its Rescue & Recovery service, Seagate expects that number to climb well past 30,000 cases annually going forward. These can take the form of anything from a USB thumb drive to a multi-drive SSD array to a storage area network consisting of hundreds of hard drives. Of course, in order to service those devices, spares must often be on-hand, which means that, of necessity, SRS has access to one of the most diverse drive collections in the world. Old school SCSI? Iomega Zip? It's all there.
When submitting a drive to SRS, you begin on the company’s website, filling out a questionnaire about your storage problem, the suspected nature of the failure, observations, what was happening prior to failure, and so on. This creates the start of a case profile that stays with the drives as Seagate receives and logs them all the way through the eventual return of your data.
Interestingly, the first step in diagnosing a faulty drive out of the box is not to analyze it with software. No, the first thing that technicians do is put power to the drive, then listen to it and feel it. The customer’s original input when reporting the drive gets taken into account, but drive experts can also tell a fair bit by how the drive vibrates, the kinds of clicking patterns it makes, and, of course, if there are any grinding noises. Obviously, grinding is bad, and a drive making such noises would be halted immediately. In general, technicians try to run the drive as little as possible at this stage. But if the drive sounds and feels like it is in proper mechanical working order, then it moves to the next stage of pre-evaluation.
After a cursory mechanical examination, technicians connect the drive to a test system and see if it can perform basic tasks, such as coming up during boot, obtaining a volume letter, and performing read/write operations. At this early stage, the point is not to commence repairs. Rather, techs only want to get a fix on what area within SRS should receive the drive for further analysis and recovery work.
This is a medical-grade HEPA clean bench used to control the air around the drive while it’s open. According to Seagate, such benches eradicate particulate debris down to the microscopic level within a space of a few cubic feet, which is all that’s needed for most drive examination and repair. The back wall constantly pushes filtered air out into the chamber, creating positive pressure and forcing any airborne contaminants back away from the bench.
While the Chicago facility where we shot most of the photography in this story was in the midst of taking down its clean room during our visit, other SRS sites, including Oklahoma City, can take clean room conditions beyond the HEPA-class workbench. While not common, some recovery situations call for spinning up platters in environments that are essentially free from the risk of debris contamination. Even one speck of dust between a drive head and the underlying platter can cause further damage and data loss, and some jobs demand absolute assurance of the most favorable recovery conditions.
Once the drive lid comes off, techs can get a much better sense of the physical damage that might be in play. Says Seagate’s Peter Oswald, "I’ve seen it all. Dogs chew on them. People take hammers to them. Fires where not only was the drive burned but the fire department completely doused it in water. People forget that their laptop bag is sitting next to their car when they pull out, and then they run over it. And of course we see drives that come from natural disasters."
Many people don’t realize that most 3.5” hard drives have at least one filter in them. Because these filters are installed under totally clean manufacturing conditions, they should remain totally white. However, the filter plays a key role in recovery diagnostics because techs can immediately tell if the drive heads have made contact with the platter surfaces. Such contact plows a furrow into the media, resulting in dust and debris flying off the platters and into the air circulating around inside the drive. The filter captures many of these particles and turns dark from the debris. Or, as Seagate puts it, “That discoloration is your data scraped off the media.” In addition, with the particles removed from circulation, the drive may continue to function longer, allowing the user and/or recovery techs to retrieve data without further media damage.
Especially if you click on this image to view its higher-resolution version, it’s clear that at least the topmost platter of this drive has seen better days. Hard drive platters should look like mirrors. Here, though, roughly the outermost half of the platter is scarred in a circular fashion from head crashing. Admittedly, most head crashes are not this bad; we wanted to photograph something dramatic. But when there is evidence of damage, techs will want to get a quick sense of whether that damage extends to several platters. In this case, there was dust covering all platter surfaces, and the drive needed to be completely dismantled so that techs could get a better view of what was really going on. “When I do this kind of examination,” says Peter Oswald, “I’m looking for particles that catch the light and give me more sense of where I need to go.”
As you might expect, disassembling a hard drive isn’t like pulling apart LEGOs. One little quiver is enough to have the drive heads gouge out new furrows in already damaged platters. As a result, SRS engineers use a special claw-like tool to make the process safer and quicker. “That piece, which is specifically built for Seagate, allows us to remove the heads more safely than anybody else in the industry can,” says Peter Oswald. “You see how the heads are sitting over the very innermost portion of the drive? That’s a safe zone for those heads to sit and land. But that device allows us to go in and safely pick those heads up and clear them from the surface of the platter and remove them safely.”
Piece by piece, technicians dismantle the drive for cleaning and deep examination. Platters get stored into special carriers. In rare cases, those platters may be installed in highly specialized machines designed to perform deep examination of media tracks. Usually, though, technicians can rebuild the original drive, replacing whatever components (such as heads) are necessary in order to get that brief bit of life needed to extract the drive’s contents. Interestingly, Peter Oswald notes that his recovery team can harness software to give very precise directions to drives on how they should attempt to read data. This is one reason why drives are sometimes disassembled. Technicians use visual observation to help them pinpoint what disk areas to focus on.
Not surprisingly, it can sometimes be difficult to tell which component in a drive has failed. Here, an SRS technician examines a read-write head apparatus for damage or defects under a microscope. In some cases, data can be rescued simply by replacing such faulty components with spares from the SRS stockpile, reassembling the drive, and firing it up while connected to a recovery system. If this doesn’t sound like a recipe for a 100% healthy drive, understand that it doesn’t have to be. During recovery, SRS only needs to spin up most drives one or two times. That’s enough for technicians and their custom software to pull all of the data that can be salvaged off the drive. The original drive usually doesn’t go back to the customer. All in all, SRS claims a greater than 90% success rate in returning data to clients.
“Fixing drives gets harder over the months and years, not easier,” says Peter Oswald. “Just look at the evolution of head technology from magnetoresistive to perpendicular to heat assisted, on and on. A while back, a lot of shops were doing data recovery, and the vast majority of that is software-related or real easy logicals. Maybe you’re taking a board off of doing software corruption on the drive itself. But the ones that are really difficult are the mechanical, where you have to tear down deep into the drive and recover data on the platter surfaces or having to replace a head.”
Seagate deals with a wide range of logical damage recoveries, including my own. Users might have a NAS or a SAN in which someone accidentally deletes a folder. The technician must then, say, find millions of file fragments to piece together complex databases that could be spread across dozens of hard drives. Logical faults revolve around file structures rather than drive mechanics. Seagate says, though, that the majority of RAID recovery cases involve both physical and logical damage.
In some instances, company policy won’t allow for even failed drives to be sent off-premises. Obviously, the cost for on-site recovery would be significantly higher than normal, but it can be done.
I asked Seagate to show what my own RAID recovery looked like, and this was the result.
“I know you want eye candy,” says Peter Oswald, “but most of what we do is going to look really normal and ordinary to people who don’t know what we’re doing. This controller you see here looks like a plain card, but it was designed specifically for us, just like the software. It’s controlling the power…just many, many components of these drives. It’s controlling very small details that go on in the background.”
In my case, my drives weren’t responding in a timely manner to the NAS system’s requests, and so the request would time out. But whatever in the NAS “environment” was causing that issue was likely also impacting the other drives. So eventually, the enclosure system tagged one drive as bad, continued to operate in degraded RAID 5, and when the second drive failed, the RAID collapsed. Seagate had to construct a new environment for the four drives in which the drives were instructed to believe they were healthy. With this done, technicians could begin copying my data out to other storage. Interestingly, this doesn’t mean that the RAID was back in working order. The highest priority was simply to extract my bits from the drives and copy them in a “de-striped” state onto known good media. Then the recovery crew could begin trying to piece together my file structures and reconstruct the original four-disk volume’s architecture.
All told, Seagate spent roughly 28 man-hours recovering my data. Much of this involved identifying the critical data structures across all four drives, determining the correct stripe sizes, and finding out the ages of the data to see which came before and after the loss of the first drive.
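For readers curious why a RAID 5 array shrugs off one dead drive but not two, here is a minimal Python sketch of the underlying idea. It is not Seagate's tooling, and the rotating-parity layout, block size, and function names are all assumptions chosen for illustration; real controllers use their own layouts, which is part of why recovery crews have to work out stripe sizes by hand. Each stripe stores one parity block that is the XOR of the data blocks, so any single missing block can be recomputed from the survivors.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_position(stripe_index, n_drives):
    """Assumed layout: parity starts on the last drive and rotates backward."""
    return (n_drives - 1 - stripe_index) % n_drives


def stripe_raid5(data, n_drives, block_size):
    """Spread data over n_drives, one parity block per stripe."""
    drives = [[] for _ in range(n_drives)]
    per_stripe = (n_drives - 1) * block_size
    data = data + bytes(-len(data) % per_stripe)  # zero-pad to whole stripes
    for s, start in enumerate(range(0, len(data), per_stripe)):
        chunk = data[start:start + per_stripe]
        blocks = [chunk[i:i + block_size] for i in range(0, per_stripe, block_size)]
        data_iter = iter(blocks)
        for d in range(n_drives):
            if d == parity_position(s, n_drives):
                drives[d].append(xor_blocks(blocks))
            else:
                drives[d].append(next(data_iter))
    return drives


def rebuild_drive(drives, missing):
    """Recompute one missing drive from the remaining drives, stripe by stripe."""
    survivors = [d for i, d in enumerate(drives) if i != missing]
    return [xor_blocks(stripe) for stripe in zip(*survivors)]


def destripe(drives):
    """Read the data blocks back in order, skipping parity ("de-striping")."""
    n = len(drives)
    out = bytearray()
    for s in range(len(drives[0])):
        for d in range(n):
            if d != parity_position(s, n):
                out += drives[d][s]
    return bytes(out)


original = b"family photos and videos"
array = stripe_raid5(original, n_drives=4, block_size=4)
array[2] = rebuild_drive(array, missing=2)  # one lost drive is recoverable
print(destripe(array) == original)
```

With two drives gone, each stripe has two unknown blocks and only one XOR equation, so there is nothing left to solve for. That is the point at which the array collapses.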
As part of the drive evaluation and repair steps, Seagate SRS runs an app called Drive Repair and Unlock Tool. Seagate wrote and maintains the title, and many of the company’s engineers use this software to work with and remediate drives. The app tells engineers what is going on with the drive, particularly what is going wrong. “But it’s not always a science,” says Peter Oswald. “Half of this is art and feel.”
The number one thing SRS seeks to achieve is copying data out from the damaged drive onto reliable media. Everything else, including trying to figure out what went wrong in the original hardware, is secondary. Fortunately, the company uses some very sophisticated, proprietary tools designed to read data in ways that regular operating system and drive commands cannot.
“This software was written for Seagate SRS and serves to extract a complete copy of the target drive’s data,” says Peter Oswald. “Essentially, the application lets us handle situations with drives that a normal computer can’t deal with. Green here is good. In what you see on this screen, we’ve physically repaired the drive and reattached it to the system. It’s responding to us in a manner that we expect. And we’re duplicating the data from the source, the clients’ drive, to one of our known working drives so we can move forward in the process.”
This is a view of data at the lowest level possible on a hard drive. Specifically, the application is a hex editor, which allows engineers to view and manipulate user data at the binary level. Peter Oswald notes, “When we look for specifics about your logical recovery, we start looking for certain things, depending on the criteria of the recovery...different structures and data points. This is how we do it, here in hex.” Not only can technicians see the underlying form of the data, they can sometimes repair broken file elements and turn corrupted files back into functional ones.
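To give a rough feel for what looking at data "in hex" means, here is a small Python sketch that prints bytes the way a hex editor lays them out and scans a raw buffer for a known file signature. This is not the SRS application. The function names are made up for illustration, and the only file format fact relied on is that JPEG files begin with the bytes FF D8 FF.

```python
def hex_dump(data, width=16):
    """Format bytes as offset, hex values, and printable characters."""
    pad = width * 3 - 1  # width of a full row of "xx " groups
    lines = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in row)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


JPEG_MAGIC = b"\xff\xd8\xff"  # the first three bytes of a JPEG file


def find_signatures(data, magic):
    """Return every offset at which the byte pattern `magic` appears."""
    offsets = []
    pos = data.find(magic)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(magic, pos + 1)
    return offsets


# A made-up chunk of "raw disk" with two JPEG headers buried in it.
raw = bytes(5) + JPEG_MAGIC + b"JFIF" + bytes(4) + JPEG_MAGIC + b"Exif"
print(hex_dump(raw))
print(find_signatures(raw, JPEG_MAGIC))
```

Scanning for signatures like this is only a starting point. Piecing fragmented files back together takes far more knowledge of file structures than a simple pattern search.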
SRS accepts practically any sort of storage device for recovery, including SSDs and flash drives. The first challenge with recovering a dead flash drive is to figure out which components on the device are failing. Here, the technician is looking for a signal to pass through from the processor to the flash component under test. The idea is to confirm that commands are flowing and being processed. In this way, poring over flash chips, transistors, and so on, engineers can get to the root cause of the failure issue.
While the storage medium may change, the recovery process between flash and magnetic disk devices is fairly similar. When faulty components are identified, they must be replaced. While SRS staff didn’t have the backstory of this drive on hand during our photo shoot, it was clear that the little unit had taken quite a beating. Several components needed replacing. With some careful soldering work, skilled hands can get a busted drive working again in fairly short order.
“The most common thing we’ll see with flash drives in particular is people will have it in their computer, working with files on it, and then their arm will come down on it or whatever and break off the drive,” says Peter Oswald. “But we see drives stepped on, chewed on, dropped in water, run through the wash; everything. The portability is what creates more opportunity for damage.”
In this image, a desoldered flash memory module is being removed from the busted thumb drive.
Given that SRS assists in data recovery for some of the world’s biggest companies, it’s not surprising that security is a paramount concern throughout the organization. All lab entrances are locked with biometric fingerprint scanners, and access is limited to only essential personnel. Truth be told, I’ve been trying to land a photo story inside of SRS for several years (even in the days before Seagate’s acquisition), but the company has always been reluctant to allow media inside its operations, in part because of security concerns. Fortunately, the stars aligned (the Chicago facility was in the process of moving, plus Seagate was comfortable with trusted photographer Peter Panayiotou) and we were able to move forward with this article.
Once recovery is complete, customer data gets returned on external media. I had about 3 TB of data on my NAS, and Seagate returned it all to me on a 4 TB external USB 3.0 drive. The verdict was that only six of my MP3 files could not be recovered. Compared to losing all of my family photos and videos? Yeah, I can live with replacing six songs.
And what of my old drives? As a matter of policy, SRS holds on to the original drives for about a month. This gives the user time to get the rescued data plugged in, backed up, confirmed, and so on. Because accidents can happen. Shipments get lost or dropped. Another disaster could strike before everything gets finalized and copied. So as part of its service, Seagate keeps the original media as well as its rescued “good media” image for a few weeks…just to be safe.
The only time customer data ever leaves SRS is on that lone shipment back to the owner. As a matter of course, Seagate does reuse its in-house drives for additional customer jobs (until they show any signs of wearing out), but every one of these drives finishes its rotation with a spin through a thorough wiping routine, as pictured here. “Wiping” refers to a physical overwriting of every sector on the media, unlike “deleting,” which leaves most bits untouched. There are military-class specifications for drive wiping, and many organizations will run three or seven wiping passes to be ultra-safe. Seagate wouldn’t disclose its exact procedure in this case, but with so much riding on its reputation with corporate clients, we’re confident that the wiping process is thorough and effective.
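The difference between deleting and wiping is easy to see in a toy model. The Python sketch below simulates a disk as a block of bytes with a tiny directory on top. It is an illustration only, not Seagate's undisclosed procedure, and the class name, sector size, and zero-fill pattern are assumptions. Real wiping tools talk to the device directly rather than to an in-memory buffer.

```python
SECTOR_SIZE = 512  # assumed sector size for this toy model


class ToyDisk:
    """An in-memory stand-in for a drive: raw sectors plus a file directory."""

    def __init__(self, n_sectors):
        self.sectors = bytearray(n_sectors * SECTOR_SIZE)
        self.directory = {}  # file name -> (first sector, sector count)
        self.next_free = 0

    def write_file(self, name, payload):
        count = -(-len(payload) // SECTOR_SIZE)  # round up to whole sectors
        start = self.next_free * SECTOR_SIZE
        if start + len(payload) > len(self.sectors):
            raise ValueError("toy disk is full")
        self.sectors[start:start + len(payload)] = payload
        self.directory[name] = (self.next_free, count)
        self.next_free += count

    def delete(self, name):
        """Deleting only forgets where the file lives; the bytes stay put."""
        del self.directory[name]

    def wipe(self, passes=1):
        """Wiping overwrites every sector, here with zeros on each pass."""
        for _ in range(passes):
            for i in range(0, len(self.sectors), SECTOR_SIZE):
                self.sectors[i:i + SECTOR_SIZE] = bytes(SECTOR_SIZE)
        self.directory.clear()
        self.next_free = 0


disk = ToyDisk(n_sectors=8)
disk.write_file("photo.jpg", b"precious bits")
disk.delete("photo.jpg")
print(b"precious bits" in disk.sectors)  # the data survives a delete
disk.wipe(passes=3)
print(b"precious bits" in disk.sectors)  # and is gone after a wipe
```

This is also why recovery software can so often bring back "deleted" files: until something overwrites those sectors, the contents are still sitting on the media.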
Seagate’s wiped drives go back into internal circulation. Customer-wiped drives take one more step down the process line…into an industrial shredding machine.
Twenty-eight man-hours for a NAS recovery does not come cheap. During my initial call to SRS, I was quoted a range from $3,000 to $20,000 for the four-drive recovery, depending on the nature and severity of the damage. Yes, I went through that initial jaw-dropping moment of financial terror. But would you max out a credit card to save your family history? Well, you may not have to. As mentioned earlier, the majority of recoveries can be done simply with software, and Seagate now sells its File Recovery Software for $99. You can try before you buy. If you don’t feel comfortable taking the DIY route, you can let an SRS tech remote into your system, then install and run the software on your behalf for $199. In-lab recovery starts at $399, which includes inbound and return shipping as well as the initial pre-analysis. There are cheaper recovery services, and you’ll probably find local recovery companies in your area. In general, these will be working on logical recoveries. Many such local firms contract out to SRS for tougher jobs involving physical damage.
While I'm immensely glad and grateful to have my 3 TB of data back, I'd prefer not to go through this again. To that end, I’m now reviewing a new NAS, as well as backing up everything to CrashPlan. So I will have all of my data replicated from my internal drives to the NAS through a backup app (SyncToy for now) as well as replicating out to the cloud. A cloud backup of 3 TB will literally take months. Even now, about five weeks into the process, I haven't even cleared the first terabyte yet. But with the NAS in place and my most critical files in the first 700 GB, I feel content and covered.
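The "months" figure is simple arithmetic. As a back-of-the-envelope sketch in Python, assume decimal terabytes and treat "not yet one terabyte in five weeks" as an upper bound on my sustained upload rate. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal terabyte, an assumption about the units


def upload_days(total_bytes, bytes_per_second):
    """Days needed to push total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


# Best case implied by the text: a full terabyte in five weeks.
elapsed_seconds = 5 * 7 * SECONDS_PER_DAY
implied_rate = 1 * TB / elapsed_seconds  # bytes per second

print(f"Implied upload rate: {implied_rate * 8 / 1e6:.2f} Mbit/s")
print(f"Time for 3 TB: {upload_days(3 * TB, implied_rate) / 7:.0f} weeks")
```

At that pace, the implied rate works out to about 2.65 Mbit/s and the full 3 TB to about 15 weeks. Since I haven't actually reached the first terabyte, the real figure will be longer.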
For those who don't want the hassle of these steps, you might want to investigate subscribing to a service like Seagate's Rescue and Replace. In a nutshell, you can insure (my word, not Seagate's) a drive for two ($40), three ($50), or four ($60) years. In the case of just about any drive problem, you can send your drive through SRS's recovery process and get a replacement drive in return. If you're content to download the data from Seagate and don't need a new drive, deduct $10 from those numbers. The service is tied to the drive's serial number and so is not transferable. Seagate obviously knows what the annual failure rates are on drives and will no doubt profit well from the service. But $50 for three years and you never have to worry about the drive failing and losing its contents? Under normal circumstances, that might seem steep. But every time I think about that initial price quote for recovery, well...$50 sounds pretty cheap.
This isn't a commercial. Seagate helped to bail me out of a horrific situation and did a stellar job at it. But I heartily encourage you to shop around with various storage services and solutions, and I hope you never need to send a drive off for recovery. If you do, now you'll know what will happen to your precious drive once it leaves your hands and perhaps sleep a little easier through the nightmare.