How Seagate Tests Its Hard Drives

Into The Labs

Let’s take a look through Seagate’s product assurance lab and see how some of this work actually gets done.

First up, we run into a pair of temperature-humidity test chambers. These units serve as something of a Swiss Army knife for drive testing, able to dial temperature up to 100 ˚C (212 ˚F) and down to a frosty -50 ˚C (-58 ˚F). They also provide relative humidity control and independent 5V and 12V supply voltages. Testing here goes beyond the temperature, humidity, and power functional specifications of the HDD. Each drive slot within the test chamber features its own single-board computer that ties back to a server system running both scripts and application software developed in Seagate’s reliability department. This gives Seagate engineers full control over all test parameters, remote monitoring and management, and complete logging capabilities.

Seagate’s hardware architecture ensures that a failure on any given slot won’t interrupt the others being tested. All available data from the test systems helps to assess the drive state, including drive parametrics, upstream parametrics and drive failure analysis. Every failure is thoroughly investigated. When needed, a drive will get written up and sent off to the failure analysis lab, which then figures out the root cause of failure. Firmware issues get routed back to the FW team for investigation. Drives with mechanical or system failures get torn down and investigated until the problem is understood.

When quantities start to increase, engineers step up to higher-capacity burn-in test chambers, each capable of holding over 100 drives. These units serve only to run scripts at elevated temperatures.

A second type of testing executes in high-capacity temperature chambers. These chambers carefully manage airflow characteristics to maintain consistent drive temperatures. Seagate runs a lot of these machines, and drives are kept spinning around the clock. A third type of testing runs in hot rooms where the room temperature, rather than the drive temperature, is controlled. This test approach provides both higher drive density and conditions closer to those of a large data center.

Taking many steps back, here’s another vantage on the main reliability testing room. The view here doesn’t do justice to the physical impression made by being in the midst of rank upon rank of these beasts, all chugging quietly away in the pursuit of creating drive failures.

Throughout the lab, we observed drives being tested at different temperatures. Often, drives are tested at, or above, the product temperature specification. The purpose here is twofold. First, engineers need to make sure that drives will perform reliably at these temperatures, as they may well encounter them in environments such as data centers or deserts. Second, elevating temperature serves to add stress to the device and thus accelerate expected failure times. Engineers don’t have five years to wait around and tally up drive failures at room temperature, but they can make educated guesses at this number based on many years of heightened temperature testing and subsequent analysis.
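To make the idea of temperature acceleration concrete, here is a minimal sketch using the Arrhenius model, a common textbook choice for thermally driven failure mechanisms. The article does not say which model Seagate uses, so the model itself, the function names, and the activation energy in the usage example are all assumptions chosen purely for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(use_temp_c, stress_temp_c, activation_energy_ev):
    """Acceleration factor of a stress temperature relative to a use temperature.

    Assumes failures follow the Arrhenius model; the activation energy is an
    input the caller must supply, not a published Seagate figure.
    """
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)


def equivalent_use_hours(stress_hours, acceleration_factor):
    """Hours at use temperature represented by a given time under stress."""
    return stress_hours * acceleration_factor


if __name__ == "__main__":
    # Hypothetical numbers: 25 C use, 60 C chamber, 0.5 eV activation energy.
    factor = arrhenius_acceleration(25.0, 60.0, 0.5)
    print(f"acceleration factor: {factor:.1f}")
    print(f"1,000 chamber hours ~ {equivalent_use_hours(1000, factor):,.0f} use hours")
```

The factor is only as good as the activation energy fed into it, which is why calibrating the temperature/reliability curves against long-term results matters so much.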

Could a company fudge these methods and cheat its test algorithms? Sure. Would it be short-sighted and self-damaging when those drives started failing prematurely? Absolutely, so Seagate makes every possible effort to calibrate its temperature/reliability result curves and ensure their long-term accuracy.

Of course, sometimes all you need is to test drives at normal room-temperature conditions. These racks serve primarily to run scripts on SAS and SATA drives as well as perform load/unload tests.

Drive failures can also be accelerated through altitude testing. Seagate’s altitude chambers can create air pressure conditions ranging from roughly 200 feet below sea level to 40,000 feet (for non-operating tests). This is critical, because heads fly so microscopically close to disk media that even tiny changes in pressure within the drive affect the “air bearing” between heads and disks. This could yield either an inability to read/write data or, even worse, contact between the heads and platters, resulting in potentially catastrophic data loss. These chambers can also adjust temperature to correspond with air pressure, such as the type of chilly environment you might expect at the top of Wyoming’s Grand Tetons.
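For a rough sense of what that altitude range means in pressure terms, the sketch below converts altitude to pressure using the International Standard Atmosphere model. This is an assumption for illustration only; the article does not describe how the chambers map altitude to pressure, and the function names are hypothetical.

```python
import math

# International Standard Atmosphere constants (assumed model, not Seagate data)
P0_PA = 101325.0          # sea-level pressure, Pa
T0_K = 288.15             # sea-level temperature, K
LAPSE_K_PER_M = 0.0065    # temperature lapse rate in the troposphere, K/m
TROPOPAUSE_M = 11000.0    # top of the troposphere, m
G = 9.80665               # gravitational acceleration, m/s^2
M_AIR = 0.0289644         # molar mass of dry air, kg/mol
R_GAS = 8.3144598         # universal gas constant, J/(mol*K)


def feet_to_meters(feet):
    return feet * 0.3048


def standard_pressure_pa(altitude_m):
    """Standard-atmosphere pressure; valid from below sea level to about 20 km."""
    if altitude_m <= TROPOPAUSE_M:
        # Troposphere: temperature falls linearly with altitude.
        base = 1.0 - LAPSE_K_PER_M * altitude_m / T0_K
        return P0_PA * base ** (G * M_AIR / (R_GAS * LAPSE_K_PER_M))
    # Lower stratosphere: constant temperature, exponential pressure decay.
    p_tropopause = standard_pressure_pa(TROPOPAUSE_M)
    t_tropopause = T0_K - LAPSE_K_PER_M * TROPOPAUSE_M
    return p_tropopause * math.exp(
        -G * M_AIR * (altitude_m - TROPOPAUSE_M) / (R_GAS * t_tropopause)
    )


if __name__ == "__main__":
    for feet in (-200, 0, 10000, 40000):
        kpa = standard_pressure_pa(feet_to_meters(feet)) / 1000.0
        print(f"{feet:>6} ft -> {kpa:.1f} kPa")
```

Under this model, air at 40,000 feet is at less than a fifth of sea-level pressure, which gives some intuition for why the air bearing behaves so differently at altitude.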

Continuing through the lab, we find additional testers running non-operational tests. Non-operational test conditions are often more stringent than operational ones because of what drives can face in transit. For example, a pallet of drives might get left on the airport tarmac in China or Ecuador for hours at a time, where heat and humidity could be extreme. This is why Seagate places drives in chambers like these and lets them soak at high humidity and temperatures intolerable to humans for extended periods. Samples may occasionally be pulled during that time, but most will sit out the full duration. The drives then get shuttled to the chemical lab for tear-down and exhaustive examination, as we’ll soon see.

Naturally, someone needs to monitor all of that testing equipment. A wall display shows all of the test chambers in the Longmont lab. In the event of an issue, the board will display various codes and color-based alerts. It also notifies technicians when chamber maintenance and calibration are due. We watched as one worker switched views to show a similar dashboard for all of the chambers running in the Thailand facility.

“If a compressor goes bad and it’s shaking the chamber, this is going to set off a trigger,” explained one engineer. “The drives themselves are telling this dashboard, ‘Hey, I’m seeing a lot of NRRO (non-repeatable run-out).’ The drive itself is saying, ‘There’s a lot of vibration energy. What’s going on? Did we lose a compressor? Are the fans out of balance or something?’ We have those sorts of measurements. The chamber is set at a certain temperature, and if it can’t keep that temperature for whatever reason, it will trigger a notice to say, ‘Hey, I can’t stay at 0 ˚C. Maybe the compressor’s going out, and I can’t cool it well enough.’ Again, this chamber data is all separate from the drive data we’re pulling. It’s pretty amazing what we can monitor.”
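The two kinds of trigger the engineer describes, one from the chamber and one from the drives inside it, can be sketched roughly as follows. This is not Seagate's monitoring software; the class, field names, thresholds, and units are all hypothetical, chosen only to show how chamber data and drive data might be checked separately.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChamberReading:
    """One polling sample for a chamber (all fields are hypothetical)."""
    chamber_id: str
    setpoint_c: float        # temperature the chamber is asked to hold
    measured_c: float        # temperature the chamber actually reports
    drive_nrro: List[float]  # NRRO reported by each drive, arbitrary units


def chamber_alerts(reading, temp_tolerance_c=2.0, nrro_limit=1.0,
                   min_drive_fraction=0.5):
    """Return a list of alert strings for one chamber reading."""
    alerts = []

    # Chamber-side check: is the chamber holding its setpoint?
    drift = abs(reading.measured_c - reading.setpoint_c)
    if drift > temp_tolerance_c:
        alerts.append(
            f"{reading.chamber_id}: temperature off setpoint by {drift:.1f} C"
        )

    # Drive-side check: many drives seeing high NRRO at once points at the
    # chamber (compressor, fans) rather than at any single drive.
    if reading.drive_nrro:
        high = sum(1 for value in reading.drive_nrro if value > nrro_limit)
        if high / len(reading.drive_nrro) >= min_drive_fraction:
            alerts.append(
                f"{reading.chamber_id}: high NRRO on {high} of "
                f"{len(reading.drive_nrro)} drives, check compressor and fans"
            )

    return alerts


if __name__ == "__main__":
    sample = ChamberReading("chamber-07", 0.0, 4.2, [1.4, 1.6, 0.3, 1.2])
    for alert in chamber_alerts(sample):
        print(alert)
```

Requiring a fraction of drives to agree before raising the vibration alert is one simple way to tell a chamber problem from a single misbehaving drive.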



  • tom10167
    Awesome photos. I don't know what the last picture is but I know I need one of those in my house.
  • Rookie_MIB
    tom10167 wrote: "Awesome photos. I don't know what the last picture is but I know I need one of those in my house."

    That is an enterprise storage rack full of 2U hot-swap chassis. 18 chassis, 12 drives per chassis = 216 drives @ 6 TB (?) per drive = 1,296 terabytes or 1.3 petabytes.

    You could store a lot of TV shows or movies on that thing. Imagine how many of those are used for YouTube? Yikes. They get 300 hours of footage uploaded every minute.
  • Steelbridge
    https://www.backblaze.com/blog/hard-drive-reliability-stats-for-q2-2015/
  • Mike-TH
    So if their testing is so good, why are their drives among the worst for reliability - to the point where most IT people I know actually refuse to use them, or if forced to use them will keep (and use) more spares than for other makers.
  • Tom20160027
    The article explains the different types of drive/MTBF and why the Backblaze test is useless information. It's a marketing ploy to have folks talking about it and re-posting its link. It seems to work, as we keep seeing the link over and over... They are not getting my data. They put drives designed for desktops into servers, run them into the ground, and call it a "reliability test". Let's test my kid's bicycle with training wheels at the Tour de France and complain about its quality...

    I know IT folks that refuse to use other brands of drives as well. I know IT folks that refuse to use servers from this brand or that brand. We can find anecdotal information about anything. It does not make it true.
  • Glock24
    Seagate tests their drives? I thought they didn't!

    I've had more Seagate drives die without warning than any other brand. The only ones that have survived are some old 250GB Barracuda ES. All other models I've owned had lots of bad sectors or just stopped working before the first year, but SMART almost always says the drive is fine!
  • zodiacfml
    Yawn. All I think of right now is that HDDs will become the tape drives of the past.
  • Garrek99
    The only drives I've ever had go bad on me were Seagate drives.
    Every other drive I've ever purchased simply became obsolete due to size and thus replaced.
    They should be reading about how the other drive makers do their testing and learn from that. Hahaha
  • rosen380
    Maybe things changed... but all of my old SGI machines always had Seagate drives in them and the 20+ year old drives all still work. Hell look at what these drives *sell* for on eBay:
    http://www.ebay.com/sch/i.html?_sc=1&_udlo=0&_fln=1&_udhi=200&LH_Complete=1&_ssov=1&_mPrRngCbx=1&LH_Sold=1&_from=R40&_sacat=0&_nkw=%28st31200N%2C+st32171N%2C+st32272N%2C+ST34371N%2C+st34520N%2C+st34573n%2C+st39173N%2C+st318417N%2C+st52160N%29&_sop=16


    4.5 GB drives *selling* for $150+. I see a 2 GB for $120.

    They must have been pretty decent at some point if SGI was putting them in their $5000-20000 workstations and people are spending $40+ per GB to get these now...


  • rosen380
    Link was too long... http://tinyurl.com/gntz4p2