RAID 5 May Be Doomed in 2009
A story appearing online is forecasting the doom of RAID 5 in 2009. With the capacity of modern SATA hard drives now reaching 2 terabytes, a read error during a RAID 5 disk reconstruction is apparently becoming all but unavoidable.
According to ZDNet, SATA drives often have an unrecoverable read error (URE) rate of 1 in 10^14, which implies that a drive will fail to read a sector about once every 100,000,000,000,000 bits read. With hard drive capacities expected to reach 2 terabytes in 2009, the article argues that a read error becomes practically unavoidable when recovering from a disk failure in a 7-drive RAID 5 array, because the six surviving drives hold about 12 terabytes (roughly 10^14 bits) that must all be read. Upon encountering such a read error during reconstruction, it is claimed that the array volume will be declared unreadable and the recovery process will be halted. Apparently all 12 terabytes of data stored on the drives will be lost... or at least will require some extra effort and knowledge to recover.
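The arithmetic behind that claim can be sketched in a few lines of Python. This is only a rough model, not ZDNet's own calculation: it assumes every bit fails independently at the quoted rate of 1 in 10^14, that a terabyte is 10^12 bytes, and that all six surviving drives are read in full. The function name and parameters are made up for illustration.

```python
import math

def prob_at_least_one_ure(bytes_read, ure_rate_bits=1e14):
    """Chance of hitting at least one unrecoverable read error while
    reading bytes_read bytes, assuming independent errors at a rate of
    one per ure_rate_bits bits (an assumption of this sketch)."""
    bits = bytes_read * 8
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny p does not round away
    return -math.expm1(bits * math.log1p(-1.0 / ure_rate_bits))

SURVIVING_DRIVES = 6   # 7-drive RAID 5 with one drive failed
DRIVE_BYTES = 2e12     # 2 TB per drive, decimal terabytes

p = prob_at_least_one_ure(SURVIVING_DRIVES * DRIVE_BYTES)
print(f"Chance of a URE during the rebuild: {p:.0%}")
```

Under these assumptions the chance comes out a little above 60%, which is high but short of certain. Real drives may do better or worse than the spec sheet figure.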
RAID 5 is described as a striped set with distributed parity, which protects against a single disk failure. When a drive fails in a RAID 5 set, the failed drive can be replaced, the data can be rebuilt from the distributed parity and the array can eventually be restored. If more than one drive fails, however, the array will lose data. For some, this can make the reconstruction process after a single drive failure a stressful event, as the array will be vulnerable to further drive failures during that time.
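How a lost drive is rebuilt from parity can be shown with a toy example. The sketch below handles a single stripe with XOR parity, which is the usual basis of RAID 5. It ignores how a real controller rotates the parity block across drives, and the helper name and sample data are invented for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread over three data drives (toy 4-byte blocks).
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity_block = xor_blocks(data_blocks)

# Simulate losing the second drive: only the others plus parity remain.
survivors = [data_blocks[0], data_blocks[2], parity_block]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

The rebuild has to read the matching block from every surviving drive, which is why a single unreadable sector on any of them can stall the reconstruction.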
While RAID 6, which tolerates two drive failures instead of just one, may seem like a solution, the increased redundancy may not be cost effective. Also, as hard drive capacities continue to increase exponentially, year after year, even RAID 6 may soon become prone to the same problems. When single disk drives reach 12 terabytes in size, even a direct drive-to-drive copy may commonly encounter these read errors. Using disk drives with smaller capacities and better unrecoverable read error rates could be one way to avoid these potential headaches.
The problem comes from the increasingly tight data density packed onto drive platters. With traditional (longitudinal) recording, the magnetic polarity of one bit can leak onto adjacent bits, flipping an otherwise normal bit. Manufacturers have switched to perpendicular recording methods to avoid such problems and increase density, but even this method has its physical limits. Manufacturers will have to find more creative solutions down the road if drives are going to exceed 2TB in size.
Surely we are nowhere near that, right?
i agree. i remember when i got my dell p4 system in early 2004. it came with a 120GB hard drive which was fairly impressive for its time, and now we're already up to 1.5TB, more than ten times the capacity in only 4 years. getting up to 12TB may happen within the next couple of years.
i.e. If one 12TB drive fails in RAID5, it's POSSIBLE (or likely) a reconstruction can fail. If one 12TB drive fails with no RAID, all data is 100% gone for sure.
Where can I find more info on this?
Thanks!
The point of this article was that the increase in drive size will increase the chance of read errors, thus increasing the chance of an error happening while you are rebuilding your array. According to this article, if there is a read error during the reconstruction then the whole array will be lost. If you'll remember from earlier, the read error will be more likely due to the size hard drives will be in the near future. So, in conclusion, your experience with hard drives in the present has no relevance to the issues of larger hard drives of the future, which is the subject of this article.
I couldn't give you an explanation that is 100% technically sound on this topic (or a terribly elegant one), at least compared to some hardware junkies...but I'll take a stab at it.
Consider that with one drive you are dealing with the probability of one drive failing. That is, the probability that one drive is defective, or has an off 'moment' inside, or is just too old, or whatever. This is the chance that the one drive will 'become another statistic', as I'll call it.
The simple explanation is that the more drives you have, the greater the chance that one of them fails. So instead of having 1 chance for one drive to fail...you have 5 chances for one drive to fail. The more drives you add, the higher the probability that one of them will 'become another statistic.'
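To put rough numbers on that idea, here is a minimal sketch. It assumes drives fail independently of each other, and the 3% per-drive figure is a made-up placeholder, not a measured failure rate.

```python
def prob_any_drive_fails(n_drives, p_single):
    """Chance that at least one of n_drives fails, assuming each drive
    fails independently with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_drives

P_SINGLE = 0.03  # hypothetical per-drive failure probability over some period

for n in (1, 5, 10):
    print(n, "drives:", round(prob_any_drive_fails(n, P_SINGLE), 3))
```

With that placeholder, one drive gives a 3% chance and five drives give about 14%, so the risk grows a bit slower than "5 chances" suggests but still climbs with every drive added.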
The problem enters when you consider that (after a drive failure) rebuilding an array of a certain size takes a quantifiable amount of time and number of disk operations. I have not personally done the math, but the idea is that each drive may (for argument's sake) have a 1 in 10,000 operations error rate, and rebuilding an array of the specified size or disk count may require 10,001 disk operations from EACH disk; and this is assuming that each drive operates within its specified tolerances. Another drive really could fail altogether.
The basic point is that the numbers can catch up to you
!go Linux!
It still seems like RAID6 is the better choice over no RAID. (the title of the article could be clearer)
Let's hope the rebuilding time also decreases with capacity increases.
It's a bit like announcing the end of the world because X is going to happen, yet not giving any practical solution for how X can be solved. The negative tone of the whole article is just like saying to us that we're doomed to lose all our data every once in a while and that, oh, it's life... I'm sure there is already research being conducted to circumvent such an issue, I'm sure technologies will show up in due time, and I'm pretty darn sure the drives will get larger and larger without stopping. There have always been seemingly insurmountable issues, yet here we are today, through all those impossible obstacles and still living to talk about it!
Error correction routines must become more robust otherwise all data will be corrupted, by their logic.
CRC errors happen constantly with I/O, you never notice them, or maybe you do..that weird hiccup or glitch. RAM has ECC as well, so does your CPU.
Our favorite hobby is awash in bit flipping and transients, if it wasn't for error correction algorithms, absolutely nothing electrical would work...hell all MP3s are just one big approximation of what you might be listening to.
I don't see why this is all so melodramatic. Just increase the parity information with each stripe. Big deal, as if we can't afford it.
Smaller drives are better.
The more drives you have the higher the likelihood that you have multiple failures at the same time.
RAID 6 is expensive and requires hardware that is expensive. If you've ever bought a server you will understand. One of our PowerEdge servers cost over 4 grand and that just had SATA drives. Throw in 15k SAS drives and you're looking at 7-8 grand. Now you need Windows Server...
This is old news anyway. Companies have and use redundant systems now. Companies don't rely on RAID all that much. A lot of companies also invest in imaging software like Acronis Echo or Symantec Ghost as well. As soon as your company experiences multiple drive failures in a RAID 5 they will switch to redundant systems faster than you can ship the server. You'll be buying the server that same day.
Exactly. Keep each RAID-5 volume under the size where read errors are statistically likely, and it's a non-issue. Assuming the scenario being suggested does occur, then it will simply become an industry standard that all RAID-5 arrays are built under the safe size limit. Either the drive designers will figure out how to resolve the problem, or another technology will replace hard drives, but no corporation or responsible individual is going to risk data corruption in exchange for capacity. Especially when media distribution is becoming increasingly an online enterprise...does anyone think that you'll be able to re-download your multi-terabyte movie collection for free (legitimately, I mean) when your array craps out?
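A "safe size limit" like that can be estimated by turning the read error math around. The sketch below assumes independent bit errors at the 1 in 10^14 rate quoted in the article, and the 10% risk threshold is an arbitrary choice for illustration, not an industry figure.

```python
import math

def max_bytes_for_risk(target_prob, ure_rate_bits=1e14):
    """Largest amount of data (in bytes) that can be read while keeping
    the chance of at least one unrecoverable read error at or below
    target_prob, assuming independent errors at one per ure_rate_bits."""
    bits = math.log1p(-target_prob) / math.log1p(-1.0 / ure_rate_bits)
    return bits / 8

limit = max_bytes_for_risk(0.10)  # accept a 10% chance of a URE (assumed threshold)
print(f"{limit / 1e12:.2f} TB may be read during a rebuild")
```

Note that the limit applies to the data read from the surviving drives during a rebuild, not to the usable capacity of the volume. At a 10% threshold it works out to roughly 1.3 TB.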
Also, let's not forget that RAID-5 is not a backup solution. It's a high-availability solution.
It's a little bit more complicated as it requires planning of how much space this particular section will need. If you use roaming profiles will you need 1TB or 2?
The whole idea behind RAID is that it is an inexpensive/independent array of disks that is safer or faster than JBOD, so with 3 drives in a RAID 5 i have safety and speed, but the safety implies hours of recovery time, during which not a single bit on the entire array can go wrong, and the speed is just for reads, writes on RAID 5 take longer. with just one more disk i can make a RAID 10 solution, for the best of both worlds, or i can go cheap and use the very same 3 disks for a RAID 0 + external daily backup, an elegant way to have lightning fast performance and fast system recovery.
In a business setting, the cost of taking too long to recover the array far exceeds the price gap between RAID 5 and RAID 10/RAID 6, and with the increased likelihood that the array rebuild will be unsuccessful, this cost gap will be reversed.
Let's all say no to RAID 5!!
I then quickly got the external drive to be double sure the data is safe.
This works well for me for home-use.
2 cents worth..