Comments on RAID: Combining I/O Requests
enlightenment said: When implementing RAID5, processing power is not the only variable when considering throughput performance. In order to write sequentially at high performance, you need to combine I/O requests to form full stripe writes (so-called 1-phase writes). Most (cheap) implementations do not do this, and do virtually all writes in 2 phases, which involves reading data in order to calculate parity data. This means a minimum of 5 physical I/O requests for EACH logical I/O request and a lot of head seeking by the physical drives. The result is that the disks themselves become bottlenecks due to higher-than-needed I/O activity, non-contiguous writing and even reading in between.
Real controllers using an I/O processor do not suffer from this, and some intelligent software implementations like geom_raid5 for FreeBSD/FreeNAS are also capable of request combining. Without it, do not expect much throughput performance. Also, the many seeks can cause wear on the disks, potentially reducing their lifespan.
This is a quote from Enlightenment in another thread. I wanted to comment on it, but that would take the original thread off topic, so I started a new one.
Enlightenment is making a very important point here, one that highlights a big difference between RAID implementations. Write combining is very important for RAID 5 write performance, and only certain implementations (mostly high-end, dedicated, enterprise-class RAID controller cards) can do it. What is write combining, and why is it important? That's what I want to talk about and highlight in this thread.
Let's look at a 4-drive RAID-5 with a 64K stripe size.
|-- Disk A 64K --|-- Disk B 64K --|-- Disk C 64K --|-- Disk D 64K --|
Let's assume that 2 files are stored in this area of the array, one 64K file and one 128K file. The array looks like this:
|-- Disk A 64K --|-- Disk B 64K --|-- Disk C 64K --|-- Disk D 64K --|
|-- File 1 64K --|---------- File 2 128K ----------|-- Parity 64K --|
Now, let's look at what has to happen if the operating system first needs to write to file #1, and then a few hundred milliseconds later, needs to write to file #2.
First, the OS issues a write to file #1. It might be the entire 64K file, or might be just one cluster within it (assuming 4K NTFS clusters). Either way, the RAID controller must write not only the changed cluster(s) onto disk A, but also must update the parity block on disk D to maintain the redundancy of the stripe.
There are two ways for the controller to update the parity block. Let's assume here that the OS has sent an entire new 64K block for file #1 to the controller to be written. The controller now has this new 64K block of data in memory. The controller would then read the 64K block from disk B and XOR it with the new 64K block in memory, storing this intermediate resulting 64K block also in memory. The controller would then read the 64K block from disk C and XOR it with the intermediate XOR block, which completes the parity calculation. The controller now needs to write the new computed 64K parity block to disk D, and write the original new 64K block that the OS sent for writing to disk A.
Graphically, this looks like:
    From OS
       |
       v
/--------------\     /--------------\     /--------------\   /--------------\
|- New Block  -| XOR |- Disk B 64K -| XOR |- Disk C 64K -| = |- New Parity -|
\--------------/     \--------------/     \--------------/   \--------------/
       |                    ^                    ^                  |
       v                    |                    |                  v
   To Disk A           From Disk B          From Disk C         To Disk D
The single file write operation results in 4 disk I/O operations, two reads (disks B & C), and two writes (disk A and D). It also sends an I/O operation to every disk in the array.
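As a rough illustration of this first method, here is a minimal Python sketch. The helper names (`xor_blocks`, `parity_by_reconstruction`) and the random block contents are made up for the example; a real controller does this in firmware or dedicated hardware, not like this.

```python
import os

BLOCK = 64 * 1024  # 64K block, matching the example array


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    assert len(a) == len(b)
    value = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return value.to_bytes(len(a), "big")


def parity_by_reconstruction(new_a: bytes, disk_b: bytes, disk_c: bytes) -> bytes:
    """New parity = new block XOR disk B XOR disk C (two XOR passes)."""
    intermediate = xor_blocks(new_a, disk_b)  # held in controller memory
    return xor_blocks(intermediate, disk_c)


# Hypothetical contents: the new block from the OS, plus what was read from B and C.
new_a, disk_b, disk_c = (os.urandom(BLOCK) for _ in range(3))
new_parity = parity_by_reconstruction(new_a, disk_b, disk_c)

# The point of the parity block: any one data block can be rebuilt from the rest.
rebuilt_b = xor_blocks(xor_blocks(new_a, disk_c), new_parity)
print(rebuilt_b == disk_b)
```

The last two lines show why the parity has to be kept current: if disk B died, its block could be recovered from A, C, and the parity on D.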
There is another way to update the parity block. Because XOR is commutative and is its own inverse (XORing the same block in twice cancels it out), there is a shortcut that can be taken that reduces the number of disks involved in the operation.
                         From OS
                            |
                            v
/--------------\     /--------------\     /--------------\   /--------------\
|- Old Block  -| XOR |- New Block  -| XOR |- Old Parity -| = |- New Parity -|
\--------------/     \--------------/     \--------------/   \--------------/
       ^                    |                    ^                  |
       |                    v                    |                  v
  From Disk A           To Disk A           From Disk D         To Disk D
This method is what most RAID controllers actually do when only one block of the stripe is getting updated, and though there are still 4 disk I/O operations (2 reads, 2 writes), only 2 disks of the array are involved in the write: the disk being updated (disk A), and the disk holding the parity information for this stripe (disk D). Now, disks B and C are free to perform other I/O operations while this one is also going on.
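To convince yourself that the shortcut really gives the same parity as recomputing it from every data block, here is a small Python check. The function names and the random test data are invented for the illustration.

```python
import os

BLOCK = 64 * 1024  # 64K block, matching the example array


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    assert len(a) == len(b)
    value = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return value.to_bytes(len(a), "big")


def parity_by_rmw(old_a: bytes, new_a: bytes, old_parity: bytes) -> bytes:
    """Shortcut: old block XOR new block XOR old parity. Disks B and C are never read."""
    return xor_blocks(xor_blocks(old_a, new_a), old_parity)


# Hypothetical stripe contents before the write, plus the new block from the OS.
old_a, disk_b, disk_c, new_a = (os.urandom(BLOCK) for _ in range(4))
old_parity = xor_blocks(xor_blocks(old_a, disk_b), disk_c)

shortcut = parity_by_rmw(old_a, new_a, old_parity)
full = xor_blocks(xor_blocks(new_a, disk_b), disk_c)  # the first method, for comparison
print(shortcut == full)
```

XORing the old block into the old parity cancels its contribution, leaving B XOR C; XORing in the new block then gives the same result as the first method.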
Now, let's look at everything that has to happen if the OS needs to write the 64K file #1 and then a few hundred milliseconds later needs to write the 128K file #2.
The first write is just like we've illustrated, needing a read from disk A and disk D, and a write to disk A and disk D. So far, 2 reads and 2 writes total, and two 64K block XOR operations.
Now, for updating file #2, this could involve 6 I/O operations involving 3 disks (read from B, read from C, read from D, and then write to B & C for the new file, and write to disk D for the updated parity). Another way to do it would involve only 4 I/O operations but use all 4 disks (read from A, write to B, C, and D). Which way the RAID controller does it is implementation-dependent. Let's assume the RAID controller is smart, and always chooses the method that results in the fewest I/O operations. So for this operation, we have 1 read, 3 writes, and two more 64K block XOR operations.
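The "pick the cheaper method" decision can be sketched as simple arithmetic. This is only a model of the I/O counts for a partial update inside one stripe; the function names are hypothetical and real controllers weigh other factors too (cache contents, which disks are busy, and so on).

```python
def rmw_cost(updated: int) -> tuple:
    """Read-modify-write: read each old block plus old parity, write each new block plus parity."""
    return (updated + 1, updated + 1)  # (reads, writes)


def reconstruct_cost(data_disks: int, updated: int) -> tuple:
    """Recompute parity from scratch: read the untouched blocks, write new blocks plus parity."""
    return (data_disks - updated, updated + 1)  # (reads, writes)


def cheapest(data_disks: int, updated: int) -> tuple:
    """Return (method, reads, writes); ties go to read-modify-write since it touches fewer disks."""
    rmw = rmw_cost(updated)
    rec = reconstruct_cost(data_disks, updated)
    if sum(rmw) <= sum(rec):
        return ("read-modify-write", *rmw)
    return ("reconstruct", *rec)


# 4-drive array = 3 data blocks + 1 parity block per stripe.
print(cheapest(3, 1))  # file #1: ('read-modify-write', 2, 2)
print(cheapest(3, 2))  # file #2: ('reconstruct', 1, 3)
print(cheapest(3, 3))  # whole stripe: ('reconstruct', 0, 4)
```

Note the last case: when every data block in the stripe is being replaced, the read count drops to zero.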
Total for these two file writes: 3 64K reads, 5 64K writes, and 4 64K XOR operations. Plus, the reads and writes had to be sequential, resulting in a 4-step operation:
1. Read old block from disk A and old parity from disk D. Compute new parity in memory.
2. Write new block to disk A and new parity to disk D.
3. Read block from disk A. Compute new parity in memory.
4. Write new blocks to disks B and C, and new parity to disk D.
Now, let's see the benefits of write combining. What if the RAID controller, after receiving the first file write from the OS (disk A, 64K block), didn't actually perform the write operation? What if it held it in cache and delayed the write?
A few hundred milliseconds later, when the controller gets the file #2 write that will update the 64K blocks on disks B and C, the controller gets to take a massive shortcut. None of the data that currently exists on the disks in this stripe is needed anymore, because all blocks need to be written: there is a new A, B, and C, and an updated parity block on D. The controller now just performs two 64K block XOR operations, and 4 writes, one to each disk.
Total using write combining: 0 64K reads, 4 64K writes, and 2 64K XOR operations. And all 4 writes can be done simultaneously, because there is no sequential dependency.
In this example, write combining turns 8 disk I/O operations spread over 4 sequential steps into 4 writes issued in a single step. By intelligently waiting for the opportunity to combine writes together, and so avoiding reading existing data off the array to recalculate parity, a RAID controller can potentially achieve up to about 4 times the RAID-5 write performance.
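Here is a toy Python model of the idea, just to make the counting concrete. The `CombiningStripeCache` class is entirely hypothetical: it only handles the happy path where the whole stripe fills up in cache, and it ignores what a real controller must also do (flush partial stripes on a timer, fall back to the read-based methods, protect the cache with a battery).

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    value = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return value.to_bytes(len(a), "big")


class CombiningStripeCache:
    """Holds writes for one stripe in cache until every data block has been replaced."""

    def __init__(self, data_disks: int):
        self.data_disks = data_disks
        self.pending = {}   # block index -> new data waiting in cache
        self.on_disk = {}   # simulated disk contents; key "P" is the parity block
        self.disk_reads = 0
        self.disk_writes = 0
        self.xor_ops = 0

    def write(self, index: int, block: bytes) -> None:
        """Accept a write from the OS but delay the actual disk I/O."""
        self.pending[index] = block
        if len(self.pending) == self.data_disks:
            self._flush_full_stripe()

    def _flush_full_stripe(self) -> None:
        """Full stripe in cache: compute parity from cache alone, no reads needed."""
        blocks = [self.pending[i] for i in range(self.data_disks)]
        parity = reduce(xor_blocks, blocks)
        self.xor_ops += len(blocks) - 1
        self.on_disk = dict(enumerate(blocks))
        self.on_disk["P"] = parity
        self.disk_writes += len(blocks) + 1  # one write per disk, all independent
        self.pending.clear()


cache = CombiningStripeCache(data_disks=3)
cache.write(0, b"A" * 8)   # file #1 (disk A), held in cache
cache.write(1, b"B" * 8)   # file #2, first half (disk B)
cache.write(2, b"C" * 8)   # file #2, second half (disk C) triggers the flush
print(cache.disk_reads, cache.disk_writes, cache.xor_ops)
```

The blocks are 8 bytes here instead of 64K only to keep the example short; the counts are what matter.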
The disadvantage here is the danger of the controller reporting to the OS that a write is completed, when in actuality the pending write is sitting in cache. This underscores the importance of the battery backup unit that many RAID controllers offer. If the write was to a very important part of the disk (directory information or file journal), a power failure or OS crash will leave the file system in an inconsistent state, with the potential loss of data.
Like Enlightenment has said on a few occasions, there are software implementations of RAID that do these things (none available for Windows as far as I know, only Linux). The point is that not all RAID is created equal, and there's much more to it than just throwing XOR computational power at the problem. The controller has to be intelligent and creative to reduce the workload to achieve high performance. This is especially important as the number of disks increases. On a 16-drive array, a write to a 512K file could result in an I/O operation to every disk in the array. If the OS is attempting to write to hundreds of these files, write combining is a must.
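For the 16-drive case, the arithmetic can be worked out like this. The sketch assumes the same 64K block size as the earlier example, a write that starts on a block boundary and stays within one stripe, and a controller that recomputes parity from the untouched blocks; none of that is stated above, so treat the numbers as one plausible scenario.

```python
UNIT = 64 * 1024    # block size per disk (assumed, same as the 4-drive example)
TOTAL_DISKS = 16    # 15 data blocks + 1 parity block per stripe


def units_touched(offset: int, length: int, unit: int = UNIT) -> int:
    """Number of per-disk blocks that a byte range overlaps."""
    first = offset // unit
    last = (offset + length - 1) // unit
    return last - first + 1


data_disks = TOTAL_DISKS - 1
touched = units_touched(0, 512 * 1024)    # blocks the 512K file write replaces
untouched = data_disks - touched          # blocks read to recompute parity
disks_with_io = untouched + touched + 1   # reads + data writes + parity write
print(touched, untouched, disks_with_io)
```

A 512K write covers only 8 of the 15 data blocks, so it is not a full stripe on its own; the controller needs more pending writes to the same stripe before it can skip the reads.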
That is awesome SomeJoe7777
Are you submitting to Wiki?
I was in on the original thread with Enlightenment. His touching on combining I/O requests in RAIDs got me interested, and I went looking for more info. So far your explanation of this is the best I've seen. It makes me want to go out and get that battery backup for my RAID5 controller.
Thanks for your post MUCH appreciated.