How RAID-5 really works

Dec 11, 2007 at 4:02 am - 12 Comments

What is RAID?

RAID is an acronym for "Redundant Array of Independent Disks," or "Redundant Array of Inexpensive Disks." The main idea behind RAID is the ability to take multiple drives and have them virtualised as a single drive. There are many different RAID structures, all of which serve one of two primary purposes: more storage space or more data redundancy (as in, protection against data loss in the event of hard drive failure). I'm not going into the details of all the RAID levels, as this post is geared towards the lower-level workings of RAID-5, but if you wish to learn about all the RAID levels, see this Wikipedia Article On RAID.

What is RAID 5?

RAID 5 provides fault tolerance along with good read performance, safeguarding data while only sacrificing the equivalent of one drive's space. Level 5 requires at least 3 hard drives, ideally of the same size. The total storage space available with a RAID 5 is equal to { (number of drives - 1) * size of smallest drive }. So if you use three 120 GB hard drives, you will have 240 GB of actual usable space. If you used five 120 GB hard drives, you would have 480 GB of usable space. The more drives you use, the more efficient your storage space is: the parity overhead is always one drive's worth, so it shrinks from 1/3 of the raw capacity with three drives to 1/5 with five.
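The capacity formula above can be sketched in a few lines of Python. The function names here are made up for illustration, and sizes are assumed to be given in GB.

```python
def raid5_usable_gb(drive_sizes_gb):
    """Usable space: (number of drives - 1) * size of the smallest drive."""
    if len(drive_sizes_gb) < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (len(drive_sizes_gb) - 1) * min(drive_sizes_gb)


def raid5_efficiency(num_drives):
    """Fraction of raw capacity left for data when all drives are the same size."""
    return (num_drives - 1) / num_drives


print(raid5_usable_gb([120, 120, 120]))  # 240
print(raid5_usable_gb([120] * 5))        # 480
print(raid5_efficiency(3), raid5_efficiency(5))
```

Note that mixing drive sizes wastes space: every drive is treated as if it were as small as the smallest one.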

Data Redundancy

Your data can survive a complete failure of one hard drive, but if two drives fail at the same time, ALL data is lost. It is very important to have an extra drive on hand so that if a drive fails, you can replace it immediately and start the rebuild. The RAID array can actually still be used with one drive completely missing or not working, but performance is degraded because the missing data must be rebuilt on the fly. However, if you do not have an extra drive to plug in right away when one fails, it would be wise to keep the computer and all drives powered off until you can replace the failed drive. You may think "oh, it will only be a couple of days before the new drive arrives", but ask yourself this: is having access to the data on these drives for a couple of days worth the risk of losing it all forever if another drive happens to fail? Probably not.

Figure 1. Representation of RAID-5 data structure.

Striping & Parity

Data is "striped" across the hard drives, with a dedicated parity block for each stripe. A, B, C, and D represent data "stripes." The segment of a stripe held on each drive can vary in size; I believe anywhere from 4 KB to 256 KB per segment is normal, and it can be set during setup to adjust performance. The blocks with a subscript P are the parity blocks, which are a representation of the sum of all other blocks in that stripe (explained in more detail below). The parity is responsible for the data fault tolerance and is also the reason why you lose the amount of space equivalent to one drive. Using figure 1, let's say that the second drive becomes corrupt and dies. A new hard drive would be put in its place, and the RAID controller would rebuild the data automatically. The data in segments A1 and A3 would be combined with the AP parity block, which would allow the data for A2 to be rebuilt; the same goes for all other stripes that must be recovered. Parity blocks are determined by using a logical comparison called XOR (Exclusive OR) on binary blocks of data, which is explained further down.

Performance

RAID 5 offers accelerated read performance because the data stream is accessed from multiple drives at the same time. Referring to figure 1, let's say that stripe A is a single file. Normally, on a single drive, the whole file would have to be streamed from that one hard drive, so its maximum read speed becomes the bottleneck. With RAID 5, that one file can be read in about 1/3 the time because its data comes from 3 drives at once: block 1 holds the first third of the file, block 2 the second third, and block 3 the last third. In the ideal case this triples your read speed, with even more performance potential in arrays containing more hard drives!

The downside is that there is additional overhead when writing to the array. This overhead comes from the parity calculation: every bit written to the drives must be processed to update the parity block for its stripe. If your intended use involves a lot of data writing (such as video recording, a high-traffic server, etc.), RAID 5 would not be the ideal choice.
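To make the write overhead concrete, here is a toy model in Python of what changing one data block costs. It is only a sketch: `write_block` is a made-up name, and it assumes the parity is recomputed from every data block in the stripe, which is not necessarily how a real controller does it. The sample values are the 4-bit blocks used in the parity examples further down.

```python
from functools import reduce
from operator import xor


def write_block(data_blocks, index, new_value):
    """Return (updated data blocks, new parity, blocks read, blocks written)."""
    updated = list(data_blocks)
    updated[index] = new_value
    # Assumption: parity is rebuilt from all data blocks in the stripe.
    parity = reduce(xor, updated)
    blocks_read = len(updated) - 1  # the untouched data blocks feeding the parity
    blocks_written = 2              # the changed data block plus the parity block
    return updated, parity, blocks_read, blocks_written


stripe = [0b0100, 0b0101, 0b0010]
updated, parity, reads, writes = write_block(stripe, 1, 0b1111)
print(f"parity={parity:04b} reads={reads} writes={writes}")
```

Even in this simplified model, writing a single block turns into two reads and two writes, where a plain drive would do one write.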

XOR Comparison

As I'm sure you already know, data is stored and processed in binary, which is of course 0's and 1's. There are methods of comparing binary bits called operators. The one that does the magic of parity creation is called XOR, or Exclusive OR. If you have experience in lower level programming or electronics, you probably already know what an XOR is.

Figure 2. XOR Inputs/Outputs

Basically, an XOR will take two binary bits, compare them, and output a result of 0 or 1. It will return a 1 ONLY IF the two inputs are different. If both bits are 0, the output is 0. If both bits are 1, the output is 0. If one bit is 0, and the other bit is 1, the output is 1.
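The four cases just described can be printed as a small truth table. In Python the `^` operator performs XOR; `xor_bit` is just an illustrative wrapper name.

```python
def xor_bit(a, b):
    """Return 1 only if the two input bits differ."""
    return a ^ b


print("A B | A XOR B")
for a in (0, 1):
    for b in (0, 1):
        print(f"{a} {b} |    {xor_bit(a, b)}")
```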

Figure 3. Yellow cells represent parity blocks.

Building Parity

For easier understanding and explaining, we are only going to work with 4-bit blocks. Actual data blocks can range from 4 KB (32,768 bits) up to 256 KB (2,097,152 bits), but the method is exactly the same regardless of how many consecutive bits you work with. In figure 3, the yellow blocks represent the parities for each stripe. As you probably noticed right away, the parities are distributed across all drives. This provides a slight increase in performance, and is what separates RAID 5 from RAID 4 (RAID 4 keeps all parities on a single drive).
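The difference between the two layouts can be sketched as follows. The rotation rule used for RAID 5 here (parity starts on the last drive and moves one drive to the left with each stripe) is an assumption for illustration; real implementations use several different rotation schemes, and the function names are hypothetical.

```python
def parity_drive_raid4(stripe, num_drives):
    """RAID 4: parity always lives on the same (last) drive."""
    return num_drives - 1


def parity_drive_raid5(stripe, num_drives):
    """RAID 5 (assumed rotation): parity shifts one drive left per stripe."""
    return (num_drives - 1 - stripe) % num_drives


def layout_row(num_drives, parity_drive):
    """Mark each drive in one stripe as data (D) or parity (P)."""
    return ["P" if d == parity_drive else "D" for d in range(num_drives)]


for name, rule in (("RAID 4", parity_drive_raid4), ("RAID 5", parity_drive_raid5)):
    print(name)
    for stripe in range(4):
        print("  stripe", stripe, " ".join(layout_row(4, rule(stripe, 4))))
```

With a fixed parity drive, every write in the array has to touch that one drive; rotating the parity spreads that load across all of them.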

Let's examine the first stripe of figure 3. To compute the parity, we must run the XOR comparison on each block of data in that stripe. You XOR the first two blocks, then take the result and XOR it against the third block (and continue this for all drives in the array, except the one where the parity will be stored, of course).

(Drive 1) XOR (Drive 2) = (0100) XOR (0101) = (0001)
(Result) XOR (Drive 3) = (0001) XOR (0010) = (0011)

Let me break that down a little more in case you couldn't follow. Refer to figure 2 if you have trouble remembering the inputs/outputs for XOR.

First we need to compare the first two drives' blocks, which are 0100 and 0101. The very first bit comparison is 0 and 0 (the first bits from both blocks), which results in 0, the first bit of our temporary parity. The second bits are 1 and 1, which results in 0. So far, our temporary parity is 00. Now the third bit comparison is 0 and 0, which yet again returns 0. We are now at 000. The fourth bit comparison is 0 and 1, which results in 1. So the result of (Drive 1) XOR (Drive 2) is 0001. We now must take this block and compare it to drive 3, which is 0010. The XOR of 0001 and 0010 equals 0011, which is the parity for stripe 1!
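The same calculation can be checked with a few lines of Python, using the 4-bit values from stripe 1. `build_parity` is an illustrative helper name, not part of any RAID software.

```python
from functools import reduce
from operator import xor


def build_parity(data_blocks):
    """XOR all data blocks of a stripe together, left to right."""
    return reduce(xor, data_blocks)


stripe1 = [0b0100, 0b0101, 0b0010]  # drives 1, 2 and 3

step1 = stripe1[0] ^ stripe1[1]     # (Drive 1) XOR (Drive 2)
parity = build_parity(stripe1)      # ... XOR (Drive 3)

print(f"{step1:04b}")   # 0001
print(f"{parity:04b}")  # 0011
```

Because `^` works on whole integers at once, the same function handles blocks of any width, not just 4 bits.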

Recovering Data

The very cool thing about XOR comparisons, and what makes RAID 5 possible, is that if one value goes missing, you can always find the missing value by running an XOR on all the remaining values! Referring back to figure 3, let's say that drive 1 failed beyond fixing. The RAID controller will alert the user that a drive has failed and must be replaced. As soon as a new drive is put in, the controller will automatically start rebuilding the lost data. Here is how we rebuild drive 1, stripe 1:

(Drive 2) XOR (Drive 3) = (0101) XOR (0010) = (0111)
(Result) XOR (Drive 4) = (0111) XOR (0011) = (0100)

As you can see, the final result is 0100. Now refer back to figure 3 at drive 1, stripe 1... sure enough, it's 0100! Amazing, right? Just for fun, let's rebuild stripe 2 as well, assuming drive 1 died.

(Drive 2) XOR (Drive 3) = (0000) XOR (0110) = (0110)
(Result) XOR (Drive 4) = (0110) XOR (0100) = (0010)

The missing block was calculated as 0010. Take a look at figure 3 to verify what drive 1, stripe 2 was before the failure and see if it matches the computed value... of course it does!
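Both rebuilds can be reproduced in Python. This sketch uses the block values quoted above for stripes 1 and 2; `rebuild` is a hypothetical helper, and it does not need to know which block in a stripe is the parity, since XOR treats them all alike.

```python
from functools import reduce
from operator import xor


def rebuild(stripe, failed_index):
    """Recover the block at failed_index by XORing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return reduce(xor, survivors)


# Blocks on drives 1-4, as used in the worked examples above.
stripe1 = [0b0100, 0b0101, 0b0010, 0b0011]
stripe2 = [0b0010, 0b0000, 0b0110, 0b0100]

print(f"{rebuild(stripe1, 0):04b}")  # 0100
print(f"{rebuild(stripe2, 0):04b}")  # 0010

# The same trick works no matter which single drive is lost.
for stripe in (stripe1, stripe2):
    for failed in range(len(stripe)):
        assert rebuild(stripe, failed) == stripe[failed]
```

If two blocks of the same stripe are missing, there is no longer enough information to recover either one, which is why a second drive failure is fatal.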

Well, I hope you have enjoyed this post. It took me a great deal of searching to find this information when my own curiosity got the better of me, and I couldn't find any articles that explained all of this in one place, so I decided to write this article and hope someone finds it helpful!

Comments (12)

Joe Dec 12, 2007

That is cool!

rob Jan 06, 2008

nice one!

Dave Feb 17, 2008

As the number of drives (and the size of the drives) goes up the chances of a disk failure also go up. At some stage, the chance of two disks failing in the time it takes to replace the first disk approaches 1.

We have a few machines at my work with 48 disks, each holding 500GB (you can get these boxes now with terabyte disks...), and if we just made the whole thing a big RAID 5 array then the failure rate would be too scary. As it is, we have some RAID 5 in there along with some RAID 1+0, but we also have hot-spare disks in the machine that are unused, so that if any single disk fails we can restore the RAID 5 array back to full health immediately. We keep one hot-spare for every RAID array in the box.

Just something to keep in mind when you are setting your RAID arrays up. Having a couple of cold spares on hand is not such a bad idea either.

craig cowan Jun 02, 2008

For the first time in a week of trying to get my head fully round this, I finally did, thanks to this document!!!! Thank you!! :)

Binoy Nicholas Jun 07, 2008

To say the least, fantastic!! So precise and simple that it took only 10 minutes for a novice like me to conceptualize the scenario. Thanks Scott for this nice stuff.

Blue Jun 16, 2008

if we update a single 512 byte block on one disk - will the parity be recalculated for just that block or for the whole stripe block (3 x 32KiB stripe size)
reason I ask is I have seen RAID5 performance drop to around 1MiB/s when cache is saturated which corresponds to reading around 64 blocks

scott klarr Jun 17, 2008

Blue, I believe that any change, even if only a single binary difference, will require that whole stripe to be recalculated.

Anand Aug 04, 2008

What happens when there is data corruption, let's say on disk 2? We can see that the parity does not add up, but how do we know which disk the error is actually on?

Sahkan Aug 24, 2008

Thanks for writing the article, great job!

chandra Sep 12, 2008

If there are 10 HDDs in a RAID 5 array, will data be lost if 2 HDDs fail at the same time?

Paul J. Sep 25, 2008

I was the sysad on 28TB of NetApps RAID that utilized RAID 5, separated into six different RAID groups, with multiple bricks per group. NetApps has redundancy throughout its entire setup, with dual heads, dual data channels, etc. The way that the RAID is setup is that in a RAID group, each brick has a parity drive and hot spare set aside. If there are five bricks, then there are 5 hot spares that can be utilized by any of the other bricks within the group, should a drive fail. You could lose up to 5 drives on any brick before you would start to lose data. It would be very hard to lose any data with this setup as long as you are checking your drives on a daily basis. And no, I don't work for NetApps, I just like how their built-in redundancy make for a very flawless RAID setup.

Adarsh Kumar Nov 07, 2008

Very well explained. I enjoyed it.
