September 08, 2008 Archives

09.08.2008 23:46

Failure rate of large SATA drives

I came across two fascinating articles today about why SATA RAID can be scary and dangerous:

Basically, drive size has increased dramatically. The "bit error rate"[1] has remained essentially constant. As you approach something around 12TB, the odds are good that a full read of the array will hit at least one unreadable block.
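To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the commonly quoted consumer SATA spec of one unrecoverable read error per 10^14 bits and treats errors as independent; both are my assumptions, not figures taken from the articles, and the function name is made up for illustration.

```python
import math

TB = 10**12  # decimal terabyte, as drive vendors count it


def p_at_least_one_ure(bytes_read, bits_per_error=1e14):
    """Chance of at least one unrecoverable read error while reading bytes_read.

    bits_per_error is an assumed spec value (one error per 1e14 bits read),
    and errors are assumed to be independent of each other.
    """
    bits_read = bytes_read * 8
    p_bit = 1.0 / bits_per_error
    # 1 - (1 - p_bit) ** bits_read, computed via log1p/expm1 to stay accurate
    # when p_bit is tiny.
    return -math.expm1(bits_read * math.log1p(-p_bit))


for size_tb in (1, 4, 12, 24):
    print(f"{size_tb:>2} TB read: {p_at_least_one_ure(size_tb * TB):.1%}")
```

Under this simple model a full read of 12TB comes out closer to two chances in three than to a sure thing, but the number climbs quickly as arrays get bigger.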

Who cares if a couple of bytes in the middle of some old archived log files are munged? I sure as hell don't. Unfortunately, most failed blocks are only discovered during the one process that requires reading and checksumming every block of data on the array: rebuilding the array after a drive failure. What happens then? Depends on the RAID controller. Apparently it's pretty common for it to give up and leave you with no usable data at all.

It wasn't until I read this article that I understood why the über-smart folks at Coraid implemented a technology called RAIDShield that basically continuously reads every sector of every disk in the background. If a bad block is found, the correct data is calculated from the parity data and written out to the "bad" sector. This triggers the drive's internal sector remapping, guaranteeing that the data is written to a good portion of the disk and flagging the sector as unusable.
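The general idea of a background scrub can be sketched in a few lines. This is a toy model of my own, not Coraid's implementation: it assumes simple single-parity XOR striping, and `ToyDisk`, `xor_blocks`, and `scrub` are hypothetical names invented for the example.

```python
from functools import reduce


class ReadError(Exception):
    """Raised when a toy disk cannot read a sector."""


class ToyDisk:
    """Hypothetical in-memory disk: one block per sector."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.bad = set()       # sectors that currently fail to read
        self.remapped = set()  # sectors "remapped" after a rewrite

    def read(self, sector):
        if sector in self.bad:
            raise ReadError(sector)
        return self.blocks[sector]

    def write(self, sector, data):
        # Rewriting a bad sector stands in for the drive's internal remap.
        if sector in self.bad:
            self.bad.discard(sector)
            self.remapped.add(sector)
        self.blocks[sector] = data


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def scrub(disks):
    """Read every sector of every disk; rebuild unreadable ones from peers.

    Returns a list of (disk_index, sector) pairs that were repaired. If two
    disks fail on the same stripe, the ReadError propagates, because single
    parity cannot recover that case.
    """
    repaired = []
    for sector in range(len(disks[0].blocks)):
        for index, disk in enumerate(disks):
            try:
                disk.read(sector)
            except ReadError:
                peers = [d.read(sector) for d in disks if d is not disk]
                disk.write(sector, xor_blocks(peers))
                repaired.append((index, sector))
    return repaired


# Two data disks plus one parity disk, then one sector goes bad.
data0 = ToyDisk([b"abcd", b"efgh"])
data1 = ToyDisk([b"1234", b"5678"])
parity = ToyDisk([xor_blocks(pair) for pair in zip(data0.blocks, data1.blocks)])
data1.blocks[1] = b"\x00" * 4  # contents lost
data1.bad.add(1)
print(scrub([data0, data1, parity]))  # [(1, 1)]
```

The point of doing this continuously is that the bad sector gets found and fixed while the other disks are still healthy, rather than in the middle of a rebuild when there is no redundancy left to reconstruct it from.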

I really like knowing our vendors have solved problems we didn't even know we had. Makes me feel all warm and fuzzy.

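[1] More precisely, the unrecoverable read error rate: how often a read fails per bit read, commonly specified as one error per 10^14 bits for consumer SATA drives. That works out to roughly one expected error per 12.5TB read.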


Posted by Insyte | Categories: Unix Stuff