RAID is a common data storage technology, but it is still susceptible to failures. Most RAID levels offer redundancy, error correction and fault tolerance, yet serious problems can still occur. Besides hardware damage, a RAID array can also suffer logical corruption. If the problem is serious enough, the effect can be as devastating as a major disk failure. Failures are most likely to be catastrophic in RAID 0, which stripes data across disks without any redundancy; for this reason it is rarely used where data protection matters. Regardless of the RAID level used, it is important to make sure that data can be recovered from a failed disk.
Although the probability may be low, it is still possible for both copies of a file in a mirrored RAID array to be corrupted. RAID 0 uses data striping: data is split into blocks (stripe units) that are spread across all the disks, so in a four-disk array the failure of one disk means losing one block out of every four. When a RAID 0 array fails, it is quite unlikely for files to survive intact. Small files that happen to lie entirely on the surviving disks may still be accessible, but larger files, which span every disk, are almost always damaged. For corporate workloads such as SQL databases and Microsoft Exchange, a failed RAID array can be disastrous, resulting in considerable data loss and interruption of business operations. For these uses, it is important to choose a RAID level that includes parity, such as RAID 5 or RAID 6.
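The "one block out of every four" effect can be illustrated with a short sketch. It assumes a hypothetical four-disk RAID 0 array with stripe units assigned to disks round-robin, and treats each "block" as one stripe unit; real controllers may use different layouts, so this is illustrative only.

```python
"""Sketch: how RAID 0 striping spreads data, and what one failed disk costs.

Assumptions (illustrative, not from any particular controller): a four-disk
array, stripe units assigned round-robin, and each "block" is one stripe unit.
"""

NUM_DISKS = 4  # hypothetical array width


def disk_for_block(logical_block: int, num_disks: int = NUM_DISKS) -> int:
    """Return the disk that holds a logical block under round-robin striping."""
    return logical_block % num_disks


def lost_blocks(total_blocks: int, failed_disk: int, num_disks: int = NUM_DISKS) -> list[int]:
    """List the logical blocks that were stored on the failed disk."""
    return [b for b in range(total_blocks) if disk_for_block(b, num_disks) == failed_disk]


def file_survives(start_block: int, length_blocks: int, failed_disk: int,
                  num_disks: int = NUM_DISKS) -> bool:
    """A file survives only if none of its blocks sit on the failed disk."""
    return all(disk_for_block(b, num_disks) != failed_disk
               for b in range(start_block, start_block + length_blocks))


if __name__ == "__main__":
    # Losing disk 2 removes one block out of every four.
    print(lost_blocks(12, failed_disk=2))       # [2, 6, 10]
    # A one-block file on disk 0 survives; a file spanning all disks does not.
    print(file_survives(0, 1, failed_disk=2))   # True
    print(file_survives(0, 8, failed_disk=2))   # False
```

A file no larger than one stripe unit lands on a single disk, which is why small files can survive a RAID 0 failure while larger ones almost never do.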
A RAID array with error correction and parity information has a better chance of surviving failures. No disk failure should be ignored, and it is important to keep a spare drive on hand. We may even need a backup RAID system that stores two additional copies of our critical files. Also note that a degraded array, such as a RAID 5 array still running with one failed drive, offers little or no protection against a second failure. When a RAID array encounters problems, it should be repaired by replacing the failed drive, and the new drive should immediately be rebuilt from the surviving data in the same array (the mirror copy or the parity information).
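How parity allows a replacement drive to be rebuilt can be shown with XOR, the operation RAID 5 uses for its parity. The sketch below is simplified: it assumes one stripe, three data drives and a parity block on a fixed drive, whereas real RAID 5 rotates parity across all drives.

```python
"""Sketch: rebuilding a failed drive from XOR parity, as in RAID 5.

Simplified assumptions: one stripe, three data drives plus one parity drive,
and parity stored on a fixed drive (real RAID 5 rotates parity across drives).
"""

from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks: list[bytes]) -> bytes:
    """The parity block is the XOR of all data blocks in the stripe."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks: list[bytes]) -> bytes:
    """Any single missing block equals the XOR of every surviving block,
    including the parity block."""
    return xor_blocks(surviving_blocks)


if __name__ == "__main__":
    data = [b"DATA-A", b"DATA-B", b"DATA-C"]
    parity = compute_parity(data)

    # Simulate losing the drive that held data[1], then rebuild its block
    # for a replacement drive from the survivors plus parity.
    survivors = [data[0], data[2], parity]
    print(rebuild_missing(survivors))  # b'DATA-B'
```

The same arithmetic shows why a degraded RAID 5 array is fragile: with one drive already missing, losing a second leaves two unknown blocks and only one parity equation, so the stripe can no longer be reconstructed.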
Usually, the hard disks in a RAID array are identical or similar models from the same manufacturer. This means we can contact the vendor when a drive fails. This is especially useful in a data center, where hundreds, if not thousands, of drives operate in RAID arrays. The vendor may provide direct support, supply replacements, and help restore partially damaged arrays. If drives fail during the warranty period, they can be replaced immediately, which restores the array's full fault tolerance. Damaged disks should also be replaced because a failing drive can destabilize the SCSI bus it shares with the other drives.
Besides hardware failures, problems in a RAID system can also be caused by configuration loss. The RAID controller holds the array's configuration, such as the RAID level, the order of the member disks and the stripe size; if the controller fails and this vital information is lost, the array may no longer work. A damaged RAID controller may need to be reprogrammed. A reliable RAID system removes the need for frantic emergency repairs, but every issue, problem and failure should still be remedied promptly. Every recovery step should be non-destructive, so that it does not make the overall situation worse.
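Why the configuration is so important can be seen in a small sketch: the same disk contents read back correctly only when the original layout is used. The configuration here is deliberately reduced to a hypothetical RAID 0 array described by member order and stripe size alone; real controllers store more metadata than this.

```python
"""Sketch: why losing a RAID controller's configuration is serious.

Hypothetical setup: a RAID 0 array described only by its member order and
stripe size. Real controllers store more metadata (RAID level, offsets, etc.).
"""


def stripe(data: bytes, num_disks: int, stripe_size: int) -> list[bytes]:
    """Split data round-robin into stripe units, one byte string per disk."""
    disks = [bytearray() for _ in range(num_disks)]
    for i in range(0, len(data), stripe_size):
        disks[(i // stripe_size) % num_disks] += data[i:i + stripe_size]
    return [bytes(d) for d in disks]


def reassemble(disks: list[bytes], stripe_size: int) -> bytes:
    """Read stripe units back round-robin, trusting the given order and size."""
    out = bytearray()
    offsets = [0] * len(disks)
    while any(offsets[d] < len(disks[d]) for d in range(len(disks))):
        for d, disk in enumerate(disks):
            out += disk[offsets[d]:offsets[d] + stripe_size]
            offsets[d] += stripe_size
    return bytes(out)


if __name__ == "__main__":
    original = b"ABCDEFGHIJKL"
    disks = stripe(original, num_disks=3, stripe_size=2)

    # Correct layout: the data comes back intact.
    print(reassemble(disks, stripe_size=2))                           # b'ABCDEFGHIJKL'
    # Wrong member order: every byte is present, but the data is scrambled.
    print(reassemble([disks[1], disks[0], disks[2]], stripe_size=2))  # b'CDABEFIJGHKL'
    # Wrong stripe size: scrambled again.
    print(reassemble(disks, stripe_size=4))                           # b'ABGHCDIJEFKL'
```

No disk is physically damaged in this example, yet without the correct order and stripe size the data is unusable, which is what configuration loss looks like from the operating system's point of view.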