- Detecting, querying and testing
- Detecting a drive failure
- Querying the array status
- Simulating a drive failure
- Force-fail by hardware
- Force-fail by software
- Simulating data corruption
- Monitoring RAID arrays
- RAID Administration
- I/O Schedulers
Detecting, querying and testing
This section is about life with a software RAID system: communicating with the arrays and tinkering with them.
Note that when manipulating md devices you are always working with entire filesystems. So, although there may be some redundancy keeping your files alive, you must proceed with caution.
Detecting a drive failure
Firstly: mdadm has an excellent ‘monitor’ mode which will send an email when a problem is detected in any array (more about that later).
Of course the standard log and stat files will record more details about a drive failure.
/var/log/messages always manages to fill screens with error messages, no matter what has happened. When a disk crashes, however, a huge number of kernel errors are reported. Some nasty examples, for the masochists:
kernel: scsi0 channel 0 : resetting for second half of retries.
kernel: SCSI bus is being reset for host 0 channel 0.
kernel: scsi0: Sending Bus Device Reset CCB #2666 to Target 0
kernel: scsi0: Bus Device Reset CCB #2666 to Target 0 Completed
kernel: scsi : aborting command due to timeout : pid 2649, scsi0, channel 0, id 0, lun 0 Write (6) 18 33 11 24 00
kernel: scsi0: Aborting CCB #2669 to Target 0
kernel: SCSI host 0 channel 0 reset (pid 2644) timed out - trying harder
kernel: SCSI bus is being reset for host 0 channel 0.
kernel: scsi0: CCB #2669 to Target 0 Aborted
kernel: scsi0: Resetting BusLogic BT-958 due to Target 0
kernel: scsi0: *** BusLogic BT-958 Initialized Successfully ***
Most often, disk failures look like these:
kernel: sidisk I/O error: dev 08:01, sector 1590410
kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002
or:
kernel: hde: read_intr: error=0x10 < SectorIdNotFound >, CHS=31563/14/35, sector=0
kernel: hde: read_intr: status=0x59
And, as expected, the classic /proc/mdstat output will also reveal problems:
Personalities : [linear] [raid0] [raid1] [translucent]
read_ahead not set
md7 : active raid1 sdc9[0] sdd5[8]
      32000 blocks [2/1] [U_]
Later in this section we will learn how to monitor RAID with mdadm so that we can receive alerts about disk failures. Now it is time to learn more about interpreting /proc/mdstat.
Querying the array status
You can always take a look at the array status by running cat /proc/mdstat. It won't hurt. Take a look at the /proc/mdstat page to learn how to read the file.
Finally, remember that you can also use mdadm to examine the arrays. Its output shows spare and failed disks loud and clear.
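For example (the device names here are only placeholders; substitute your own array and member partition):
mdadm --detail /dev/md0
mdadm --examine /dev/sdc2
The first command reports the state of the array as a whole; the second reads the RAID superblock of an individual member device.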
Simulating a drive failure
If you plan to use RAID to get fault-tolerance, you may also want to test your setup, to see if it really works. Now, how does one simulate a disk failure?
The short story is that you can't, except perhaps by putting a fire axe through the drive you want to «simulate» the fault on. You can never know exactly what will happen when a drive dies. It may electrically take the bus it is attached to down with it, rendering all drives on that bus inaccessible. Or the drive may simply report a read/write fault to the SCSI/IDE/SATA layer, which, if handled properly, in turn lets the RAID layer deal with the situation gracefully. This is fortunately the way things often go.
Remember that you must be running a redundant RAID level (RAID-1, -4, -5, -6 or -10) for your array to be able to survive a disk failure. Linear mode and RAID-0 will fail completely when a device is missing.
Force-fail by hardware
If you want to simulate a drive failure, you can simply unplug the drive. If your hardware does not support hot-unplugging disks, do this with the power off (if you are interested in testing whether your data can survive with one disk fewer than the usual number, there is no point in being a hot-plug cowboy here: take the system down, unplug the disk, and boot it up again).
Look in the syslog, and look at /proc/mdstat to see how the RAID is doing. Did it work? Did you get an email from the mdadm monitor?
Faulty disks should appear marked with an (F) if you look at /proc/mdstat. Also, users of mdadm should see the device state as faulty.
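As a purely illustrative sketch (array name, member names and block count are made up), a failed mirror half would show up in /proc/mdstat roughly like this, with the (F) marker and an underscore in the status field:
md1 : active raid1 sdc2[2](F) sdb2[0]
      12345678 blocks [2/1] [U_]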
When you've re-connected the disk (with the power off, of course, remember), you can add the «new» device to the RAID again with the mdadm --add command.
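For example, reusing the device names used elsewhere on this page (substitute your own):
mdadm /dev/md1 --add /dev/sdc2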
Force-fail by software
You can also simulate a drive failure without unplugging anything. Simply running the command
mdadm --manage --set-faulty /dev/md1 /dev/sdc2
should be enough to fail the disk /dev/sdc2 of the array /dev/md1.
Now things get moving and the fun begins. You should see something like the first of the following lines in your system's log; the second line will appear only if you have spare disks configured.
kernel: raid1: Disk failure on sdc2, disabling device.
kernel: md1: resyncing spare disk sdb7 to replace failed disk
Checking /proc/mdstat will show the degraded array. If a spare disk was available, reconstruction should already have started.
Another useful command at this point is:
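mdadm --detail /dev/md1
(/dev/md1 simply follows this section's example; --detail lists the state of every member, so the disk we just failed should be reported as faulty.)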
Now you’ve seen how it goes when a device fails. Let’s fix things up.
First, we will remove the failed disk from the array. Run the command
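mdadm /dev/md1 --remove /dev/sdc2
(again using this section's example names; substitute your own array and partition).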
Note that mdadm cannot pull a working disk out of a running array. For obvious reasons, only faulty disks can be hot-removed from an array (even stopping and unmounting the device won't help: if you ever want to remove a 'good' disk, you have to tell the array to put it into the 'failed' state as above).
Now we have a /dev/md1 which has just lost a device. This could be a degraded RAID or perhaps a system in the middle of a reconstruction process. We wait until recovery ends before setting things back to normal.
So the trip ends when we send /dev/sdc2 back home, with the same --add command as before:
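mdadm /dev/md1 --add /dev/sdc2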
As the prodigal son returns to the array, we’ll see it becoming an active member of /dev/md1 if necessary. If not, it will be marked as a spare disk. That’s management made easy.
Simulating data corruption
RAID (be it hardware or software) assumes that if a write to a disk doesn't return an error, then the write was successful. Therefore, if your disk corrupts data without returning an error, your data will become corrupted. This is of course very unlikely to happen, but it is possible, and it would result in a corrupt filesystem.
RAID cannot, and is not supposed to, guard against data corruption on the media. Therefore, it doesn’t make any sense either, to purposely corrupt data (using dd for example) on a disk to see how the RAID system will handle that. It is most likely (unless you corrupt the RAID superblock) that the RAID layer will never find out about the corruption, but your filesystem on the RAID device will be corrupted.
This is the way things are supposed to work. RAID is not a guarantee of data integrity; it just allows you to keep your data if a disk dies (that is, with RAID levels 1 and above, of course).
Monitoring RAID arrays
You can run mdadm as a daemon by using the follow/monitor mode. If needed, this will make mdadm send email alerts to the system administrator when arrays encounter errors or fail. Follow mode can also be used to trigger contingency commands if a disk fails, such as giving a failed disk a second chance by removing and re-inserting it, so that a non-fatal failure is resolved automatically.
Let’s see a basic example. Running
mdadm --monitor --daemonise --mail=root@localhost --delay=1800 /dev/md2
should launch an mdadm daemon to monitor /dev/md2. The --daemonise switch tells mdadm to run as a daemon. The delay parameter means that polling will be done in intervals of 1800 seconds. Finally, critical events and fatal errors will be e-mailed to the system manager. That's RAID monitoring made easy.
Finally, the --program or --alert parameters specify the program to be run whenever an event is detected.
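For instance, a hypothetical handler script could be hooked up like this (the script path is only an example; --scan means «watch every array listed in the configuration file»):
mdadm --monitor --scan --daemonise --alert=/usr/local/sbin/raid-event.sh
The program is called with the event name, the md device and, for some events, the affected component device as arguments.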
Note that, when supplying the -f switch, the mdadm daemon will never exit once it decides that there are arrays to monitor, so it should normally be run in the background. Remember that you are running a daemon, not a shell command. If mdadm is run to monitor without the -f switch, it will behave as a normal shell command and wait for you to stop it.
Using mdadm to monitor a RAID array is simple and effective. However, there are fundamental problems with that kind of monitoring — what happens, for example, if the mdadm daemon stops? In order to overcome this problem, one should look towards «real» monitoring solutions. There are a number of free software, open source, and even commercial solutions available which can be used for Software RAID monitoring on Linux. A search on FreshMeat should return a good number of matches.
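As a simple safety net, mdadm can also be run periodically from cron in one-shot mode, which checks the arrays once, reports anything degraded, and exits (the schedule and path below are only an illustration):
0 6 * * * root /sbin/mdadm --monitor --scan --oneshot
This is no substitute for a real monitoring system, but it does not depend on a long-running daemon staying alive either.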
RAID Administration
The md system makes the following check and repair operations available through sysfs:
echo check > /sys/block/mdX/md/sync_action
echo repair > /sys/block/mdX/md/sync_action
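A minimal scrubbing sketch, assuming a hypothetical array md0: start a check, watch its progress, then read the mismatch count afterwards (the sysfs counter is called mismatch_cnt; writing idle to the same sync_action file aborts a running check or repair, see question 5 below):
echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt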
A recent discussion between Roy Waldspurger and Neil Brown:
On a RAID5, and soon a RAID6, I’m looking to set up a cron job, and am trying to figure out what exactly to schedule. The answers to the following questions might shed some light on this:
1. GENERALLY SPEAKING, WHAT IS THE DIFFERENCE BETWEEN THE «CHECK» AND «REPAIR» COMMANDS?
The md.txt doc mentions for «check» that «a repair may also happen for some raid levels.»
Which RAID levels, and in what cases? If I perform a «check» is there a cache of bad blocks that need to be fixed that can quickly be repaired by executing the «repair» command? Or would it go through the entire array again? I’m working with new drives, and haven’t come across any bad blocks to test this with.
check just reads everything and doesn't trigger any writes unless a read error is detected, in which case the normal read-error handling kicks in. So it can be useful on a read-only array.
repair does the same, but when it finds an inconsistency it corrects it by writing something. If a raid personality has not been taught to specifically understand check, then a check run will effect a repair. I think 2.6.17 will have all personalities doing the right thing.
check doesn’t keep a record of problems, just a count. repair will reprocess the whole array.
2. CAN «CHECK» BE RUN ON A DEGRADED ARRAY (say with N out of N+1 disks on a RAID level 5)? I can test this out, but was it designed to do this, versus «REPAIR» only working on a full set of active drives? Perhaps «repair» is assuming that I have N+1 disks so that parity can be WRITTEN?
No, check on a degraded raid5, or a raid6 with 2 missing devices, or a raid1 with only one device will not do anything. It will terminate immediately. After all, there is nothing useful that it can do.
3. RE: FEEDBACK/LOGGING: it seems that I might see some messages in dmesg logging output such as «raid5:read error corrected!», is that right? I realize that «mismatch_count» can also be used to see if there was any «action» during a «check» or «repair.» I’m assuming this stuff doesn’t make its way into an email.
You are correct on all counts. mdadm --monitor doesn't know about this yet. ((writes notes in mdadm todo list)).
4. DOES «REPAIR» PERFORM READS TO CHECK THE ARRAY, AND THEN WRITE TO THE ARRAY *ONLY WHEN NECESSARY* TO PERFORM FIXES FOR CERTAIN BLOCKS? (I know, it's sorta a repeat of questions 1 and 2).
repair only writes when necessary. In the normal case, it will only read every block.
5. IS THERE ILL-EFFECT TO STOP EITHER «CHECK» OR «REPAIR» BY ISSUING «IDLE»?
6. IS IT AT ALL POSSIBLE TO CHECK A CERTAIN RANGE OF BLOCKS? And to keep track of which blocks were checked? The motivation is to start checking some blocks overnight, and to pick up where I left off the next night.
Not yet. It might be possible one day.
7. ANY OTHER CONSIDERATIONS WHEN «SCRUBBING» THE RAID?
I/O Schedulers
Starting with version 2.6, the Linux kernel offers a choice of I/O schedulers. The anticipatory scheduler seems to be sub-optimal under heavy (e.g. resync) loads. If your kernel has the CFQ scheduler compiled in, it can be used during a resync.
From the command line you can see which schedulers are supported and change it on the fly (remember to do it for all devices composing the RAID):
# cat /sys/block/hda/queue/scheduler
noop [anticipatory] deadline cfq
# echo cfq > /sys/block/hda/queue/scheduler
Otherwise you can recompile your kernel and set CFQ as the default I/O scheduler (CONFIG_DEFAULT_CFQ=y in Block layer, IO Schedulers, Default I/O scheduler), or simply pass elevator=cfq on the kernel command line at boot time (see the Documentation/kernel-parameters.txt document corresponding to your kernel version).
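For example, a small shell loop can switch every member of an array over to CFQ in one go (the disk names below are placeholders for the devices that actually make up your array):
for disk in sda sdb sdc; do
    echo cfq > /sys/block/$disk/queue/scheduler
done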