Monitoring RAID Health with the i3 Window Manager

I think I’ve been using this setup for a few years already, but I have never written it down. So let’s fix that. When you use a RAID to survive HDD failures (e.g. RAID 1 or RAID 5), you need some way to get notified about hardware failures. The RAID doesn’t help at all if one disk fails, you think everything’s OK because the system still works, and then the second disk fails.

To my knowledge the built-in way with mdadm is to get a mail notification. Sending mails from my home system would not be my first choice, so I wanted another solution.

I’m using the i3 window manager with i3blocks in the status bar. i3blocks allows me to run external scripts to display some information in the status bar. I’ve added the following block to my i3blocks.conf:

[raid]
label=RAID
command=$HOME/.dotfiles/i3/i3blocks/raid localhost md127
interval=600
markup=pango

This does the following things: It gives the block the label RAID, which is just text that is always displayed. Every 10 minutes (600 seconds) it executes the script $HOME/.dotfiles/i3/i3blocks/raid with the arguments localhost and md127 and displays the output of that command in the status bar, too. markup=pango allows me to return Pango markup (HTML-like span tags) from the script to display colors.

I’m setting localhost for the host because my script is generic enough to also read the RAID status of other systems. Let’s have a look at the script; it’s quite simple as well.

#!/usr/bin/env bash

host=$1
device=$2

# Validate the arguments and stop early if one is missing
if [[ -z "$host" ]]; then
    echo '<span color="red">missing host</span>'
    exit 0
elif [[ -z "$device" ]]; then
    echo '<span color="red">missing device</span>'
    exit 0
fi

# Extract the status column (e.g. [UU]) from the line after the device line
if [[ "$host" == "localhost" ]]; then
    stat=$(grep -A1 "$device" /proc/mdstat | awk '/blocks/ {print $NF}')
else
    stat=$(ssh "$host" "grep -A1 '$device' /proc/mdstat | awk '/blocks/ {print \$NF}'")
fi

if [[ "$stat" == "[UU]" ]]; then
    echo '<span color="#00FF00">OK</span>'
else
    # Quote the value and show a placeholder if nothing could be parsed
    echo '<span color="red">'"${stat:-unknown}"'</span>'
fi

First, we fetch the input parameters and perform some validation on them. If we’re on localhost, we can read /proc/mdstat directly; otherwise we connect to the remote host via SSH. I’m using an SSH key without a passphrase for this.
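Because the block runs unattended, the SSH call should never sit waiting for a password prompt. The following is only a minimal sketch of a non-interactive remote read, assuming OpenSSH on the client. read_remote_mdstat is a hypothetical helper name and the 5-second timeout is an arbitrary choice; neither is part of my actual script.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print /proc/mdstat of a remote host.
# BatchMode=yes makes ssh fail instead of asking for a password,
# ConnectTimeout=5 gives up after 5 seconds if the host is unreachable.
read_remote_mdstat() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" cat /proc/mdstat
}
```

If the connection fails, the helper prints nothing, so a later comparison against [UU] does not match and the block shows a failure instead of a stale OK.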

In /proc/mdstat we’re looking for the line containing the device name, e.g. md127, and the line following it. One of the two lines (surprise, the second one of course, otherwise we wouldn’t have fetched it with grep) contains the term blocks, and we take the last column of that line.
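To make the parsing concrete, here is a small sketch that runs the same grep and awk pipeline over a sample of /proc/mdstat. The sample text is made up for illustration (device names and block counts are assumptions), and parse_status is a hypothetical helper name that does not exist in the script above.

```shell
#!/usr/bin/env bash

# Illustrative /proc/mdstat content; the devices and sizes are invented.
sample_mdstat='Personalities : [raid1]
md127 : active raid1 sdb1[1] sda1[0]
      976630464 blocks super 1.2 [2/2] [UU]

unused devices: <none>'

# Hypothetical helper: read mdstat text on stdin and print the status
# column, i.e. the last field of the "blocks" line after the device line.
parse_status() {
    grep -A1 "$1" | awk '/blocks/ {print $NF}'
}

printf '%s\n' "$sample_mdstat" | parse_status md127
```

With this sample the last command prints [UU]. On a degraded array the same column would read something like [U_] instead.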

All this line parsing sounds a bit complex, so what if something goes wrong? We’re expecting the last column to equal the string [UU] (I have RAID 1 with 2 disks everywhere at the moment). If we find anything else, we treat the RAID as failed. I think considering exactly one string correct and everything else a failure should save us from many possible errors.
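The exact comparison against [UU] only fits arrays with two disks. If the same idea had to cover arrays with more members, one possible sketch is to accept only a bracketed run of U characters and still treat everything else as a failure. is_healthy is a hypothetical helper, not something my script contains.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if the status is a bracketed run of
# one or more "U" characters, e.g. [UU] or [UUU]. Anything else, including
# an empty string or a degraded status such as [U_], counts as a failure.
is_healthy() {
    printf '%s\n' "$1" | grep -Eqx '\[U+\]'
}

for status in "[UU]" "[UUU]" "[U_]" ""; do
    if is_healthy "$status"; then
        echo "$status -> OK"
    else
        echo "$status -> FAILED"
    fi
done
```

The trade-off is that this variant no longer checks the number of disks, which the exact [UU] comparison does implicitly.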

Finally, we’re outputting the status either as OK in green or the found status in red.

I wrote this script several years ago. There are probably a few ways to improve it, but it’s been running without problems so far and I’m happy with it. Over the years I’ve learnt that not every single script has to be perfect as long as your overall composability is fine. If the script has a good enough external interface (in this case its input arguments and its stdout response), it can always be improved when needed.

I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.