Monitoring RAID Health with the i3 Window Manager
I’ve been using this setup for a few years already, but I have never written it down, so let’s fix that. When you use a RAID to survive HDD failures (e.g. RAID 1 or RAID 5), you need some way to get notified when a disk fails. The RAID doesn’t help at all if one disk fails, you assume everything is OK because the system still works, and then the second disk fails too.
To my knowledge, the built-in way with mdadm is to get a mail notification.
Sending mails from my home system would not be my first choice, so I wanted
another solution.
I’m using the i3 window manager with i3blocks in the status bar. i3blocks
allows me to run external scripts to display some information in the status
bar. I’ve added the following block to my i3blocks.conf:
[raid]
label=RAID
command=$HOME/.dotfiles/i3/i3blocks/raid localhost md127
interval=600
markup=pango
This does the following: it gives the block the label RAID, which is just text
that is always displayed. Every 10 minutes (600 seconds) it executes the script
$HOME/.dotfiles/i3/i3blocks/raid with the arguments localhost and md127 and
displays the output of that command in the status bar as well. markup=pango
lets me return Pango markup (HTML-like <span> tags) from the script to display
colors.
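To make the output format concrete, here is a minimal sketch of what a block command has to print when markup=pango is set: a single line of text wrapped in a <span> tag. The helper name pango_span is made up for this illustration and is not part of my actual script.

```shell
# Hypothetical helper (not in the original script): wrap a piece of text in a
# Pango <span> so the status bar renders it in the given colour.
pango_span() {
    printf '<span color="%s">%s</span>\n' "$1" "$2"
}

pango_span '#00FF00' 'OK'    # prints <span color="#00FF00">OK</span>
pango_span 'red' '[U_]'      # prints <span color="red">[U_]</span>
```

Anything the script writes this way ends up next to the RAID label in the bar.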
I’m setting localhost for the host because my script is generic enough to
also read the RAID status of other systems. Let’s have a look at the script;
it’s quite simple as well.
#!/usr/bin/env bash
host=$1
device=$2

# Stop early if an argument is missing: print the error and skip the lookup.
if [[ -z "$host" ]]; then
    echo '<span color="red">missing host</span>'
    exit 0
elif [[ -z "$device" ]]; then
    echo '<span color="red">missing device</span>'
    exit 0
fi

# Grab the device line plus the line after it and take the last column
# of the "blocks" line, e.g. [UU].
if [[ "$host" == "localhost" ]]; then
    stat=$(grep -A1 "$device" /proc/mdstat | awk '/blocks/ {print $NF}')
else
    stat=$(ssh "$host" "grep -A1 '$device' /proc/mdstat | awk '/blocks/ {print \$NF}'")
fi

# Only the exact string [UU] (both disks up) counts as healthy.
if [[ "$stat" == "[UU]" ]]; then
    echo '<span color="#00FF00">OK</span>'
else
    echo "<span color=\"red\">$stat</span>"
fi
First, we fetch the input parameters and perform some validation on them.
If the host is localhost, we can read /proc/mdstat directly; otherwise we
connect to the remote host via SSH. I’m using an SSH key without a password
for this.
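One possible variation, sketched here as an assumption rather than what my script does: fetch the raw /proc/mdstat from the host and do the parsing locally, which avoids escaping the awk program inside the SSH command. The function name fetch_mdstat and the host name nas.example are hypothetical. BatchMode and ConnectTimeout are standard OpenSSH client options; the timeout value is just a guess at something sensible.

```shell
# Hypothetical helper: print the contents of /proc/mdstat for the given host.
# BatchMode=yes makes ssh fail instead of prompting for a password, and
# ConnectTimeout=5 gives up after five seconds if the host is unreachable.
fetch_mdstat() {
    if [ "$1" = "localhost" ]; then
        cat /proc/mdstat
    else
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" cat /proc/mdstat
    fi
}

# Usage (nas.example is a placeholder host name):
#   fetch_mdstat localhost
#   fetch_mdstat nas.example
```

With a key that has no password, a failure in batch mode simply yields empty output instead of a block that hangs while waiting for input.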
In /proc/mdstat we’re looking for the line containing the device name, e.g.
md127, and the line following it. One of the two lines (surprise: the second
one, of course, otherwise we wouldn’t have fetched it with grep) contains the
term blocks, and we take the last column of that line.
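To see the parsing in action, the same grep/awk pipeline can be run against a sample instead of the real file. The mdstat text below is made up for illustration (sizes, partitions and the failed disk are assumptions), and the function name md_status is hypothetical.

```shell
# Read mdstat-formatted text on stdin and print the last column of the
# "blocks" line that follows the given device, e.g. [UU].
md_status() {
    grep -A1 "$1" | awk '/blocks/ {print $NF}'
}

# Made-up sample of a degraded two-disk RAID 1.
md_status md127 <<'EOF'
Personalities : [raid1]
md127 : active raid1 sdb1[1] sda1[0](F)
      976630464 blocks super 1.2 [2/1] [_U]
      bitmap: 2/8 pages [8KB], 65536KB chunk

unused devices: <none>
EOF
# prints [_U]
```

grep keeps the md127 line and the one after it, and awk prints the last field of the only line that mentions blocks.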
All this line parsing sounds a bit complex. What if something goes wrong? We
expect the last column to equal the string [UU] (I have RAID 1 with 2 disks
everywhere at the moment). If we find anything else, we treat the RAID as
failed. I think considering exactly one string correct and everything else a
failure should save us from many possible errors.
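The effect of this allow-list approach can be sketched in isolation. The function name render_status is hypothetical, and printing "unknown" for an empty status is an addition for this example only; my script prints whatever it found, even if that is nothing.

```shell
# Only the exact string [UU] is treated as healthy; every other input,
# including empty output from a failed lookup, is shown in red.
render_status() {
    if [ "$1" = "[UU]" ]; then
        echo '<span color="#00FF00">OK</span>'
    else
        # "unknown" for an empty status is an assumption made for illustration.
        printf '<span color="red">%s</span>\n' "${1:-unknown}"
    fi
}

render_status '[UU]'     # prints <span color="#00FF00">OK</span>
render_status '[U_]'     # prints <span color="red">[U_]</span>
render_status ''         # prints <span color="red">unknown</span>
```

A failed SSH connection, a renamed device or a changed mdstat layout all land in the red branch, so a parsing problem can never be mistaken for a healthy array. The flip side is that an array with more than two disks would also show up red until the expected string is adjusted.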
Finally, we output either OK in green or the status we found in red.
I wrote this script several years ago. There are probably a few ways to improve it, but it has been running without problems so far and I’m happy with it. Over the years I’ve learnt that not every single script has to be perfect as long as your overall composability is fine. If the script has a good enough external interface (in this case its input arguments and its stdout response), it can always be improved when needed.
I do not maintain a comments section. If you have any questions or comments regarding my posts, please do not hesitate to send me an e-mail to blog@stefan-koch.name.