Death, taxes, and hard drive failure. Hang around long enough and you're going to experience all three; it's just a fact of life. To help mitigate the risk of that last one we can use SnapRAID, a free and open-source software (FOSS) solution for protecting disk arrays with snapshot parity.
I won't go into a detailed explanation of how it works; for now, just know its main use case is digital media libraries, and it's not a drop-in replacement for traditional RAID. If you want to know more about the project, check out its website: https://www.snapraid.it.
Now, having said that, I can't actually recall ever personally experiencing a hard drive failure [knock on wood]. But like I said, sooner or later it's bound to happen. After setting up an OpenMediaVault (OMV) NAS with UnionFS and SnapRAID, I ran the `snapraid smart` command and was presented with this ominous report:
```
SnapRAID SMART report:

   Temp  Power   Error  FP  Size
      C OnDays   Count       TB  Serial    Device    Disk
 -----------------------------------------------------------------------
     34    354       0   4% 10.0  xxxxxxxx  /dev/sdc  wd10tbB
     31   2103       7  99%  4.0  xxxxxxxx  /dev/sdd  hd2
     32   2103       0   6%  4.0  xxxxxxxx  /dev/sde  hd3
     36    353       0   4% 10.0  xxxxxxxx  /dev/sdb  parity
      0      -       -  SSD  0.0  -         /dev/sda  -

The FP column is the estimated probability (in percentage) that the disk
is going to fail in the next year.

Probability that at least one disk is going to fail in the next year is 100%.
```
According to the report there's a 100% probability of a hard drive failure within a year. Looking at the failure probability (FP) column, we can see the culprit is `hd2`. Death, taxes, and hard drive failure.
Now Backblaze, a backup service provider with enterprise and consumer solutions, has written a few articles on hard drive failure. The most important takeaway is the set of SMART values they use to predict hard drive failure. TL;DR: they use the SMART attributes below:
```
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   099   099   010    -    584
187 Reported_Uncorrect      -O--CK   093   093   000    -    7
188 Command_Timeout         -O--CK   100   092   000    -    14 14 22
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
198 Offline_Uncorrectable   ----C-   100   100   000    -    0
```
Admittedly, it's been a few months since I discovered this impending failure, and if my memory serves me, the SMART 5 Reallocated Sector Count value hasn't increased. To further summarize the articles above: you want these values to be 0, and the more of them that are > 0, the worse your situation is. Here we see three values > 0.
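If you want to spot-check these attributes on your own drives, `smartctl` (from smartmontools) prints the full attribute table, and a quick filter narrows it down to the five Backblaze indicators. The `/dev/sdd` path is an assumption; substitute your own device:

```shell
# Print only the five SMART attributes Backblaze correlates with failure.
# /dev/sdd is a placeholder -- point this at the drive you want to check.
smartctl -A /dev/sdd | grep -E '^ *(5|187|188|197|198) '
```

Running this periodically (and watching whether the raw values grow) is a cheap early-warning habit.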
Another interesting Backblaze article discusses the life expectancy of hard drives; spoiler alert, it's six years. Looking back at `hd2`'s SMART data, we can see this drive has just about made it:
```
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  9 Power_On_Hours          -O--CK   043   043   000    -    50475
```
So: 50,475 hours / 24 = 2,103 days, and 2,103 days / 365 ≈ 5.76 years. Not too shabby!
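The conversion is easy to sanity-check in the shell:

```shell
# Convert SMART attribute 9 (power-on hours) to days and years.
hours=50475
days=$((hours / 24))                                   # integer days
years=$(awk -v h="$hours" 'BEGIN { printf "%.2f", h / 24 / 365 }')
echo "$days days, $years years"                        # -> 2103 days, 5.76 years
```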
Somewhere I recall reading that if these values are stable and not increasing, you should be fine. However, with three bad values and a drive that has just about reached its average life expectancy, it's a good time to replace it. Let's get back to the task at hand!
With SnapRAID, your parity drive needs at least the capacity of the largest data drive in the array. Since I use SnapRAID as intended, for my media library, and the failing drive is only 4TB, on the smaller side by today's standards, it's a great time to increase the density of the replacement drive. As luck would have it, a sale on the 12TB WD Easystore popped up. Perfect timing! Now for the shuffling.
Since the new drive is now my largest it will have to become the parity drive. This means I'll have to replace the current 10TB parity drive with the new 12TB drive, then replace the failing 4TB drive with the former parity drive. How can we do all this without messing up the existing SnapRAID configuration? Luckily SnapRAID provides pretty clear instructions for replacing both a parity and data disk.
First off, you need to run `snapraid sync` to ensure the parity information is up to date. Depending on the amount of data added since your last sync, this could take a while. While your array is syncing, grab your favorite shucking tool, install the drive, then format it and initialize the filesystem. Again, depending on the size of the drive, this may take a bit.
Once those are finished, mount the new drive and copy the parity information over. This is as simple as copying the `snapraid.parity` file to the new drive; per the SnapRAID FAQ, I used `ddrescue`. As you can see from the output below, this is going to take a very long time if you have a lot of data on your media drives. I'm not sure how much faster another copy utility would be, or whether you'd even want to try: `ddrescue` attempts to recover from any read errors, and this parity data is very important, as it will be used to rebuild your data if you suffer a disk failure.
```
~ % ddrescue /srv/dev-disk-by-label-wd10tbA/snapraid.parity /srv/dev-disk-by-label-wd12tbA/snapraid.parity ./ddrescue.log
     ipos:    8036 GB, non-trimmed:        0 B,  current rate:  30867 kB/s
     opos:    8036 GB, non-scraped:        0 B,  average rate:  93163 kB/s
non-tried:        0 B,     errsize:        0 B,      run time: 23h 12m 13s
  rescued:    8036 GB,      errors:        0,  remaining time:         n/a
percent rescued: 100.00%
time since last successful read:          0s
Finished
```
Once `ddrescue` finished, I added the new parity drive to the SnapRAID configuration via the OMV UI and disabled the old parity. I ran `snapraid sync` to ensure the array was healthy with the new parity drive; in the output below we can see it recognized the parity drive swap.
```
~ % snapraid sync
Self test...
Loading state from /srv/dev-disk-by-label-wd10tbB/snapraid.content...
UUID change for parity 'parity' from 'OLD-UUID-REDACTED' to 'NEW-UUID-REDACTED'
Scanning disk wd10tbB...
Scanning disk hd2...
Scanning disk hd3...
Using 949 MiB of memory for the file-system.
Initializing...
Resizing...
Saving state to /srv/dev-disk-by-label-wd10tbB/snapraid.content...
Saving state to /srv/dev-disk-by-label-hd2/snapraid.content...
Saving state to /srv/dev-disk-by-label-hd3/snapraid.content...
Verifying /srv/dev-disk-by-label-wd10tbB/snapraid.content...
Verifying /srv/dev-disk-by-label-hd2/snapraid.content...
Verifying /srv/dev-disk-by-label-hd3/snapraid.content...
Syncing...
Using 32 MiB of memory for 32 cached blocks.
100% completed, 3314 MB accessed in 0:00

 wd10tbB 76% | *********************************************
     hd2  3% | *
     hd3  0% |
  parity  0% |
    raid  2% | *
    hash  1% | *
   sched 15% | *********
    misc  0% |
             |_____________________________________________________________
                wait time (total, less is better)

Everything OK
Saving state to /srv/dev-disk-by-label-wd10tbB/snapraid.content...
Saving state to /srv/dev-disk-by-label-hd2/snapraid.content...
Saving state to /srv/dev-disk-by-label-hd3/snapraid.content...
Verifying /srv/dev-disk-by-label-wd10tbB/snapraid.content...
Verifying /srv/dev-disk-by-label-hd2/snapraid.content...
Verifying /srv/dev-disk-by-label-hd3/snapraid.content...
```
Now that the array has the new parity disk and SnapRAID reports "Everything OK", it's time to move on to replacing the failing data disk with the old 10TB parity drive. Make sure to wipe the drive and initialize a filesystem like before. Mount the drive and simply copy all the data from the failing drive to the replacement. SnapRAID's example uses `cp -av`, but mentions you can use rsync 3.1.0 or newer. I opted for rsync, as it can report progress information. Be sure to provide the archive option `-a`, as this preserves symlinks, ownership, permissions, and other important attributes. Since I know this will be a long-running task, I use `nohup ... &` so the process continues to run in the background even if I log out.
```
nohup rsync -aP /srv/dev-disk-by-label-hd2/. /srv/dev-disk-by-label-wd10tbA &
```
Once the data has been moved, update the SnapRAID config via the OMV UI again and run `snapraid diff` to ensure the copy was successful. Any differences listed here are likely due to errors while copying the data or changes made to the array since the last sync. Once you're confident the data is intact, you're finished. Congratulations!
Now that we're finished, I want to review two things I learned about SnapRAID while upgrading my array. First are the drive names in the SnapRAID configuration. When I created the array I just used the drive labels as the names. This is probably not the best strategy: the name is independent of the drive label, and it seems to be immutable, or at least not easily changed. Adding the new parity drive was not an issue, as an array can have more than one parity; in fact, this increases the durability of your array, giving it the ability to recover from multiple drive failures. Ultimately I backed out the original parity drive to use it for data, and this is where I discovered the naming issue. I removed `hd2` from the array and added the former parity drive under its own label, after which `snapraid diff` was reporting a missing `hd2` disk. The solution was to add the `wd10tbA` drive to the array but name it `hd2` in the SnapRAID config. In the future it will be better to give disks more generic names to avoid this confusion, or give them a more purposeful name.
Another PEBKAC moment was scheduling the scrub maintenance tasks. In the SnapRAID OMV UI there is a "Diff Script Settings" section that I configured to run a daily sync and weekly scrub. However, when checking the array status, it didn't appear that either was running as configured. Eventually I created my own scheduled job running `snapraid touch; snapraid sync`. Then I realized you must hit the "Scheduled diff" button, then save; this creates the daily scheduled task for you. Initially I thought the Scheduled diff menu was just for configuring email preferences. Now, you're certainly not required to use the provided diff script, and you can gain more control by scheduling your own jobs, but at least that mystery is solved.
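For anyone who would rather skip the diff script entirely, a plain cron file works fine. This is a sketch: the times, the `touch` before `sync`, and the scrub options (`-p` for the percentage of the array to check, `-o` for the minimum block age in days) are my own choices, so tune them to your array.

```
# /etc/cron.d/snapraid -- nightly sync at 03:00, weekly scrub Sunday 04:00
0 3 * * *  root  snapraid touch && snapraid sync
0 4 * * 0  root  snapraid scrub -p 12 -o 10
```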