Posted: 2023-01-08 21:15:00 by Alasdair Keyes
After my failed drive on New Year's Day, I ordered a new disk and rebuilt the array. Thankfully, due to the monthly array check run by the OS, all the data on the three remaining drives was readable and the array is complete again.
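For what it's worth, on Debian-derived systems that monthly check is usually mdadm's checkarray cron job; the underlying mechanism is the md driver's sync_action file in sysfs. A sketch of triggering one manually, assuming the array is /dev/md0 and you are root:

```shell
# Kick off a consistency check of the array (assumes /dev/md0; run as root).
echo check > /sys/block/md0/md/sync_action

# Watch progress, and see how many mismatched blocks have been found so far:
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt
```
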
I put the failed disk into another machine and ran the following badblocks command on it.
badblocks -o sdb_badblocks.txt -b 4096 -w -s /dev/sdb
I used the destructive test as the data was not needed now that the array was back to full strength. Incidentally, using a block size of 4096 over the default 1024 seemed to provide about a 2x-3x speed increase.
Even with that, the 2TB disk took just over 33 hours for a full write pass and a confirmation read pass.
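As a sanity check on that run time: taking the post's figures at face value, a full write pass plus a read pass over a 2 TB disk is roughly 4 TB of I/O, which over 33 hours averages out to about 33 MB/s:

```shell
# Back-of-the-envelope: 2 TB written + 2 TB read back in ~33 hours.
total_bytes=$((2 * 2000 * 1000 * 1000 * 1000))  # write pass + read pass
seconds=$((33 * 3600))
mbps=$(( total_bytes / seconds / 1000000 ))
echo "Average throughput: ~${mbps} MB/s"
```

A useful baseline if you want to estimate how long a badblocks pass over a larger disk will take.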
At the end of it, the full write and read passes completed with no errors reported. This is frustrating, as mdadm had clearly detected a read error when it rejected the disk - the error was logged in syslog.
I thought that maybe the bad sectors had been remapped by the firmware during the badblocks test, but checking the SMART stats again showed that no errors were reported and no re-allocations had been logged (ID# 5 below).
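The attribute table below comes from smartmontools; a command along these lines produces it (assuming the disk is still /dev/sdb in the test machine):

```shell
# Vendor-specific SMART attribute table only:
smartctl -A /dev/sdb

# Or the full picture, including the error and self-test logs:
smartctl -x /dev/sdb
```
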
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       103
  3 Spin_Up_Time            0x0007   168   168   024    Pre-fail  Always       -       342 (Average 311)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       75
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   086   086   000    Old_age   Always       -       99078
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       75
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       989
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       989
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 16/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
So I'm not sure why the error occurred. Maybe the controller is bad, the cable is dodgy, or a stray cosmic ray hit my server somewhere and caused some kind of CRC error (yes, it can and does happen). The server has ECC memory, so bit flips in RAM should have been detected had they occurred.
Interestingly, this is the first failed disk I've had in a Linux mdadm array in over 20 years of running servers (I've had plenty of failed disks behind Dell PERC controllers and whatever controllers Supermicro jams into their servers!). All previous arrays were torn down before a disk failed.
As such, this was also the first time I've had to rebuild an array. This particular RAID had been running for over 11 years before the disk failed. For those interested, I followed this Red Hat post on the steps to take: https://www.redhat.com/sysadmin/raid-drive-mdadm.
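For reference, the replacement procedure boils down to a handful of mdadm commands. A sketch, assuming the array is /dev/md0 and the failed member is /dev/sdb1 (your device names will differ):

```shell
# Mark the member as failed (if mdadm hasn't already) and remove it:
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

# After swapping in the new disk and partitioning it to match the
# other members, add it and let the array resync:
mdadm --manage /dev/md0 --add /dev/sdb1

# Watch the rebuild progress:
cat /proc/mdstat
```
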
Should something similar happen again, I think I would run badblocks in non-destructive mode on the disk in situ; if it passed, I'd re-add it to the array and let it rebuild before looking at buying a new disk.
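badblocks' -n flag runs a non-destructive read-write test: it reads each block, writes a test pattern, verifies it, then restores the original data. It's slower than -w but preserves the contents. A sketch, reusing the options from the run above (the disk must not be mounted or active in the array while this runs):

```shell
badblocks -n -s -b 4096 -o sdb_badblocks.txt /dev/sdb
```
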
© Alasdair Keyes