“If it ain’t broken, don’t fix it”… or should I?

For a whole week (from the 2nd to the 8th of December), visitors to my web site got this message instead:

Due to a disk failure, the yeti.selfip.net domain is currently running on a back-up server, and some features are not fully functional.

I will try to restore this site to its normal state as soon as possible. Please excuse the delay.

Here are some explanations.

This article is also available in French.

Everything began with the replacement of the hard disk on the server… My server is a SheevaPlug eSata, to which a Cooler Master X-Craft 350 external enclosure (the eSata model) was connected, with a Samsung HD103UJ “Spinpoint F1 Sata 1TB” hard disk inside.

I had been using the Samsung hard disk for three years when I finally decided to enable smartd on the server. I then saw that this hard disk had a stable temperature of 65°C (with passive cooling, provided by the aluminium casing); this is a high temperature, all the more so since this hard disk had probably been running at that temperature for three years, the age at which hard disks’ failure rate tends to grow.

Better safe than sorry, thought I. So on October the 15th, I bought a new hard disk that perfectly suited my needs: the Western Digital WD30EFRX “RED Sata 3TB”. Thanks to LVM, moving all the data from one disk to the other went easily, with both of them plugged into the Sata connectors of the desktop PC. It was quickly done, and soon the server was back up, with the new hard disk. Thanks to LVM again, the change of hard disk went unnoticed and the server brought each service up as if nothing had happened.
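For readers who have never done such a move, here is a minimal sketch of one common way to migrate an LVM volume group from one disk to another. The post does not say which commands were actually used, so this is only an illustration: the volume group name `vg0` and the partitions `/dev/sdb1` and `/dev/sdc1` are hypothetical, and the helper prints the commands by default instead of running them.

```shell
#!/bin/sh
# Dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# lvm_migrate VG OLD_PV NEW_PV
# All three arguments are placeholders to adapt to the real setup.
lvm_migrate() {
    vg=$1
    old_pv=$2
    new_pv=$3
    run pvcreate "$new_pv" &&          # initialise the new partition as a physical volume
    run vgextend "$vg" "$new_pv" &&    # add it to the existing volume group
    run pvmove "$old_pv" "$new_pv" &&  # move every extent off the old disk
    run vgreduce "$vg" "$old_pv" &&    # drop the old disk from the volume group
    run pvremove "$old_pv"             # wipe the LVM label from the old disk
}

lvm_migrate vg0 /dev/sdb1 /dev/sdc1
```

Once the printed commands look right, running the function as root with `DRY_RUN=0` would perform the move for real. Since the logical volumes keep their names, the system using them should not notice the change of disk.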

The choice of hard disk proved to be a good one: this new hard disk is faster and more silent, its power consumption is lower, and its temperature was 37°C (and even less now; read further).

One month after the disk was changed, I saw strange lines in the log files:

Nov 19 20:55:39 server2 kernel: [2427574.318700] ata2: exception Emask 0x10 SAct 0x0 SErr 0x100000 action 0x6 frozen
Nov 19 20:55:39 server2 kernel: [2427574.326317] ata2: edma_err_cause=00000020 pp_flags=00000000, SError=00100000
Nov 19 20:55:39 server2 kernel: [2427574.333594] ata2: SError: { Dispar }
Nov 19 20:55:39 server2 kernel: [2427574.337370] ata2: hard resetting link
Nov 19 20:55:39 server2 kernel: [2427574.843240] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
Nov 19 20:55:39 server2 kernel: [2427574.883438] ata2.00: configured for UDMA/133
Nov 19 20:55:39 server2 kernel: [2427574.887911] ata2: EH complete

I could not swear that such log lines had never shown up before, as I previously did not check my log files as much as I do now. These lines worried me since they seemed to relate to the new hard disk. So I opened a case with Western Digital’s warranty service. A week later, they told me to check my disk using their Windows-only Data Lifeguard tool (even though I had clearly described my Linux-only configuration).

So, for this disk-check only, I installed Windows on my desktop PC, and then I installed their tool. I turned the server off, then I took the hard disk enclosure, switched it to USB mode, and plugged it into the desktop PC running Windows. The tool immediately saw the hard disk, but it displayed meaningless data, such as 746MiB for the hard disk capacity!
Seeing that, I refrained from running any diagnostic test; instead I switched everything off immediately (though cleanly). But the deed was done, and this misinformed disk layout, as displayed by WD’s tool on Windows, had become the new reality: when I plugged the hard disk into the Sata connector of the desktop PC and rebooted the latter to Linux, gdisk (a utility for hard disks with a GPT partition table) warned me that the 3TB LVM partition was too big for this “746MiB hard disk”…
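A possible explanation for that number, offered only as a hypothesis that was not verified on the hardware: some older USB bridges keep only the low 32 bits of the sector count. Taking the sector count that the kernel reports for this disk (5860533168 sectors of 512 bytes) and wrapping it at 2^32 gives a capacity whose figure is 746, although in GiB rather than MiB.

```shell
#!/bin/sh
# Sector count as reported by the kernel for the WD30EFRX: 5860533168 x 512 bytes.
sectors=5860533168
sector_size=512

# Hypothesis: the USB bridge keeps only the low 32 bits of the sector count.
wrapped=$(( sectors % 4294967296 ))

# Capacities in GiB (1 GiB = 1073741824 bytes), rounded down.
full_gib=$(( sectors * sector_size / 1073741824 ))
wrapped_gib=$(( wrapped * sector_size / 1073741824 ))

echo "full: $full_gib GiB, wrapped: $wrapped_gib GiB"
# prints "full: 2794 GiB, wrapped: 746 GiB"
```

If this is what happened, the unit shown by the tool may have been GiB, and the fault would lie with the enclosure's USB mode rather than with the disk itself.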

Many a reading later, my quest for a solution to this issue got a hint from hardforum.com: the expert command e in the gdisk tool. The article about this tool is worth reading too.
But then —to my dismay— even though the hard disk was “repaired” and seemed to work flawlessly inside the desktop PC, the server did not even register its presence when the hard disk went back to its proper place!
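The expert command e of gdisk relocates the backup GPT data structures to the end of the disk; sgdisk, the scriptable companion of gdisk, offers the same operation as `--move-second-header`. The following is a sketch of what such a repair could look like, not a record of the exact steps taken: the device name `/dev/sdX` and the backup file path are placeholders, and the helper only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# repair_gpt DISK BACKUP_FILE
# DISK and BACKUP_FILE are placeholders; check the device name twice.
repair_gpt() {
    disk=$1
    backup=$2
    run sgdisk --backup="$backup" "$disk" &&     # save the current GPT before touching it
    run sgdisk --move-second-header "$disk" &&   # same effect as gdisk's expert command e
    run sgdisk --verify "$disk" &&               # look for remaining inconsistencies
    run sgdisk --print "$disk"                   # show the resulting partition table
}

repair_gpt /dev/sdX /root/gpt-before-repair.bin
```

This only makes sense when the disk is attached in a way that reports its true size (here, the Sata connector of the desktop PC); otherwise the backup structures would be moved to the wrong place.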

Over the week, I brought back up the services from the “yeti.selfip.net” domain, bit by bit, on the desktop PC; I was not able to bring back the database, though, which accounted for the absence of this blog.

By chatting here and there and reading, I came to the conclusion that —praying for the well-being of the server itself— the issue had to be with the external eSata enclosure, or with the eSata cable. I bought an Icy Box IB-351StU3S-B, and thus replaced both the enclosure and the cable. Incidentally, this was again a good choice, as the hard disk temperature in that enclosure dropped from 37°C to 32°C.

On Sunday, the 8th of December, I was able to put the Western Digital hard disk in its new external enclosure, and the server went back up without a problem.

So, problem solved? I'm not certain…

I keep getting blocks of errors in my log files:

server2:~# { zcat $(ls -tr /var/log/syslog*.gz); cat /var/log/syslog{.?,}; } | grep -iE 'kernel|smart'
Nov 30 11:06:05 server2 smartd[2520]: Device: /dev/sda [SAT], previous self-test completed without error
Nov 30 11:36:05 server2 smartd[2520]: Device: /dev/sda [SAT], offline data collection was suspended by an interrupting command from host (auto:on)
Dec  1 02:06:05 server2 smartd[2520]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
Dec  1 02:36:05 server2 smartd[2520]: Device: /dev/sda [SAT], offline data collection was aborted by an interrupting command from host (auto:on)
Dec  1 02:36:05 server2 smartd[2520]: Device: /dev/sda [SAT], previous self-test completed without error
Dec  2 02:06:05 server2 smartd[2520]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
Dec  2 02:36:05 server2 smartd[2520]: Device: /dev/sda [SAT], previous self-test completed without error
Dec  2 20:11:40 server2 smartd[2520]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD30EFRX_68EUZN0-WD_WMC4N0515205.ata.state
Dec  2 20:11:40 server2 smartd[2520]: smartd is exiting (exit status 0)
… SERVER STOPPED
Dec  8 11:44:06 server2 kernel: [   22.603521] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
Dec  8 11:44:06 server2 kernel: [   22.643549] ata2.00: ATA-9: WDC WD30EFRX-68EUZN0, 80.00A80, max UDMA/133
Dec  8 11:44:06 server2 kernel: [   22.650282] ata2.00: 5860533168 sectors, multi 0: LBA48
Dec  8 11:44:06 server2 kernel: [   22.693552] ata2.00: configured for UDMA/133
Dec  8 11:44:06 server2 kernel: [   22.698150] scsi 1:0:0:0: Direct-Access     ATA      WDC WD30EFRX-68E 80.0 PQ: 0 ANSI: 5
Dec  8 11:44:06 server2 kernel: [   22.727805] sd 1:0:0:0: [sda] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
Dec  8 11:44:06 server2 kernel: [   22.736645] sd 1:0:0:0: [sda] Write Protect is off
Dec  8 11:44:06 server2 kernel: [   22.741458] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
Dec  8 11:44:06 server2 kernel: [   22.741558] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec  8 11:44:06 server2 kernel: [   29.381431]  sda: sda1
Dec  8 11:44:06 server2 kernel: [   29.385231] sd 1:0:0:0: [sda] Attached SCSI disk
Dec  8 11:44:06 server2 kernel: [   29.399252] sd 1:0:0:0: Attached scsi generic sg0 type 0
…
Dec  8 11:44:25 server2 smartd[2548]: Configuration file /etc/smartd.conf parsed.
Dec  8 11:44:25 server2 smartd[2548]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Dec  8 11:44:25 server2 smartd[2548]: Device: /dev/sda [SAT], opened
Dec  8 11:44:25 server2 smartd[2548]: Device: /dev/sda [SAT], WDC WD30EFRX-68EUZN0, S/N:WD-WMC4N0515205, WWN:5-0014ee-0ae5404f3, FW:80.00A80, 3.00 TB
Dec  8 11:44:25 server2 smartd[2548]: Device: /dev/sda [SAT], not found in smartd database.
Dec  8 11:44:25 server2 smartd[2548]: Device: /dev/sda [SAT], enabled SMART Attribute Autosave.
Dec  8 11:44:25 server2 smartd[2548]: Device: /dev/sda [SAT], Read SMART Thresholds failed, ignoring -f Directive
Dec  8 11:44:25 server2 smartd[2548]: Device: /dev/sda [SAT], enabled SMART Automatic Offline Testing.
Dec  8 11:44:25 server2 smartd[2548]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Dec  8 11:44:25 server2 smartd[2548]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.WDC_WD30EFRX_68EUZN0-WD_WMC4N0515205.ata.state
Dec  8 11:44:25 server2 smartd[2548]: Monitoring 1 ATA and 0 SCSI devices
… SERVER RESTARTED
Dec  8 12:14:26 server2 smartd[2553]: Device: /dev/sda [SAT], old test of type L not run at Sat Dec  7 03:00:00 2013 CET, starting now.
Dec  8 12:14:26 server2 smartd[2553]: Device: /dev/sda [SAT], starting scheduled Long Self-Test.
Dec  8 12:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 120 to 117
Dec  8 12:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], self-test in progress, 90% remaining
Dec  8 13:14:26 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 117 to 116
Dec  8 18:14:26 server2 smartd[2553]: Device: /dev/sda [SAT], self-test in progress, 10% remaining
Dec  8 20:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 116 to 118
Dec  8 20:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], previous self-test completed without error
Dec  9 02:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
Dec  9 02:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], previous self-test completed without error
Dec  9 20:04:03 server2 kernel: [116440.764192] ata2: sata_mv: attempting PIO w/multiple DRQ: this may fail due to h/w errata
Dec  9 20:04:24 server2 kernel: [116461.572774] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec  9 20:04:24 server2 kernel: [116461.579952] ata2.00: failed command: SMART
Dec  9 20:04:24 server2 kernel: [116461.584190] ata2.00: cmd b0/d5:01:e1:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Dec  9 20:04:24 server2 kernel: [116461.584197]          res 40/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec  9 20:04:24 server2 kernel: [116461.598968] ata2.00: status: { DRDY }
Dec  9 20:04:24 server2 kernel: [116461.602752] ata2: hard resetting link
Dec  9 20:04:25 server2 kernel: [116462.112741] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
Dec  9 20:04:25 server2 kernel: [116462.192764] ata2.00: configured for UDMA/133
Dec  9 20:04:25 server2 kernel: [116462.197182] ata2: EH complete
Dec  9 20:04:25 server2 kernel: [116462.246113] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec  9 20:04:25 server2 kernel: [116462.254808] ata2.00: failed command: FLUSH CACHE EXT
Dec  9 20:04:25 server2 kernel: [116462.260194] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Dec  9 20:04:25 server2 kernel: [116462.260200]          res 58/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Dec  9 20:04:25 server2 kernel: [116462.274836] ata2.00: status: { DRDY DRQ }
Dec  9 20:04:25 server2 kernel: [116462.279199] ata2: hard resetting link
Dec  9 20:04:25 server2 kernel: [116462.792727] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
Dec  9 20:04:25 server2 kernel: [116462.872744] ata2.00: configured for UDMA/133
Dec  9 20:04:25 server2 kernel: [116462.877129] ata2.00: retrying FLUSH 0xea Emask 0x2
Dec  9 20:04:25 server2 kernel: [116462.892767] ata2: EH complete
Dec 10 02:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
Dec 10 02:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], previous self-test completed without error
Dec 10 15:07:26 server2 kernel: [185041.101938] ata2: sata_mv: attempting PIO w/multiple DRQ: this may fail due to h/w errata
Dec 10 15:07:46 server2 kernel: [185061.410546] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 10 15:07:46 server2 kernel: [185061.417724] ata2.00: failed command: SMART
Dec 10 15:07:46 server2 kernel: [185061.421962] ata2.00: cmd b0/d5:01:e1:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Dec 10 15:07:46 server2 kernel: [185061.421968]          res 40/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 10 15:07:46 server2 kernel: [185061.436740] ata2.00: status: { DRDY }
Dec 10 15:07:46 server2 kernel: [185061.440526] ata2: hard resetting link
Dec 10 15:07:47 server2 kernel: [185061.950516] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
Dec 10 15:07:47 server2 kernel: [185062.030535] ata2.00: configured for UDMA/133
Dec 10 15:07:47 server2 kernel: [185062.034955] ata2: EH complete
Dec 10 15:08:07 server2 kernel: [185082.409996] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 10 15:08:07 server2 kernel: [185082.417176] ata2.00: failed command: SMART
Dec 10 15:08:07 server2 kernel: [185082.421413] ata2.00: cmd b0/d5:01:e0:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Dec 10 15:08:07 server2 kernel: [185082.421420]          res 40/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 10 15:08:07 server2 kernel: [185082.436197] ata2.00: status: { DRDY }
Dec 10 15:08:07 server2 kernel: [185082.439988] ata2: hard resetting link
Dec 10 15:08:08 server2 kernel: [185082.949847] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
Dec 10 15:08:08 server2 kernel: [185083.029871] ata2.00: configured for UDMA/133
Dec 10 15:08:08 server2 kernel: [185083.034291] ata2: EH complete
Dec 10 15:08:13 server2 kernel: [185088.560383] ata2: sata_mv: attempting PIO w/multiple DRQ: this may fail due to h/w errata
Dec 10 15:08:34 server2 kernel: [185109.369032] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 10 15:08:34 server2 kernel: [185109.376207] ata2.00: failed command: SMART
Dec 10 15:08:34 server2 kernel: [185109.380448] ata2.00: cmd b0/d5:01:e1:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Dec 10 15:08:34 server2 kernel: [185109.380455]          res 40/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 10 15:08:34 server2 kernel: [185109.395236] ata2.00: status: { DRDY }
Dec 10 15:08:34 server2 kernel: [185109.399025] ata2: hard resetting link
Dec 10 15:08:35 server2 kernel: [185109.908992] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
Dec 10 15:08:35 server2 kernel: [185109.989016] ata2.00: configured for UDMA/133
Dec 10 15:08:35 server2 kernel: [185109.993437] ata2: EH complete
Dec 10 15:16:23 server2 kernel: [185578.574842] ata2: sata_mv: attempting PIO w/multiple DRQ: this may fail due to h/w errata
Dec 10 15:16:44 server2 kernel: [185599.393488] ata2: limiting SATA link speed to 1.5 Gbps
Dec 10 15:16:44 server2 kernel: [185599.398848] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 10 15:16:44 server2 kernel: [185599.406051] ata2.00: failed command: SMART
Dec 10 15:16:44 server2 kernel: [185599.410265] ata2.00: cmd b0/d5:01:e1:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Dec 10 15:16:44 server2 kernel: [185599.410271]          res 40/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 10 15:16:44 server2 kernel: [185599.425051] ata2.00: status: { DRDY }
Dec 10 15:16:44 server2 kernel: [185599.428826] ata2: hard resetting link
Dec 10 15:16:45 server2 kernel: [185599.933452] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl F310)
Dec 10 15:16:45 server2 kernel: [185600.013476] ata2.00: configured for UDMA/133
Dec 10 15:16:45 server2 kernel: [185600.017891] ata2: EH complete
Dec 10 15:16:45 server2 kernel: [185600.055693] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 10 15:16:45 server2 kernel: [185600.062863] ata2.00: failed command: FLUSH CACHE EXT
Dec 10 15:16:45 server2 kernel: [185600.067989] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Dec 10 15:16:45 server2 kernel: [185600.067995]          res 58/00:46:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Dec 10 15:16:45 server2 kernel: [185600.082349] ata2.00: status: { DRDY DRQ }
Dec 10 15:16:45 server2 kernel: [185600.086606] ata2: hard resetting link
Dec 10 15:16:45 server2 kernel: [185600.603444] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl F310)
Dec 10 15:16:45 server2 kernel: [185600.683459] ata2.00: configured for UDMA/133
Dec 10 15:16:45 server2 kernel: [185600.687847] ata2.00: retrying FLUSH 0xea Emask 0x2
Dec 10 15:16:45 server2 kernel: [185600.703439] ata2.00: device reported invalid CHS sector 0
Dec 10 15:16:45 server2 kernel: [185600.708968] ata2: EH complete
Dec 10 16:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], offline data collection was suspended by an interrupting command from host (auto:on)
Dec 11 02:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
Dec 11 02:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], offline data collection was aborted by an interrupting command from host (auto:on)
Dec 11 02:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], previous self-test completed without error
Dec 11 11:12:08 server2 kernel: [257320.985847] ata2: sata_mv: attempting PIO w/multiple DRQ: this may fail due to h/w errata
Dec 11 11:12:29 server2 kernel: [257342.104433] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 11 11:12:29 server2 kernel: [257342.111612] ata2.00: failed command: SMART
Dec 11 11:12:29 server2 kernel: [257342.115849] ata2.00: cmd b0/d5:01:e1:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Dec 11 11:12:29 server2 kernel: [257342.115856]          res 40/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 11 11:12:29 server2 kernel: [257342.130630] ata2.00: status: { DRDY }
Dec 11 11:12:29 server2 kernel: [257342.134414] ata2: hard resetting link
Dec 11 11:12:30 server2 kernel: [257342.644411] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl F310)
Dec 11 11:12:30 server2 kernel: [257342.724475] ata2.00: configured for UDMA/133
Dec 11 11:12:30 server2 kernel: [257342.729004] ata2: EH complete
Dec 11 11:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], self-test in progress, 90% remaining
Dec 11 11:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 116
Dec 11 16:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], self-test in progress, 10% remaining
Dec 11 19:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 116 to 118
Dec 11 19:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], previous self-test completed without error
Dec 12 02:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
Dec 12 02:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], previous self-test completed without error
Dec 12 09:40:20 server2 kernel: [338210.157181] ata2: sata_mv: attempting PIO w/multiple DRQ: this may fail due to h/w errata
Dec 12 09:40:40 server2 kernel: [338230.525908] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 12 09:40:40 server2 kernel: [338230.533084] ata2.00: failed command: SMART
Dec 12 09:40:40 server2 kernel: [338230.537322] ata2.00: cmd b0/d5:01:e1:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Dec 12 09:40:40 server2 kernel: [338230.537329]          res 40/00:01:e0:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 12 09:40:40 server2 kernel: [338230.552102] ata2.00: status: { DRDY }
Dec 12 09:40:40 server2 kernel: [338230.555887] ata2: hard resetting link
Dec 12 09:40:41 server2 kernel: [338231.065763] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl F310)
Dec 12 09:40:41 server2 kernel: [338231.145787] ata2.00: configured for UDMA/133
Dec 12 09:40:41 server2 kernel: [338231.150212] ata2: EH complete
Dec 13 02:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
Dec 13 02:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], previous self-test completed without error
Dec 13 12:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 119
Dec 14 02:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
Dec 14 02:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], previous self-test completed without error
Dec 14 03:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], starting scheduled Long Self-Test.
Dec 14 03:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 119 to 116
Dec 14 03:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], self-test in progress, 90% remaining
Dec 14 04:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 116 to 115
Dec 14 09:14:26 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 115 to 116
Dec 14 09:14:26 server2 smartd[2553]: Device: /dev/sda [SAT], self-test in progress, 10% remaining
Dec 14 11:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 116 to 117
Dec 14 11:14:25 server2 smartd[2553]: Device: /dev/sda [SAT], previous self-test completed without error
Dec 14 11:44:25 server2 smartd[2553]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 117 to 118
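With this many lines, a quick tally helps to see which commands the kernel reports as failed. The small filter below is only a convenience sketch; the function name `count_failed_commands` is made up for this example, and it relies solely on the `failed command:` wording visible in the log above.

```shell
#!/bin/sh
# Reads syslog lines on stdin and prints "<count> <command>" for every
# distinct "failed command: ..." reported by the kernel, most frequent first.
count_failed_commands() {
    awk -F'failed command: ' '
        NF > 1 { n[$2]++ }
        END    { for (c in n) printf "%d %s\n", n[c], c }
    ' | sort -rn
}

# Small self-contained demonstration with three lines from the log above.
count_failed_commands <<'EOF'
Dec  9 20:04:24 server2 kernel: [116461.579952] ata2.00: failed command: SMART
Dec  9 20:04:25 server2 kernel: [116462.254808] ata2.00: failed command: FLUSH CACHE EXT
Dec 10 15:07:46 server2 kernel: [185061.417724] ata2.00: failed command: SMART
EOF
```

It can be fed from the same pipeline as above, for instance `cat /var/log/syslog | count_failed_commands`. Applied to the full excerpt above, it counts 7 failed SMART commands and 2 failed FLUSH CACHE EXT commands.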

On the one hand, many messages are worrying (in bold above), especially those in red; some of the messages seem to indicate that SMART is not working (which could actually be the case if the external enclosure does not support SMART; I did not check). On the other hand, the messages in green above show that SMART does work, at least to an extent, and also show that “everything is OK” (note that the temperature values logged by smartd are normalized attribute values, not degrees Celsius)… And yet, I cannot find a way to read the indicators:

server2:~# smartctl --attributes /dev/sda
smartctl 5.41 2011-06-09 r3365 [armv5tel-linux-3.2.0-4-kirkwood] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Warning! SMART Attribute Data Structure error: invalid SMART checksum.
Error SMART Thresholds Read failed: scsi error aborted command
Smartctl: SMART Read Thresholds failed.

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 17018
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
255 Unknown_Attribute 0x373f 200 016 --- Pre-fail Always - 69269232549888
45 Unknown_Attribute 0x4d44 087 052 --- Old_age Offline - 55204041863235
53 Unknown_Attribute 0x0630 123 039 --- Old_age Offline - 53017881608307
65 Unknown_Attribute 0x3030 056 068 --- Old_age Offline - 53151365537879
70 Unknown_Attribute 0x5845 082 054 --- Pre-fail Offline - 53143143990573
32 Unknown_Attribute 0x2020 032 032 --- Old_age Offline - 35322350018592
32 Unknown_Attribute 0x2020 032 032 --- Old_age Offline - 550026354720
16 Unknown_Attribute 0x3f00 000 016 --- Old_age Offline - 280379760114684
255 Unknown_Attribute 0x000f 000 007 --- Pre-fail Always - 131943408599808
120 Unknown_Attribute 0x7800 000 000 --- Old_age Offline - 0
64 Unknown_Attribute 0xfe00 003 000 --- Old_age Offline - 56613598948096
105 Unknown_Attribute 0x4174 188 035 --- Old_age Offline - 239697196515175
224 Load_Friction 0x544e 174 243 --- Old_age Always - 4
255 Unknown_Attribute 0xbd00 239 000 --- Old_age Offline - 0

Checksum errors, unknown attributes, nonsensical raw values, everything seems to point to a byte order problem; I unfortunately do not know how to use the -v option of smartctl. Besides, I did not have this issue before; maybe it happened because I briefly used the hard disk on a different architecture, namely the x86_64 desktop PC…
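The kernel lines quoted earlier give one more clue about which SMART requests time out. In a line such as `cmd b0/d5:01:e1:4f:c2/…`, the first byte (b0) is the ATA SMART command and the second byte is its feature register, which selects the SMART sub-command. The sketch below decodes that second byte. The mapping follows the ATA command set as far as I know it, the reading of the kernel's field layout is my own, and the function names are made up for this example.

```shell
#!/bin/sh
# Maps the SMART feature register byte (hex, lowercase) to a sub-command name.
smart_subcommand() {
    case $1 in
        d0) echo "SMART READ DATA" ;;
        d1) echo "SMART READ ATTRIBUTE THRESHOLDS (obsolete)" ;;
        d4) echo "SMART EXECUTE OFF-LINE IMMEDIATE" ;;
        d5) echo "SMART READ LOG" ;;
        d6) echo "SMART WRITE LOG" ;;
        d8) echo "SMART ENABLE OPERATIONS" ;;
        d9) echo "SMART DISABLE OPERATIONS" ;;
        da) echo "SMART RETURN STATUS" ;;
        *)  echo "unknown (0x$1)" ;;
    esac
}

# Reads kernel log lines on stdin and names the sub-command of every
# "cmd b0/xx:..." line (b0 being the SMART command itself).
decode_smart_cmd() {
    sed -n 's/.*cmd b0\/\([0-9a-f][0-9a-f]\):.*/\1/p' |
    while read -r feature; do
        smart_subcommand "$feature"
    done
}

echo 'ata2.00: cmd b0/d5:01:e1:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in' | decode_smart_cmd
# prints "SMART READ LOG"
```

If this reading is right, the requests that time out are SMART log reads sent in PIO mode, which the sata_mv driver itself flags just before each failure as something that “may fail due to h/w errata”. That would point at the Sata controller or its driver rather than at the disk, but it remains a guess.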

Well, here I am. The server is working fine, but for how long? I'm interested in the slightest hint, as well as any explanation, by anyone knowing better than I do :-) Please accept my thanks in advance for your help!

Comments

1. On Friday, November 4, 2016, at 11:26, by Yves
I was kindly informed by e-mail that a new and interesting article on the longevity of hard drives is available at comparitech.com. It also happens to link to newer data from BackBlaze. Enjoy!
