Tuesday 21 June 2016

Another one bites the dust ...

My PC has been going steadily downhill for the past week or two, until finally at the weekend it plummeted off the cliff's edge into the deep blue.

The symptoms were confusing: it would freeze up randomly, sometimes thawing and sometimes slowing to a crawl but occasionally becoming totally unresponsive, requiring a reboot. There were no error messages, at least none that I noticed. The Windows error log was no help, and there was no obvious pattern to it. I couldn't pin it on any specific app or situation - even reverting a recent software update on an app that I run 24x7 made no difference. There were no reported viruses. The PC wasn't overheating and the mains supply is reliable.

Well, with 20/20 hindsight, there were some little clues about the underlying cause. Saving fairly large files sometimes took a bit longer than normal due to the PC pausing for breath in mid-save. MP3 music would sometimes stutter, endlessly repeating a few seconds like a scratched vinyl record or a very talented parrot. 

Defragmenting the disks with the Windows built-in function or with Piriform's Defraggler utility (which, I suspect, is just a pretty user interface layered on top of the selfsame Windows tools) didn't help, and in fact one of the disks refused to defrag fully ... so, suspecting a disk problem, I tried CHKDSK and took a look at the S.M.A.R.T. reporting. Neither seemed to indicate anything wrong, although the S.M.A.R.T. parameters aren't exactly simple to understand. I guess I'm just not sufficiently familiar with the normal values to spot something out of the ordinary. Does any of this look bad to you?


For instance, the S.M.A.R.T. 'Read Error Rate' on this particular disk has a 'real value' of zero, but a 'Current' value of 200 which appears to be the worst (worst ever, I guess, presumably for the lifetime of the drive) ... and a 'Threshold' value of 51. So is the 200 or zero good news or bad? There is no red flag, nothing but the merest hint that the drive might perhaps be about to leap off a handy cliff. The other S.M.A.R.T. parameters are just as confusing. Have I really only power-cycled this disk 7 times? Somehow I doubt that. I'm just not smart enough for S.M.A.R.T.
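As a rough guide, the usual convention is that 'Current' and 'Worst' are normalised health scores where higher is better, and an attribute is only flagged once its score drops to the 'Threshold' or below. The 'real' (raw) value is vendor-specific and harder to interpret. Here is a minimal Python sketch under that assumption; the class and function names are made up for illustration and are not part of any S.M.A.R.T. tool:

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    """One row of a S.M.A.R.T. report (hypothetical structure, for illustration)."""
    name: str
    current: int    # normalised score right now (higher is better)
    worst: int      # lowest normalised score recorded so far
    threshold: int  # score at or below which the attribute counts as failed
    raw: int        # vendor-specific 'real' value


def assess(attr: SmartAttribute) -> str:
    """Classify an attribute by comparing its normalised scores to the threshold."""
    if attr.current <= attr.threshold:
        return "FAILING NOW"
    if attr.worst <= attr.threshold:
        return "FAILED IN THE PAST"
    return "ok"


# The figures quoted above for this disk's 'Read Error Rate'.
read_errors = SmartAttribute("Read Error Rate", current=200, worst=200, threshold=51, raw=0)
print(read_errors.name, "->", assess(read_errors))
```

On that reading, 200 against a threshold of 51 is a clean bill of health, which is exactly the problem: the disk was dying anyway, so an "ok" here is no guarantee.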

The final clue to my failing hardware came when Word steadfastly refused to open a large (250-page) document that I had been working on lately. The file looked normal in Explorer but Word stopped opening it about a third of the way through the progress bar, complaining that it was corrupted.

Oh oh.

Naturally, being an infosec pro and 'professionally paranoid', I have multiple backups of the disk in question, the most recent being about a week ago (I really should sort out a daily backup regime!) ... so I decided the best approach was to try to copy the dying disk contents to another drive and then hopefully restore any failed/corrupted files from backups. The disk copy started OK but a few minutes later the errors started coming thick-n-fast. Having selected 'skip' to ignore the corrupted files, the PC did its best to copy the remainder. Judging by the bursts of normal-speed copying interspersed with slow or dead-slow periods, the problem seemed to afflict various parts of the disk differently (perhaps a head crash or misalignment?). It turned what is normally a 30 minute job into an all-day marathon.
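The same 'copy what you can, skip what you can't' approach can be scripted so that the skipped files end up in a list for restoring from backup later. This is only a sketch in Python, not what I actually ran: the function name is hypothetical, and it assumes the failing disk is still mounted and readable enough to list its folders.

```python
import os
import shutil


def salvage_copy(src_root, dst_root):
    """Copy every readable file from src_root to dst_root.

    Returns the relative paths of files that could not be copied,
    so they can be restored from backup afterwards.
    """
    failed = []
    # Note: os.walk silently skips folders it cannot list at all.
    for folder, _subdirs, files in os.walk(src_root):
        rel_folder = os.path.relpath(folder, src_root)
        target_folder = os.path.normpath(os.path.join(dst_root, rel_folder))
        os.makedirs(target_folder, exist_ok=True)
        for name in files:
            try:
                # copy2 keeps timestamps as well as the file contents
                shutil.copy2(os.path.join(folder, name),
                             os.path.join(target_folder, name))
            except OSError:
                # Read error, missing file, etc: note it and carry on
                failed.append(os.path.normpath(os.path.join(rel_folder, name)))
    return failed
```

Calling something like salvage_copy("E:\\", "F:\\rescued") (drive letters purely as an example) would return the list of casualties. A file that fails part-way through may leave a truncated copy behind on the target, so anything in the list should be replaced from backup rather than trusted.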

That was yesterday. Today I've been checking the transferred files and recovering a few obviously missing ones from backups. All the apps I have tried so far seem to work fine, and my blood pressure is heading back down towards the normal range.
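Spotting the 'obviously missing' files by eye is error-prone, so a comparison of the backup against the rescued copy could be automated. A minimal Python sketch, with hypothetical function names, assuming the backup and the copy share the same folder layout:

```python
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(folder, name), root))
    return found


def missing_from_copy(backup_root, copy_root):
    """List files that exist in the backup but not in the rescued copy."""
    return sorted(relative_files(backup_root) - relative_files(copy_root))
```

This only catches files that are absent altogether. Files that copied across but were silently corrupted would need a content comparison (checksums, say), and anything created since the last backup has no reference to compare against.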

While looking for another disk to replace the dodgy one, I checked through our heap of SATA drives on the side, setting aside those marked "DEAD" or "DYING" in bold red marker pen. I now have two dead Western Digital 250 GB SATA 2 drives and two dead Seagate 1 TB SATA 3 drives: all have expired within the past couple of years or so. The Seagate Barracudas were particularly disappointing, failing much sooner than I expected for no obvious reason other than poor quality manufacturing, so I won't be buying or recommending Seagate ever again. The WD Caviar drives have done rather better, lasting about 7-9 years in daily use. The take-home message for me is not to expect more than about 5 years' life out of a decent disk, even less from Seagate.

Today I ordered two new WD Black 1 TB SATA 3 drives. The "Black edition" drives are evidently tested more thoroughly than the Blues, and come with a 5 year manufacturer's warranty - although it is a 'return to base' warranty which the cynic in me suspects means I'm paying extra for the dubious privilege of being a WD beta tester. I also ordered an external USB3 double disk caddy that can make disk-to-disk copies, presumably bit-copies or sector-wise duplication: that will come in handy to make belt-n-braces backups of my backup disks to store offline in the fire safe. Like I said, I'm professionally paranoid!

Aside from new hardware and, hopefully, a daily backup regime, I need to put more effort into monitoring and hopefully understanding those S.M.A.R.T. parameters. Maybe I ought to investigate RAID for its real-time data duplication, perhaps even cloud storage if our rural broadband can take the strain? A policy of retiring disks before they hit 5 years of age makes sense too, for the primary data disks anyway. Windows and apps can always be reloaded. The data are what counts. There's gold in them thar bits.

The real trick, of course, is to use this security incident and make those improvements. It's all very well me blogging about them: it amounts to nothing but good intentions unless I follow through and complete the corrective actions. Hopefully, though, this little case-study has made you contemplate your own situation, your controls, your disks, data and backups, especially if you are a small business like us or a power user at home. You get the benefit without the costs and the blood pressure spike. Watch and learn.
