Tuesday 21 August 2012

Breaking Sod's Law

A news piece about a failure of the 911 emergency phone services during a storm offers a valuable security awareness lesson for any organization that relies on a generator-backed UPS to maintain clean, stable power to essential ICT services - meaning practically every large organization and many smaller ones too.

Reading somewhat between the lines, the Fairfax County 911 ICT services depended on a UPS, the batteries in which were supposed to be kept fully charged by two generators in case the mains power failed.  One of the generators worked fine but the other evidently failed to auto-start.  Insufficient capacity in the one running generator meant that the UPS batteries gradually ran down over a few hours until eventually the UPS ran out of juice, the lights went out and the 911 ICT services failed.
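By way of illustration, here is a back-of-the-envelope sketch in Python of that energy balance.  The load, generator and battery figures are invented for the purpose, not taken from the Fairfax County installation:

```python
# Hypothetical energy-balance sketch: one generator failed to start, so the
# remaining generator cannot carry the full ICT load and the UPS batteries
# silently make up the shortfall until they are exhausted.  Figures invented.

load_kw = 120.0               # assumed steady ICT load
generator_output_kw = 80.0    # assumed capacity of the one running generator
battery_capacity_kwh = 200.0  # assumed usable UPS battery capacity

shortfall_kw = max(load_kw - generator_output_kw, 0.0)

if shortfall_kw == 0.0:
    print("Generator covers the load; the batteries stay charged.")
else:
    hours_to_flat = battery_capacity_kwh / shortfall_kw
    print(f"Shortfall of {shortfall_kw:.0f} kW drains the UPS in "
          f"about {hours_to_flat:.1f} hours.")
```

With those made-up numbers, a 40 kW shortfall flattens a 200 kWh battery bank in roughly five hours - the same slow-motion failure mode that apparently caught out the 911 center.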

Moments after the initial power cut, the technicians on site were probably relieved that the change-over from mains to generator power had gone smoothly and the ICT services carried on running normally: a power cut is a stressful event.  However, it seems they either relaxed too soon or lacked the information to realize that the second generator was not running, and hence that the UPS batteries were gradually discharging.

Speaking from personal experience as a former ICT manager, I can say that 'running on UPS' is an unusual situation requiring unusual activities.  Whenever the sites I managed experienced power cuts during the normal working day, I relied on the site maintenance people - our wizards - to ensure that the electrical equipment was working correctly, leaving me to worry about the peripheral ICT equipment and services that might not be on UPS - for example, figuring out which bits of kit had failed or might fail soon, and which parts of the business might be affected.  When power cuts happened out-of-hours there was generally less pressure since the business was relatively quiet, but the wizards were often unavailable too, so I was sometimes faced with unfamiliar blinkenlights and contingency situations, leaving me little choice but to muddle through and hope for the best.

Thankfully that haphazard, high-risk approach was good enough at the time but would be quite inappropriate for a critical 24x7 operation ... such as the ICT supporting 911 emergency services to nearly 2.3 million people ...

Verizon had tested (and presumably passed!) the standby power system just three days before the storm - so why did it go wrong on the big day?  According to the Washington Post article, "At the Arlington site, the routine and limited testing had not checked whether a generator could carry a full power load in an emergency".  Oops.  The Post doesn't say why Verizon's testing was limited, but these are the kinds of reasons (justifications or excuses?) I have come across in the course of hundreds of IT installation audits elsewhere:
  • "Since the power system was professionally designed and thoroughly tested when it was installed, routine testing is unnecessary" [wrong!  Loading changes, equipment wears out, batteries lose their capacity ...];
  • "It was tested 2 or 3 years ago" [see above - and in this particular case, it turned out the previous testing was so limited as to be pathetically inadequate]
  • "Full testing is too costly" [unanticipated power incidents can be far costlier];
  • "Limited/offline testing is sufficient" [it's useful for some but not all checks, and could even be counterproductive since lightly-loaded generators may accumulate partially-burnt fuel];
  • "The power fails every few weeks and whenever it does, the backup power has worked fine - so there's no point in testing it" [testing is an opportunity to check things more thoroughly and if appropriate push things to the limit e.g. simulating extended power outages on full load];
  • "We follow the equipment suppliers' recommendations" [strangely enough, when I trotted out the predictable line "Show me", nobody could lay their hands on the mythical guidance documents];
  • "Full testing is too risky so we only do very limited testing out of hours" [a scary response: management was clearly afraid their critical backup systems would fail the tests, meaning they lacked the necessary assurance to be confident they would work when actually needed for real, meaning significant business risks were not being properly treated].
Aside from the obvious stuff such as excellent power engineering, automated failovers, over-capacity, proper equipment maintenance, procedures and full on-load testing, there are other ways of reducing unacceptable power risks:
  • More than barely adequate funding, in other words treating the complete power system as a vital infrastructure investment; 
  • Proper instrumentation, allowing power supplies and loads to be monitored continuously and projected accurately in relation to power system capacity, with suitable alarms and alerts triggering response procedures when the readings head into the amber (don't wait for them to go red - or worse still go out altogether!); a rough sketch of this kind of amber/red alerting follows this list.  Adequate voltage, current and power metering is hardly rocket-surgery, while temperature monitoring (including the use of thermography) can tell an experienced power engineer a lot about the state of the plant and switchgear;
  • Independent power system audits by competent, experienced assessors;
  • Productive working relationships between the facilities people, IT people, site and information security people, power people and business people, including the suppliers of specialist UPS, generator and other equipment and, of course, the lines companies and power suppliers.
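On the instrumentation point, here is a rough Python sketch of the amber/red alerting idea.  The capacity, thresholds and readings are invented; a real installation would use the power engineers' own figures and feed the alerts into its normal monitoring and response arrangements:

```python
# Hypothetical amber/red alerting for continuous power monitoring.
# Capacity figure, thresholds and example readings are all invented.

SYSTEM_CAPACITY_KW = 200.0
AMBER_THRESHOLD = 0.70   # assumed: investigate when load exceeds 70% of capacity
RED_THRESHOLD = 0.90     # assumed: escalate when load exceeds 90% of capacity

def assess_reading(load_kw: float) -> str:
    """Classify a metered load against system capacity."""
    utilization = load_kw / SYSTEM_CAPACITY_KW
    if utilization >= RED_THRESHOLD:
        return "RED: invoke emergency response procedures now"
    if utilization >= AMBER_THRESHOLD:
        return "AMBER: investigate, shed or redistribute load"
    return "GREEN: within normal operating range"

# Example readings trending upwards towards capacity
for reading_kw in (110.0, 150.0, 185.0):
    print(f"{reading_kw:.0f} kW -> {assess_reading(reading_kw)}")
```

The point is not the code but the discipline: continuous readings, explicit thresholds and a defined response long before the batteries go flat.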
Backups and contingency arrangements are needed because of Sod's Law or Murphy's Law. Trouble is, backups and contingency arrangements are subject to exactly the same laws (remember how the tsunami flooded the emergency backup generators at Fukushima?).  And so are the tests, by the way.  The trick is to do whatever it takes to make sure the systems will pass their tests with flying colors and have solid contingency plans just in case they don't.
