Tuesday 10 January 2023

Two dozen data centre fire controls


Fire is clearly a significant risk to any data centre: a major incident (disaster!) is reported somewhere in the world roughly once a quarter on average, plus an unknown number of smaller/unreported ones. Limited public disclosure of data centre fire investigation reports makes it tough, even for experienced professionals, to assess and quantify the risk.  However, since the likely impacts and costs of such major incidents are obviously non-trivial and the number of incidents is definitely not zero, it would be negligent to ignore the risks.

Controls to avoid, mitigate or share data centre/IT facility fire risks include:
  1. Governance and management arrangements taking due account of information risks, including physical security aspects, when designing and procuring information services such as commercial cloud services and data centre/co-location facilities - which, by the way, don't automatically reduce the risks: hopefully they benefit from the professional engineering, high quality maintenance and operations appropriate to large-scale IT installations, so expect to pay accordingly;

  2. Sustainable power and environmental management such as a policy to procure energy-efficient equipment and account for power consumption;

  3. Strategically-chosen locations avoiding environmental risks such as floods, earthquakes, bush fires, social unrest and war zones;

  4. Geographically-diverse multi-site resilient load-sharing installations, implying the need for fast, reliable and secure data communications to keep them all in-step plus a shed-load of systems engineering;

  5. Taking a holistic systems-engineering approach to the specification, architecture, design, installation, use, management and maintenance of the facilities and services, taking due account of all the associated information risks, security/control, business and compliance requirements (e.g. ensuring clear access routes for fire-fighters, physical access and egress controls with emergency overrides);

  6. Physically isolated internal zones using fire barriers such as slab-to-slab partition walls, plus floors, ceilings, doors, ducts and cabling, all appropriately fire-rated;

  7. Electrical isolation of zones with the capability to cut primary and backup power quickly to allow firefighting and reduce the spread;

  8. Full-flood fire suppression of IT facilities using carbon dioxide, water mist or permitted extinguishant chemicals;

  9. A strict 24x7 ban on smoking in/near computer facilities (obviously!) plus extreme care with other sources of ignition such as welding and plumbing using gas torches, ‘temporary’ extension cables and adapters, overloaded racks, dust build-up on fans and filters, and old equipment generally - plus HR and management controls to mitigate the arson threat from insiders and intruders;

  10. Appropriate fire alarm/power supply interlocks e.g. powering-down air conditioners early in the fire alarm/response sequence to avoid fanning the flames, and triggering controlled but rapid server shutdowns before power is cut;

  11. Proactively removing or replacing flammable materials in or near the computer facilities, especially volatile solvents, paper, cardboard and plastics – such as paint, computer manuals, backup tapes, ordinary PVC-covered cables and spare/stored IT equipment;

  12. Special isolation, fire detection, containment and suppression arrangements for particularly high-risk elements such as high-energy-density batteries in uninterruptible power supplies, laptops, IoT things and other devices/equipment, and archives containing invaluable/irreplaceable information assets;

  13. Improved fire detection e.g. high sensitivity aspirating systems with local and remote/monitored alarms, combining in-rack heat/smoke detectors and others strategically placed in the facility, with regular competent testing ... and thermal imaging to identify hotspots such as overloaded power cables, switches, power supplies or racks;

  14. Rapid, effective, well-rehearsed fire responses e.g. fire training for workers including security guards and engineers/maintenance people, not just fire crews, with suitable policies, procedures, awareness, training and exercises, perhaps even on-site fire stations, firefighters and equipment similar to airports and industrial areas;

  15. Secure off-site backups plus proven recovery arrangements, especially for critical people and for systems that have not been professionally engineered for resilience;

  16. Appropriate insurance cover and related controls (Hinson tip: this is just one of these two dozen controls, and is not a panacea!);

  17. Effective service, business and life continuity management (e.g. resilience engineering, high-availability systems, redundant/diverse dual-live/failover and disaster recovery arrangements, first aiders) with professional engineering, installation, operations, monitoring, maintenance and change-controls plus appropriate assurance measures;

  18. Appropriate assurance measures such as periodic and ad hoc exercises and tests, fire/safety/maintenance site inspections and audits, supplier assessments etc. implying the need for clear criteria/requirements, competent assessors and senior management support for corrective actions;

  19. Power consumption/heat load and temperature monitoring, hinting at the value of certain physical security metrics plus concerns around high density equipment racks;

  20. Explicit roles and responsibilities, plus accountability, such that all parties (for there are many) are fully aware of and formally accept their parts in managing the risks and their liabilities if they fail;

  21. Change management of all physical aspects, with sufficient documentation and authorisation for significant changes, including incremental changes that cross threshold points for power, temperature etc.;
     
  22. Proactive physical risk management, for example tracking fire research as state of the art progresses in this area, learning lessons from data centre fires elsewhere, taking professional advice and generally going beyond minimalist building codes as appropriate;

  23. Awareness, training, supervision and oversight for all workers on-site, since inept or unfortunate workers are a prevalent cause of fires. Don’t forget that people are extremely valuable yet vulnerable assets, hence health and safety/welfare qualifies as an information security control as well as a legal and ethical obligation. As a simple example, designated ‘fire points’ (with the appropriate types of extinguishers for use by suitably-trained people) should ideally be located near protected fire exits, not tucked away deep inside the facilities, so that anyone brave/foolhardy enough to attempt to fight a fire with a hand-held extinguisher (if permitted as a policy matter) has, first and foremost, ready access to a clear escape route;

  24. Sharing of information about fire risks, controls and incidents, adoption of good practices defined in standards, codes of practice, advisories etc., perhaps even participation in research and development studies.

Patently, there is a lot to consider here so, as I said, take advice from competent professional experts (not me! This blog piece is just a heads-up! I am an information risk and security specialist, not a fire expert!).

If this all seems very costly, yes, you're right, especially with today's hyper-scale data centres piling eggs high in ever larger baskets ... but consider the alternative: if the computer facilities, equipment, data and perhaps personnel go up in flames, splattering the news headlines, what would that cost your organisation? Inadequate physical risk management could turn out to be an existential failure.
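
By way of illustration only, controls 19 and 21 in the list above (monitoring power/heat load, and catching incremental changes that cross threshold points) could be sketched in a few lines of Python. Everything here is an assumption made for the example: the names, the kW figures and the 80% review fraction are hypothetical, not values recommended by any standard or fire professional.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Change:
    """One proposed physical change to a rack (hypothetical record)."""
    description: str
    added_kw: float  # extra power draw the change introduces


def changes_needing_review(
    baseline_kw: float,
    limit_kw: float,
    changes: List[Change],
    review_fraction: float = 0.8,  # assumed trigger point, not a standard value
) -> List[Tuple[str, str]]:
    """Flag the changes that push cumulative load across a threshold.

    Each change may look trivial on its own; the point is to spot the
    one that tips the running total over the review threshold or over
    the rack's rated limit.
    """
    flagged = []
    load = baseline_kw
    threshold = limit_kw * review_fraction
    for change in changes:
        before = load
        load += change.added_kw
        if before <= threshold < load:
            flagged.append((change.description, "crosses review threshold"))
        if before <= limit_kw < load:
            flagged.append((change.description, "exceeds rated limit"))
    return flagged
```

A real implementation would draw on metered power and temperature data rather than nameplate figures, and would feed the flagged changes into whatever authorisation process the organisation already runs.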


PS  ISO/IEC 27002:2022 control 7.5 "Protecting against physical and environmental threats" recommends that:
"Protection against physical and environmental threats, such as natural disasters and other intentional or unintentional physical threats to infrastructure should be designed and implemented ... to prevent or reduce the consequences of events originating from physical and environmental threats."

That advice, plus about half a page of notes elaborating on the control, is basic and would be woefully inadequate for many real-world data centres. For instance, 'designing and implementing' the controls neglects to mention equally important aspects such as their operation, management, monitoring, maintenance and assurance. The standard is a starting point. The same point applies to information security controls in general: ISO/IEC 27002 is not a detailed configuration and implementation guide, merely an introduction. The brief control summaries in ISO/IEC 27001 Annex A are barely even titles!
