Crowdstrike - post-incident review: a dozen learning points

[Image: walkway through a swamp]

I blogged about the Crowdstrike incident on July 21st while it was still playing out. Now, having drained the swamp and let the dust settle, I'm due to draw out, deconstruct and decide what to do about the Crowdstrike disaster, so here goes:

  1. Design, build and test systems for resilience, where 'systems' means not just IT systems but the totality of interdependent technologies, organisations, people, information flows and other resources necessary to deliver and support critical business activities.

    Hinson tip: "be prepared" is not just for boy scouts! Those dependencies are potential pinch plus pain points.

  2. Test software before release. Sounds easy, right? It isn't. There is an infinite amount of testing that could be performed, only a fraction of which realistically should be, while the amount and quality of testing actually performed is resource-constrained and time-boxed for business and uncertainty (risk!) reasons (delaying security changes can increase uncertainty/risk). Deciding when 'enough is enough' is a tricky context-dependent uncertainty (risk!) management decision based on available information (some of dubious value or misleading) and criteria that are barely understood, let alone documented. Uncertainties (risks!) inevitably remain.

    Hinson tip: try testing your test tactics and tools against a Crowdstrike-type incident, not so much to reduce the uncertainty (risk!) of this particular incident recurring but more generally to think through various failure scenarios associated with software developments, releases, implementations, patches or other system changes.
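    To make that a little more concrete, here is a minimal sketch of one such failure-scenario test in Python: feed deliberately malformed update content into whatever parses vendor-supplied updates in your environment, and check that it fails in a controlled way rather than crashing. The module, function and exception names (myagent.updates, load_channel_update, UpdateRejected) are purely hypothetical stand-ins for your own code.

```python
# Minimal sketch of a 'malformed update content' failure-scenario test.
# load_channel_update() and UpdateRejected are hypothetical stand-ins for
# whatever parses and validates vendor-supplied update content in your estate.
import pytest

from myagent.updates import load_channel_update, UpdateRejected  # hypothetical

MALFORMED_SAMPLES = [
    b"",                      # empty file
    b"\x00" * 4096,           # zero-filled blob
    b"\xde\xad\xbe\xef",      # wrong magic bytes
    b'{"truncated": tru',     # cut off mid-stream
]

@pytest.mark.parametrize("blob", MALFORMED_SAMPLES)
def test_malformed_update_is_rejected_cleanly(blob):
    # The point is predictable failure: a controlled exception,
    # never a crash, a hang or a silently-accepted bad configuration.
    with pytest.raises(UpdateRejected):
        load_channel_update(blob)
```

    The particular samples matter less than the habit: each new failure scenario you think of becomes another entry in the list.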

  3. Test patches on receipt before implementation on critical systems or (for urgent security updates) during phased deployment. I'll say it again: testing is tricky. Things may still turn to custard.

    Hinson tip: given that software implementation and change is inherently uncertain (risky!), risk avoidance and mitigating controls can prove invaluable, which means preparing for the worst despite hoping for the best. In particular, it is worth developing the ability to monitor, slow/stop and reverse-out changes that don't go well ... so ...

  4. Specify, develop and test your resilience, recovery and restitution processes. Yes, since it involves testing, this is fractal or self-referential.

    Hinson tip: there are various limits and uncertainties (risks!) inherent to the testing here too, and as with planning and plans, resilience is more about the journey than the destination. It's a capability, a cultural characteristic, part of the organisation's character.
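    Pulling tips 3 and 4 together, here is a minimal sketch of what a monitor / slow / stop / reverse-out capability can look like as a phased ('ring' or 'canary') rollout. The deploy_to(), fleet_health() and roll_back() stubs are hypothetical placeholders for whatever deployment and telemetry tooling you actually use, and the ring sizes, soak time and abort threshold are illustrative, not recommendations.

```python
import time

# Illustrative rings, soak time and abort threshold - tune to your own risk appetite.
RINGS = [("canary", 0.01), ("early", 0.10), ("broad", 0.50), ("everyone", 1.00)]
SOAK_SECONDS = 30 * 60      # watch each ring before widening the blast radius
MAX_FAILURE_RATE = 0.02     # halt and reverse out if >2% of hosts look unhealthy


def deploy_to(package: str, fraction: float) -> None:
    """Hypothetical stub: push the package to this fraction of the fleet."""
    print(f"Deploying {package} to {fraction:.0%} of hosts")


def fleet_health(ring: str) -> float:
    """Hypothetical stub: fraction of healthy hosts in the ring, from real telemetry."""
    return 1.0


def roll_back(package: str) -> None:
    """Hypothetical stub: reverse out the change on every host that received it."""
    print(f"Rolling back {package}")


def phased_rollout(package: str) -> None:
    for ring, fraction in RINGS:
        deploy_to(package, fraction)
        time.sleep(SOAK_SECONDS)                 # monitor before proceeding
        failure_rate = 1.0 - fleet_health(ring)
        if failure_rate > MAX_FAILURE_RATE:
            roll_back(package)                   # reverse out the change
            raise RuntimeError(f"Halted at ring '{ring}': {failure_rate:.1%} unhealthy")
```

    The shape matters more than the numbers: every change passes through gates where someone, or something, can say 'stop' and undo it.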

  5. Proactively manage uncertainty (risk!) relating to or involving information, such as identifying which business activities and 'systems' are critical (see #1), and taking the appropriate actions to understand and tackle unacceptable uncertainties (risks!). Although generic, this requires good governance and strong, clued-up go-getting leadership.

    Hinson tip: business impact analysis, a core part of business continuity management, is an excellent, structured approach for management to distinguish critical activities and systems from non-critical ones - the toy sketch below shows the basic idea.
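    A deliberately simplistic illustration, using made-up activities, outage tolerances and impact figures purely for the sake of the example: rank activities by how long the business can tolerate losing them, and flag as critical anything that cannot wait.

```python
from dataclasses import dataclass

@dataclass
class Activity:
    name: str
    max_tolerable_outage_hours: float  # how long the business can cope without it
    hourly_impact: float               # rough cost per hour down (revenue, penalties, harm)

# Entirely made-up examples - replace with the results of real BIA interviews.
ACTIVITIES = [
    Activity("customer payments",    2,  50_000),
    Activity("order fulfilment",     8,  20_000),
    Activity("monthly HR reporting", 72,    500),
]

CRITICAL_IF_OUTAGE_UNDER_HOURS = 8   # illustrative threshold, set by management

for a in sorted(ACTIVITIES, key=lambda a: a.max_tolerable_outage_hours):
    label = ("CRITICAL" if a.max_tolerable_outage_hours <= CRITICAL_IF_OUTAGE_UNDER_HOURS
             else "non-critical")
    print(f"{a.name:22} {label:12} tolerates {a.max_tolerable_outage_hours:>4.0f}h outage, "
          f"~${a.hourly_impact:,.0f}/hour impact")
```

    The value lies in the conversations that produce the numbers, not in the arithmetic itself.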

  6. Organisations - not just nations - have critical infrastructure, things on which they are utterly dependent, and which inevitably have multiple points or modes of failure. 

    Hinson tip: eliminating single points of failure is necessary but not sufficient unless every last one is truly eliminated - an unlikely situation (the sketch below shows how easily some remain even in a partly-redundant design).
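    Here is a minimal sketch using an invented dependency map: each component needs at least one working alternative from every group of things it depends on, and a single point of failure is any one component whose loss takes down a critical service.

```python
# Invented, illustrative dependency map: component -> list of requirement groups,
# where any one member of a group will do (i.e. the group is redundant).
DEPENDS_ON = {
    "online ordering": [{"web cluster"}, {"payment gateway A", "payment gateway B"}],
    "web cluster":     [{"datacentre power"}, {"primary ISP", "backup ISP"}],
}

CRITICAL_SERVICES = ["online ordering"]


def works(component, failed):
    """True if the component is up, assuming the dependency map is acyclic."""
    if component in failed:
        return False
    return all(any(works(alt, failed) for alt in group)
               for group in DEPENDS_ON.get(component, []))


def single_points_of_failure():
    everything = set(DEPENDS_ON) | {c for groups in DEPENDS_ON.values()
                                    for group in groups for c in group}
    return sorted(c for c in everything
                  if any(not works(svc, failed={c}) for svc in CRITICAL_SERVICES))


print(single_points_of_failure())
# ['datacentre power', 'online ordering', 'web cluster']
```

    Even with dual ISPs and dual payment gateways, the power feed, the web cluster and the service itself remain single points of failure - which is the point of the tip above.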
     
  7. Soon after a major incident strikes, a tsunami of information starts streaming with a multitude of messages manifesting through multiple media and modes of communication. Information security is relevant here too: some of the available information is meaningful, valid, pertinent, useful and valuable ... but some isn't. Some is misleading, manipulative or malicious. Although it is impossible to control information flowing in the public domain, it is worth paying attention to the internal corporate information, including management information, information disseminated throughout the workforce, instructions passed to pertinent functions, departments, teams or individuals, and any information disclosed or revealed to third parties.

    Hinson tip: it is good practice to provide 'out-of-band' alternative comms mechanisms in case the usual email, phone and messaging services are out of action. For real finesse, add the default capability for workers to act sensibly on their own initiative if all comms are down, or if the messages coming through are incoherent or distinctly dubious.

    Hinson tip: major incidents have sociological effects. Frankly, I'm not sure what can be done about this, other than to acknowledge and be ready for the possibility of shock and awe ... which in turn increases the possibility (risk!) of follow-on malicious exploitation and mischief, plus community responses that vary from one community to another.

  8. Manage incidents competently, responding efficiently and effectively in real time. IT, uncertainty (risk!) and security, cybersecurity and incident response professionals (leaders especially) should be capable of rising professionally above any developing crisis, calmly and confidently offering clear guidance, sensible direction and support for the teams battling the fires and feeding relevant information to various stakeholders and onlookers.

    Hinson tip: doing all that requires preparation, practice, proof and particular personalities. It's harder than it seems: coping well with crisis is a special talent.

  9. Playbooks are practically pointless ... whereas purposely preparing them is on-point. It's the same issue with incident scenarios, use cases, test plans, uncertain (risky!) situations etc.: these revolve around examples as a way to focus attention and remain grounded in reality, whereas real life is seldom so neat.

    Hinson tip: if you are doggedly determined or resigned to continue preparing playbooks, one way to address this is to pointedly and purposely if philosophically propose and practice a special 'something else happens' playbook scenario, suggesting a contingency approach: since we are dealing with imperfect information about infinite future possibilities, meaning uncertainty (risk!), what we can and should do when things go wrong is partly contingent on what actually occurs, and on what resources we can command at that point.

  10. Insurance will, at best, only cover part of the losses arising from incidents. Confidently calculating or counting costs is complex and challenging, especially in advance but also following major incidents. Uncertainty (risk!) and incident management will, at best, reduce but not eliminate uncertainty (risk!). An uncertain amount of residual uncertainty (risk!) is inevitable ... so suck it up! Build and maintain sufficient reserves to get through whatever transpires.

    Hinson tip: reducing uncertainty (risk!) reduces the need for those contingency reserves, releasing them for more productive purposes. In other words, it saves money and can generate competitive advantage.

  11. While some are still busy with the post-incident hose-down and wash-up, we are mostly now in the pre-incident 'left of boom' phase: the next incident is around the corner. We don't know exactly what will go bang, nor how or when or how loud it will be, but we can be certain 'something bad' is going to occur ... or may be happening right now. A realistic (neither pessimistic nor optimistic) assessment of that uncertainty (risk!) should lead to appropriate responses, involving information, insight, innovation and investment.

    Hinson tip: vendors touting their products as 'solutions' to the Crowdstrike incident may alienate the market and be accused of 'ambulance chasing'. I'm aware that I could be accused likewise for writing and releasing this article ... but I assure you I have benevolent intent. Whether you believe me, or those ambo-chasing vendors, is for you to decide - especially where a 'solution' involves spending money. It's both an ethical and a business matter.

  12. Unless changes are actually made as a result of an incident, the uncertainties (risks!) remain and we will have missed out on a valuable learning and improvement opportunity.

    Hinson tip: it's all very well me blabbering on 'bout bits in my blog but blagging the resources to make real progress on any of this is down to you, dear reader. Good luck. Be brave. Go big.