Giant Green Land Snails

A fascinating thread is being woven over on the ISO27k Forum, concerning information security risk analysis (RA) methods. Bob Ralph made a good point that set me thinking this morning: 
"... 'unknown knowns' should be scored the very highest until you know otherwise ... Of course the 'unknown unknowns' will not even be on the RA. But they are still about somewhere."
Biologists have developed techniques for estimating the unknown and answering awkward questions such as "How many Giant Green Land Snails are there on this island?" The obvious technique is to try to catch and count them all, but that's (a) costly, (b) disruptive for the snails and their ecosystem, and (c) not as accurate as you might think (snails are well camouflaged and duck for cover when biologists approach!). Capture-mark-recapture, also known as tag-and-release, is a more useful technique: catch all the snails you can find in a given area, mark or uniquely identify them in some way (preferably not with bright yellow blobs of paint that make them even easier targets for the Little Blue Snailcatcher bird that preys on Giant Green Land Snails!), then some time later repeat the exercise, this time noting how many of the snails you catch are already marked and how many are new. From the proportion of marked snails in the second catch, estimate the total population of Giant Green Land Snails in the area, then extrapolate across the entire island, taking various other factors into account (e.g. nesting areas for the Little Blue Snailcatchers, quantity and quality of habitats for the snails, snail lifetime, foraging range etc.). There are statistical techniques supporting this kind of method, and various other methods that give reasonable estimates, sufficient to answer the original question and related ones such as "Is the population shrinking or expanding?". I'm sure there are similar approaches in other fields of science - estimating the age and size of the universe, for instance.
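
Purely to illustrate the arithmetic, here is a minimal sketch in Python of the classic Lincoln-Petersen estimator (with the Chapman correction) - all the snail counts below are, of course, hypothetical:

    # Lincoln-Petersen estimate of a closed population from one mark-recapture round.
    # All numbers are hypothetical, purely to illustrate the arithmetic.

    def lincoln_petersen(marked_first: int, caught_second: int, recaptured: int) -> float:
        """Estimate total population as roughly (M * C) / R, using the Chapman
        correction to reduce bias when the recapture count is small."""
        return (marked_first + 1) * (caught_second + 1) / (recaptured + 1) - 1

    # Example: 120 snails marked on the first visit, 150 caught on the second visit,
    # of which 40 were already marked.
    estimate = lincoln_petersen(marked_first=120, caught_second=150, recaptured=40)
    print(f"Estimated population in the study area: {estimate:.0f}")  # roughly 445

The same proportional logic underpins the fancier multi-sample methods the statisticians use; the point is simply that a couple of honest counts plus a little algebra turn "we have no idea" into a defensible estimate.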

Go back to the last paragraph and swap "hackers" for Giant Green Land Snails and "law enforcement" for biologists, and you have a technique for estimating the size (or, in fact, other characteristics) of the hacker population. It's "only" an estimate, but that is better than a pure guess since it is based on measuring, counting and statistics, i.e. it has a scientific, factual, reasonably repeatable and accurate basis. Douglas Hubbard's excellent book "How to Measure Anything" talks at length about the value of estimation, and (in some circumstances) even of what one might call WAGs (wild-arse guesses) - it's a very stimulating read.

So, I'm thinking about how to apply this to measuring information security risks, threats in particular. We have partial knowledge of the threats Out There (and, to be accurate, In Here too), gleaned from identified incidents that have been investigated back to the corresponding threats. There are other threats that are dormant or emerging, or that are so clever/lucky as to have escaped detection so far (Advanced Persistent Threats and others). There are errors in our processes for identifying and investigating incidents (meaning there are measurement risks - a chance that we will materially miscalculate things and over- or under-estimate), and a generalized secrecy in this field that makes it tricky to gather and share reliable statistics, although some information is public knowledge or is shared within trusted groups. But the overall lesson is that the problem of the "known and unknown unknowns" is not intractable: there are data, there are methods, there is a need to estimate threats, and it can be done.
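
To make that concrete, here is a hedged sketch of the same mark-recapture arithmetic applied to threat data. The detection sources and the numbers are entirely hypothetical; the point is that the overlap between two independent sources is what lets us estimate how much we are not seeing:

    # Sketch: reuse the mark-recapture idea on two independent detection sources.
    # Hypothetical numbers: source A saw 60 distinct intrusion sets this year,
    # source B saw 45, and 20 were seen by both.
    seen_a, seen_b, seen_both = 60, 45, 20

    # Chapman-corrected Lincoln-Petersen estimate of the total number of active
    # intrusion sets, including those that neither source detected.
    estimated_total = (seen_a + 1) * (seen_b + 1) / (seen_both + 1) - 1
    estimated_unseen = estimated_total - (seen_a + seen_b - seen_both)
    print(f"Estimated total: {estimated_total:.0f}, of which ~{estimated_unseen:.0f} undetected")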

One thing the scientists do but we information security bods don't (usually) is to calculate the likely errors associated with their numbers. So, in the snail example, the study might estimate that "There are 2,500 Giant Green Land Snails on the island, with a standard deviation of 850" or, putting it another way, "We are 95% certain that the total population of Giant Green Land Snails on the island is between about 830 and 4,170". There are numerous situations in information security in which errors or confidence limits could be calculated statistically from our data, but we very rarely (if ever!) see them in print - for instance, in survey-type studies where there are sufficient people or organizations in the pool for the statistics to work out (and with the right statistics, a surprisingly small sample may be sufficient, less than the 30 that used to be our rule of thumb).
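
As a small illustration of what those error bars might look like in a security survey (the figures here are invented, and the normal approximation is only one of several ways to do it):

    import math

    def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple:
        """Approximate 95% confidence interval for a survey proportion
        (normal approximation; fine for rough error bars on decent samples)."""
        p = successes / n
        half_width = z * math.sqrt(p * (1 - p) / n)
        return max(0.0, p - half_width), min(1.0, p + half_width)

    # Hypothetical survey: 18 of 40 organizations report a malware incident this year.
    low, high = proportion_ci(successes=18, n=40)
    print(f"Estimated incident rate: 45% (95% CI roughly {low:.0%} to {high:.0%})")

Even that crude interval ("somewhere between about 30% and 60%") is far more honest, and more useful for risk analysis, than a bare "45% of organizations suffered a malware incident" headline.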

Speaking as a reformed (resting, latent, recuperating, ex-) scientist, I find current real-world information security practice largely unscientific, outside of academic studies and journals anyway. Aside from the issue just mentioned, surveys and other data sources rarely explain their methods properly - for instance they may (if we're lucky) mention the sample size, or more often the number of respondents (a different parameter), but seldom are we told exactly how the sample was selected. With vendor-sponsored surveys, there is a distinct possibility that the sampling was far from random (e.g. they surveyed their own customers, who have patently expressed a preference for the vendor's products). Small, stratified and often self-selected samples are the norm, as are implicit or explicit extrapolations to the entire world.

Consequently, for risk analysis purposes, we are often faced with using a bunch of numbers of uncertain vintage and dubious origins, with all manner of biases and constraints. And, surprise surprise, we are often caught out by invalid assumptions. Ho hum.



PS  The snails and birds are pigments of my imagination.