Information Security Metric of the Week #53: entropy of encrypted content
Randomness is a crucial concept in cryptography. Steganography aside, strongly encrypted information appears totally random, with no discernible patterns or indicators that would give cryptanalysts clues to recover the original plaintext.
"Entropy" is a convenient term we're using here to describe a measure of randomness or uncertainty - we're being deliberately vague in order to avoid getting embroiled in the details of measuring or calculating this metric. And, to be frank, because Shannon goes way over our heads.
We envisage ACME using this metric (howsoever defined) to compare encryption systems or algorithms on a common basis, for instance when assessing new encryption products for use in protecting an extremely confidential database of pre-patent information. Faced with a shortlist of products, management seeks reassurance as to their suitability beyond the vendors' marketing hyperbole. The assessment process involves encrypting one or more specific data files with each of the systems or algorithms, then determining the randomness of the resulting ciphertexts using an appropriate mathematical calculation, or indeed several. For completeness, the calculations might be repeated using a variety of encryption keys in case any of the systems/algorithms has limitations in that respect. The ones that produce the most random ciphertext are the strongest encryption systems/algorithms. QED.
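By way of illustration only, a rough harness for that comparison might look like the sketch below. The candidate encrypt callables (product_a_encrypt and so on) and the sample file name are hypothetical stand-ins for whatever products ACME is actually evaluating, and the byte-frequency entropy calculation is the one from the previous sketch.

```python
import math, os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte, as in the previous sketch."""
    counts = Counter(data)
    return -sum((n / len(data)) * math.log2(n / len(data)) for n in counts.values())

def assess(candidates, plaintext: bytes, trials: int = 10) -> None:
    """Encrypt the same plaintext under several random keys per candidate and
    report the lowest (worst-case) ciphertext entropy observed."""
    for name, encrypt in candidates.items():        # encrypt(key, plaintext) -> bytes
        worst = min(shannon_entropy(encrypt(os.urandom(32), plaintext))
                    for _ in range(trials))
        print(f"{name:<20} worst-case entropy: {worst:.4f} bits/byte")

# Hypothetical usage, with product_a_encrypt etc. wrapping the real products:
# assess({"Product A": product_a_encrypt, "Product B": product_b_encrypt},
#        open("pre_patent_sample.db", "rb").read())
```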
The PRAGMATIC ratings for this metric are mostly high, apart from a glaring exception: Meaningfulness rated a pitiful 3% when the metric was assessed by ACME's management, since it appears Shannon went way over their heads too! The overall PRAGMATIC score of 59% would no doubt have been much higher if management understood the concept. In any case, the metric is of interest to ACME's IT and information security professionals involved directly in the product selection process; in other words, this could be a worthwhile operational as opposed to management metric, even if the techies need to explain the end result to their bosses, patiently, in words of one syllable or less.
PS Luther Martin, writing in the May 2014 issue of ISSA Journal, discussed the percentage compression [such as that reported by WinZip] as a guide to the randomness of ciphertext.
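As a rough sketch of that idea (our illustration, not Luther's exact method): compress the ciphertext with a standard deflate library and look at the size ratio; anything that shrinks appreciably is clearly not as random as it ought to be.

```python
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size. Close to 1.0 means essentially
    incompressible, which is consistent with randomness; well below 1.0 means
    there is bias or structure that a cryptanalyst might also exploit."""
    return len(zlib.compress(data, level=9)) / len(data)

# e.g. compression_ratio(open("ciphertext.bin", "rb").read())   # hypothetical file
```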
PPS In the September 2016 issue of ISSA Journal, Luther (again) plus Tim Roake wrote about different definitions, meanings or measures of entropy, with various assumptions or prerequisites that can invalidate the calculations. The randomness of a data set reflects both (a) the frequencies of individual bits, digits or characters in the set and (b) the unpredictability or absence of pattern in the sequence. A binary sequence such as 11111111 does not appear random because it clearly has a marked 'excess' of 1s over 0s; the sequence 10101010, despite its even frequencies, is also probably not random since it has an obvious pattern, allowing us to predict future values. (a) is easy to measure, providing a relatively cheap and simple way to check whether supposedly strongly encrypted data are markedly biased. However, measuring or testing (b) is tricky, especially as 'patterns' may be quite obscure and complex. That pragmatic 'percentage compression' measure from WinZip is crude and insufficient for situations where randomness truly matters.
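To make the (a)-versus-(b) distinction concrete, here is a toy illustration in the spirit of the NIST frequency and runs tests (our sketch, not theirs): 10101010 passes a naive frequency check yet fails badly on runs, while 11111111 fails both.

```python
def bit_balance(bits: str) -> float:
    """Check (a): fraction of 1s; about 0.5 suggests no gross frequency bias."""
    return bits.count("1") / len(bits)

def run_count(bits: str) -> int:
    """Check (b), crudely: number of runs of identical bits. A random string of
    length n averages roughly (n + 1) / 2 runs; far more (strict alternation)
    or far fewer (long blocks) points to a pattern."""
    return 1 + sum(a != b for a, b in zip(bits, bits[1:]))

for s in ("11111111", "10101010", "11010001"):
    print(s, f"balance={bit_balance(s):.2f}", f"runs={run_count(s)}")
```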