Measuring the Information Content of Data, Part I

In physics, entropy measures how much of a system’s energy is no longer available for productive work. The entropy of a closed system increases whenever one part of it does work on another part, owing to friction and other causes. As a result, without additional energy or work from the outside, every closed system, from a car engine to a nuclear reactor, will eventually stop being productive.

Data science has its own version of entropy, known as Shannon entropy. It is named after Claude Shannon, a Bell Labs alum who founded the mathematical discipline of information theory. Information theory deals with questions such as whether there exists an encryption algorithm that is truly unbreakable (yes: the one-time pad), and how quickly you need to sample an audio recording to reproduce it perfectly for the human ear (44.1 kHz is sufficient, and is used by CDs, iTunes, and Spotify).

Shannon entropy measures how much information a block of data contains. To see how it works, imagine encoding the value of a coin toss. There are two outcomes (heads and tails), so you just need to store a “0” if the outcome is heads, and a “1” if the outcome is tails. In computer science, this type of variable is called a bit. You can use a series of bits to store a sequence of coin tosses: for example, (heads, heads, tails, heads) could be stored as “0010”.
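Here is a minimal Python sketch of that encoding, purely for illustration (the function name is my own): heads becomes “0”, tails becomes “1”, and a sequence of tosses becomes a string of bits.

```python
# Minimal sketch: encode a sequence of coin tosses as a bit string,
# using "0" for heads and "1" for tails.
def encode_tosses(tosses):
    return "".join("0" if toss == "heads" else "1" for toss in tosses)

print(encode_tosses(["heads", "heads", "tails", "heads"]))  # prints "0010"
```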

To encode an event with more than two outcomes, you need more than one bit. Note that two bits can store four distinct values: the encodings are “00”, “01”, “10”, and “11”. Likewise, three bits can store eight distinct values: “000”, “001”, “010”, “011”, “100”, “101”, “110”, “111”. In the general case, the number of bits needed to encode x distinct values is log2(x), rounded up to the next whole number when x is not a power of two. The logarithm is taken with respect to the base 2 because each individual bit can encode exactly two values.
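In code, that rounding is just a ceiling applied to the base-2 logarithm. A short illustrative sketch (again, just an example, not a library function):

```python
import math

def bits_needed(x):
    # Smallest whole number of bits that can distinguish x values.
    return math.ceil(math.log2(x))

for x in (2, 4, 8, 26, 32):
    print(x, "values need", bits_needed(x), "bits")
# 2 -> 1, 4 -> 2, 8 -> 3, 26 -> 5, 32 -> 5
```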

Using multiple bits lets us store more complex information: for example, 5 bits can store 32 distinct values, which is enough for the English alphabet and a few punctuation marks. A series of 5-bit chunks can store English sentences, such as “HELLO, WORLD” and “WE HOLD THESE TRUTHS TO BE SELF-EVIDENT: THAT ALL MEN ARE CREATED EQUAL”. This last sentence has 59 characters (57 letters and 2 punctuation marks). If each character is encoded with 5 bits, the total sentence requires 295 bits. Or does it?
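One way to see where the 295 comes from is to build a toy 5-bit code and count. The symbol table below is an assumption made purely for illustration (any 32 symbols would work); spaces are skipped so the count matches the 59 characters above.

```python
# A toy fixed-width (5-bit) code: 26 letters plus a few punctuation marks.
# Which extra symbols fill the remaining slots is an assumption; 2**5 = 32
# codes are available, so there is room for 6 of them.
SYMBOLS = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [",", ":", "-", ".", "!", "?"]
CODE = {s: format(i, "05b") for i, s in enumerate(SYMBOLS)}

def encode(text):
    # Spaces are skipped, matching the character count used in the post.
    return "".join(CODE[c] for c in text if c != " ")

sentence = "WE HOLD THESE TRUTHS TO BE SELF-EVIDENT: THAT ALL MEN ARE CREATED EQUAL"
bits = encode(sentence)
print(len(bits))  # 295 bits: 59 characters at 5 bits each
```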

Not necessarily. This is important because, in practice, these bits need to be stored somewhere. And while hard drives keep getting larger, they can’t begin to keep up with the sheer volume of available data. Using the smallest possible encoding is also crucial for web browsers and streaming services (such as YouTube) that must transfer large amounts of information over a fixed bandwidth.

So, can we do better? Yes. In fact, the values calculated above are an upper bound: a sort of worst-case scenario. Shannon entropy improves on this estimate, often by quite a bit. It does so by recognizing that, contrary to Jefferson’s dictum, not all sentences are created equal. We’ll see what that means, and how the calculation works, in Thursday’s post.
