Measuring the Information Content of Data, Part II

In Monday’s post, I wrote about how to measure how much information there is in a chunk of data. The key metric is Shannon entropy. For a given chunk of data, such as a document, the Shannon entropy is defined as:

H = -Σ pi(x) × log2(pi(x))

where the sum runs over every unique character x in the message.

It helps to break this definition into two parts: the first part, pi(x), is the probability a given character x appears in the message. For example, in the message “HELLO, WORLD”, the character H appears once. The message contains 12 characters (including the space and comma). So, the probability of H appearing in this message is 1/12. Likewise, the character L appears 3 times, so pi(L) is 3/12.
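Here’s a quick sketch of this step in Python, using collections.Counter to count characters:

    from collections import Counter

    message = "HELLO, WORLD"
    counts = Counter(message)        # character -> number of occurrences
    total = len(message)             # 12, counting the space and the comma

    # Probability of each unique character appearing in the message
    probabilities = {char: count / total for char, count in counts.items()}

    print(probabilities["H"])        # 0.0833... (1/12)
    print(probabilities["L"])        # 0.25     (3/12)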

Now let’s turn to the second part of the definition: log2(pi(x)). In Monday’s post, we saw that it takes log2(x) bits to encode x distinct characters. The term log2(pi(x)) tells us the same thing. However, because it takes the probability of x as its input, rather than the length of the message, it returns a negative value instead of a positive one (for example, log2(4) = 2, while log2(1/4) = -2). Since entropy is a positive quantity, the definition begins with a minus sign to cancel out the negative value this introduces.
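In code, the sign flip looks like this (using L’s probability from above):

    import math

    p_L = 3 / 12                  # probability of L in "HELLO, WORLD"
    print(math.log2(p_L))         # -2.0: negative, because p_L is less than 1
    print(-math.log2(p_L))        #  2.0: the leading minus sign makes it positive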

(It may seem that introducing a negative quantity only to negate it just adds complexity. However, by using the probability in the second part of the definition, instead of the message length, the entropy calculation only needs one input instead of two.)

Let’s recap: the first part of the definition, pi(x), is the probability of a given character x appearing in the message. The second part, log2(pi(x)), once the leading minus sign makes it positive, is how many bits it takes to pick one value out of 1/pi(x) equally likely possibilities. Their product is the number of bits needed to encode the character x, weighted by how often x actually appears. Formally, this is the expected number of bits that x contributes per character of the message. For L, that works out to (3/12) × 2 = 0.5 bits.

Now, the summation sign tells us to (1) perform this calculation for every unique character in the message, and (2) sum the results. The final result is the Shannon entropy of the message. For “HELLO, WORLD”, then, we have the following terms:

Character    Probability    Nbr. of Bits    Expected Bits
L            3/12           2.000           0.500
O            2/12           2.585           0.431
the rest     1/12 (each)    3.585           0.299 (each)

There are 7 characters in “the rest”, each contributing 0.299 expected bits, so the total Shannon entropy is 0.500 + 0.431 + 7 × 0.299, or about 3.02 bits per character. In practice, we can’t work with “part of a bit”, so we round up to 4 bits per character, which still beats the naive 5-bit encoding from Monday’s post.
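Here’s a short Python sketch of the whole calculation; it reproduces the table above and the total:

    import math
    from collections import Counter

    def shannon_entropy(message):
        """Shannon entropy of a message, in bits per character."""
        counts = Counter(message)
        total = len(message)
        entropy = 0.0
        for count in counts.values():
            p = count / total                    # pi(x)
            entropy += p * -math.log2(p)         # expected bits contributed by x
        return entropy

    # Per-character breakdown for "HELLO, WORLD"
    message = "HELLO, WORLD"
    for char, count in Counter(message).items():
        p = count / len(message)
        print(repr(char), round(-math.log2(p), 3), round(p * -math.log2(p), 3))

    print(round(shannon_entropy(message), 3))    # 3.022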

I mentioned at the end of Monday’s post that we can improve on a naive encoding scheme because all characters aren’t created equal. Shannon entropy shows how much can be gained when a message contains characters that repeat (e.g., the L in “HELLO, WORLD”). If a character appears only once, the number of bits needed to encode it (3.585 above) is exactly the naive estimate for a 12-character message with no repeats: there’s no gain in that case. However, most messages have some repeated characters, and for these, an encoding scheme can improve on the naive estimate. In fact, the more often a character is repeated, the greater the improvement.
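To make that last point concrete, here’s a small standalone sketch (a condensed version of the function above) comparing three 12-character messages:

    import math
    from collections import Counter

    def shannon_entropy(message):
        counts = Counter(message)
        return sum(-(c / len(message)) * math.log2(c / len(message))
                   for c in counts.values())

    print(round(shannon_entropy("ABCDEFGHIJKL"), 3))   # 3.585: no repeats, no gain
    print(round(shannon_entropy("HELLO, WORLD"), 3))   # 3.022: a few repeats
    print(round(shannon_entropy("AAAAAAAABBBB"), 3))   # 0.918: heavy repetition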

Finally, Shannon entropy also has an analogue in the physical world. At the atomic level, every physical system consists of a finite number of atoms, each in one of a finite number of possible states (upon measurement). It is logically consistent to consider the system’s entropy as the amount of information it contains, with a perfectly ordered system containing no information, and a perfectly disordered system containing the maximum amount of information.

For all such systems, the Copenhagen interpretation of quantum mechanics treats the system’s wave function as a probability distribution over the possible states it can realize: as the system evolves, the available information increases, and these probabilities update. Once maximum disorder is reached, no further information is available: the system has exhausted what it can produce, and what we can know about it.

Measuring the Information Content of Data, Part I

In physics, entropy measures how much of a system’s energy is no longer available for productive work. The entropy of a closed system increases whenever one part of it does work on another part, owing to friction and other causes. As a result, without additional energy or work from the outside, every closed system, from a car engine to a nuclear reactor, will eventually stop being productive.

Data science has its own version of entropy, known as Shannon entropy. It is named after Claude Shannon, a Bell Labs alum who founded the mathematical discipline of information theory. Information theory deals with questions such as whether there exists an encryption algorithm that is truly unbreakable (yes: the one-time pad), and how quickly you need to sample an audio recording to reproduce it perfectly for the human ear (44.1kHz is sufficient, and is used by CDs, iTunes, and Spotify).

Shannon entropy measures how much information a block of data contains. To see how it works, imagine encoding the value of a coin toss. There are two outcomes (heads and tails), so you just need to store a “0” if the outcome is heads, and a “1” if the outcome is tails. In computer science, this type of variable is called a bit. You can use a series of bits to store a sequence of coin tosses: for example, (heads, heads, tails, heads) could be stored as “0010”.
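Here’s a quick sketch of this in Python (any consistent mapping of heads and tails to 0 and 1 works):

    # Encode a sequence of coin tosses as bits: heads -> "0", tails -> "1"
    tosses = ["heads", "heads", "tails", "heads"]
    bits = "".join("0" if toss == "heads" else "1" for toss in tosses)
    print(bits)   # 0010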

To encode an event with more than two outcomes, you need more than just one bit. Note that two bits can store four distinct values: the encodings are “00”, “01”, “10”, and “11”. Likewise, three bits can store eight distinct values: “000”, “001”, “010”, “011”, “100”, “101”, “110”, “111”. In the general case, the number of bits needed to encode x distinct values is log2(x), rounded up when x is not a power of two. The logarithm is taken with respect to the base 2 because each individual bit can encode exactly two values.
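A quick sketch of that relationship in Python:

    import math

    # Bits needed to encode n distinct values: log2(n), rounded up
    for n in (2, 4, 8, 26, 32):
        print(n, math.ceil(math.log2(n)))
    # 2 -> 1, 4 -> 2, 8 -> 3, 26 -> 5, 32 -> 5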

Using multiple bits lets us store more complex information: for example, 5 bits can store 32 distinct values, which is enough for the English alphabet and a few punctuation marks. A series of 5-bit chunks can store English sentences, such as “HELLO, WORLD” and “WE HOLD THESE TRUTHS TO BE SELF EVIDENT: THAT ALL MEN ARE CREATED EQUAL”. This last sentence has 71 characters (57 letters, 13 spaces, and a colon). If each character is encoded with 5 bits, the total sentence requires 355 bits. Or does it?
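Here’s a rough sketch of one such 5-bit encoding in Python (the particular code assignment is arbitrary; any one-to-one mapping of 32 symbols to 5-bit patterns would do):

    # Assign each symbol a 5-bit code: A-Z get 0-25, then a few punctuation marks
    symbols = [chr(ord("A") + i) for i in range(26)] + [" ", ",", ":", ".", "!", "?"]
    codes = {s: format(i, "05b") for i, s in enumerate(symbols)}   # 32 symbols, 5 bits each

    sentence = "WE HOLD THESE TRUTHS TO BE SELF EVIDENT: THAT ALL MEN ARE CREATED EQUAL"
    encoded = "".join(codes[c] for c in sentence)

    print(len(sentence))   # 71 characters
    print(len(encoded))    # 355 bits (71 * 5)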

Not necessarily. This is important because, in practice, these bits need to be stored somewhere. And while hard drives keep getting larger, they can’t begin to keep up with the sheer volume of available data. Using the smallest possible encoding is also crucial for web browsers and streaming services (such as YouTube) that must transfer large amounts of information over a fixed bandwidth.

So, can we do better? Yes. In fact, the values calculated above are an upper bound: a sort of worst-case scenario. Shannon entropy shows how much we can improve on that estimate, often by quite a bit. It does so by recognizing that, contrary to Jefferson’s dictum, all characters are not created equal. We’ll see what that means, and how the calculation works, in Thursday’s post.