In Monday’s post, I wrote about how to measure how much information there is in a chunk of data. The key metric is Shannon entropy. For a given chunk of data, such as a document, the Shannon entropy is defined as:

H = -Σ p(x) * log2(p(x))

where the sum runs over every unique character x in the message.
It helps to break this definition into two parts. The first part, p(x), is the probability that a given character x appears in the message. For example, in the message “HELLO, WORLD”, the character H appears once. The message contains 12 characters (including the space and the comma), so the probability of H appearing in this message is 1/12. Likewise, the character L appears 3 times, so p(L) is 3/12.
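These per-character probabilities are just occurrence counts divided by the message length. A minimal sketch in Python (the variable names are my own, not from the post):

```python
from collections import Counter

message = "HELLO, WORLD"
counts = Counter(message)  # maps each character to its number of occurrences
length = len(message)      # 12, counting the space and the comma

print(counts["H"], "/", length)  # H appears once: probability 1/12
print(counts["L"], "/", length)  # L appears 3 times: probability 3/12
```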
Now let’s turn to the second part of the definition: log2(p(x)). In Monday’s post, we saw that it takes log2(n) bits to encode n distinct characters. The term log2(p(x)) tells us the same thing; however, because it takes the probability of x as its input — a value between 0 and 1 — rather than a count, it returns a negative value instead of a positive one (that is, log2(4) = 2, while log2(1/4) = -2). Since entropy is a positive quantity, the definition begins with a minus sign to cancel out the negative quantity this introduces.
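You can see the sign flip directly, using Python's `math.log2`:

```python
import math

# log2 of a count is positive; log2 of a probability (< 1) is negative
print(math.log2(4))       # 2.0
print(math.log2(1 / 4))   # -2.0
print(-math.log2(1 / 4))  # 2.0 -- the leading minus sign restores a positive value
```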
(It may seem that introducing a negative quantity only to negate it adds needless complexity. However, by using the probability in the second part of the definition instead of the message length, the entropy calculation needs only one input instead of two.)
Let’s recap: the first part of the definition, p(x), is the probability of a given character appearing in the message. The second part, -log2(p(x)), is how many bits it takes to encode one of 1/p(x) equally likely characters. Their product is the number of bits needed to encode the character x, weighted by how often x actually appears. Formally, this is the expected number of bits needed to encode x in the given message.
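For a single character, the calculation looks like this — here for L in “HELLO, WORLD” (a sketch; the variable names are mine):

```python
import math

p = 3 / 12            # L appears 3 times in 12 characters
bits = -math.log2(p)  # 2.0 bits: L is one of 1/p = 4 equally likely characters
expected = p * bits   # weighted by how often L actually appears
print(expected)       # 0.5
```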
Now, the summation sign tells us to (1) perform this calculation for every unique character in the message, and (2) sum the results. The final result is the Shannon entropy of the message. For “HELLO, WORLD”, then, we have the following terms:
| Character | Probability | Nbr. of Bits | Expected Bits |
|-----------|-------------|--------------|---------------|
| L | 3/12 | 2.000 | 0.500 |
| O | 2/12 | 2.585 | 0.431 |
| H, E, W, R, D, comma, space (each) | 1/12 | 3.585 | 0.299 |
There are 7 characters in “the rest”, each contributing about 0.299 expected bits, so the total Shannon entropy is 0.500 + 0.431 + 7 × 0.299 ≈ 3.02 bits per character. In practice, we can’t work with “part of a bit”, so we round up to 4.
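The whole calculation fits in a few lines of Python. This sketch (the function name is my own) performs the sum over every unique character:

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Sum -p(x) * log2(p(x)) over every unique character x in the message."""
    length = len(message)
    return -sum(
        (count / length) * math.log2(count / length)
        for count in Counter(message).values()
    )

print(shannon_entropy("HELLO, WORLD"))  # ~3.022
```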
I mentioned at the end of Monday’s post that we can improve on a naive encoding scheme because all characters aren’t created equal. Shannon entropy shows this can be done when a message contains some characters that repeat (e.g. L, in “HELLO, WORLD”). If a character only appears once, the number of expected bits to encode it is the naive estimate: there’s no gain in this case. However, most messages have some repeat characters, and for these, an encoding scheme can improve on this estimate. In fact, the more often a character is repeated, the greater the improvement.
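The effect of repetition is easy to check with the same entropy function (a sketch; the function name is mine). Four distinct characters need the full naive 2 bits each, while repeats steadily drive the entropy down:

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Sum -p(x) * log2(p(x)) over every unique character x in the message."""
    length = len(message)
    return -sum(
        (count / length) * math.log2(count / length)
        for count in Counter(message).values()
    )

print(shannon_entropy("ABCD"))  # 2.0 -- no repeats, no gain over the naive estimate
print(shannon_entropy("AABB"))  # 1.0
print(shannon_entropy("AAAB"))  # ~0.811
print(shannon_entropy("AAAA"))  # -0.0, i.e. zero: one repeated character carries no information
```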
Finally, Shannon entropy also has an analogue in the physical world. At the atomic level, every physical system consists of a finite number of atoms, each in a definite state (upon measurement). It is logically consistent to consider the system’s entropy as the amount of information it contains, with a perfectly ordered system containing no information, and a perfectly disordered system containing the maximum amount of information.
For all such systems, the Copenhagen interpretation of quantum mechanics treats the system’s wave function as a probability distribution over the possible states it can realize: as the system evolves, the available information increases, and these probabilities update. Once maximum disorder is reached, no further information is available: the system has exhausted what it can produce, and what we can know about it.