Measuring the Information Content of Data, Part II

In Monday’s post, I wrote about how to measure how much information there is in a chunk of data. The key metric is Shannon entropy. For a given chunk of data, such as a document, the Shannon entropy is defined as:

H = -Σ pi(x) · log2(pi(x))

It helps to break this definition into two parts: the first part, pi(x), is the probability that a given character x appears in the message. For example, in the message “HELLO, WORLD”, the character H appears once. The message contains 12 characters (including the space and comma). So, the probability of H appearing in this message is 1/12. Likewise, the character L appears 3 times, so pi(L) is 3/12.
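
Those probabilities are easy to compute directly; here is a minimal Python sketch (with the message hard-coded):

    # Count how often each character appears, then divide by the
    # message length to get each character's probability.
    from collections import Counter

    message = "HELLO, WORLD"
    counts = Counter(message)   # e.g. counts["L"] == 3
    length = len(message)       # 12 characters, including the space and comma

    for char, count in counts.items():
        print(f"p({char!r}) = {count}/{length} = {count / length:.3f}")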

Now let’s turn to the second part of the definition: log2(pi(x)). In Monday’s post, we saw that it takes log2(x) bits to encode x distinct characters. The term log2(pi(x)) tells us the same thing; however, because it takes the probability of x as its input, rather than the length of the message, it returns a negative value instead of a positive one (for example, log2(4) = 2, while log2(1/4) = -2). Since entropy is a positive quantity, the definition begins with a minus sign, to cancel out the negative value this introduces.

(It may seem that introducing a negative quantity only to negate it just adds complexity. However, by using the probability in the second part of the definition, instead of the message length, the entropy calculation only needs one input instead of two.)
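
To see that sign flip concretely, here is a quick check with Python’s math module (a throwaway sketch, nothing more):

    import math

    # log2 of a count of distinct values is positive...
    print(math.log2(4))       # 2.0
    # ...while log2 of the corresponding probability is negative.
    print(math.log2(1 / 4))   # -2.0
    # The leading minus sign in the entropy definition flips it back:
    print(-math.log2(1 / 4))  # 2.0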

Let’s recap: the first part of the definition, pi(x), is the probability of a given character appearing in the message. The second part, log2(pi(x)), tells us (once the minus sign makes it positive) how many bits it takes to encode a message of length 1/pi(x). Their product is the number of bits needed to encode a character x, given that x appears with probability pi(x). Formally, this is the expected number of bits needed to encode x in the given message.

Now, the summation sign tells us to (1) perform this calculation for every unique character in the message, and (2) sum the results. The final result is the Shannon entropy of the message. For “HELLO, WORLD”, then, we have the following terms:

Character    Probability    Nbr. of Bits    Expected Bits
L            3/12           2.000           0.500
O            2/12           2.585           0.431
the rest     1/12           3.585           0.299

There are 7 characters in “the rest”, so the total Shannon entropy is about 3.02 bits per character. In practice, we can’t work with “part of a bit”, so we round up to 4, which is still an improvement on the naive 5 bits per character from Monday’s post.
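
The whole calculation fits in a few lines of Python. The sketch below (again with the message hard-coded) reproduces the table and total above:

    import math
    from collections import Counter

    message = "HELLO, WORLD"
    n = len(message)

    entropy = 0.0
    for char, count in Counter(message).items():
        p = count / n           # probability of this character
        bits = -math.log2(p)    # bits needed to encode it
        entropy += p * bits     # its expected contribution
        print(f"{char!r}: p = {count}/{n}, bits = {bits:.3f}, expected = {p * bits:.3f}")

    print(f"Shannon entropy: {entropy:.3f} bits per character")  # about 3.022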

I mentioned at the end of Monday’s post that we can improve on a naive encoding scheme because all characters aren’t created equal. Shannon entropy shows this can be done when a message contains some characters that repeat (e.g. L, in “HELLO, WORLD”). If a character only appears once, the expected number of bits to encode it is the naive estimate: there’s no gain in this case. However, most messages have some repeat characters, and for these, an encoding scheme can improve on this estimate. In fact, the more often a character is repeated, the greater the improvement.
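
A quick way to see that effect is to compare the entropy of a message with no repeats to one of the same length with heavy repetition (an illustrative sketch with made-up messages):

    import math
    from collections import Counter

    def entropy(message):
        """Shannon entropy, in bits per character."""
        n = len(message)
        return -sum((c / n) * math.log2(c / n) for c in Counter(message).values())

    # Eight distinct characters, no repeats: no gain over the naive 3 bits.
    print(entropy("ABCDEFGH"))   # 3.0
    # Same length, heavy repetition: far fewer bits per character.
    print(entropy("AAAAAABC"))   # about 1.06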

Finally, Shannon entropy also has an analogue in the physical world. At the atomic level, every physical system consists of a finite number of atoms, each in one of a finite number of states (upon measurement). It is logically consistent to consider the system’s entropy as the amount of information it contains, with a perfectly ordered system containing no information, and a perfectly disordered system containing the maximum amount of information.

For all such systems, the Copenhagen interpretation of quantum mechanics treats the system’s wave function as a probability distribution over the possible states it can realize: as the system evolves, the available information increases, and these probabilities update. Once maximum disorder is reached, no further information is available: the system has exhausted what it can produce, and what we can know about it.

Measuring the Information Content of Data, Part I

In physics, entropy measures how much of a system’s energy is no longer available for productive work. The entropy of a closed system increases whenever one part of it does work on another part, owing to friction and other causes. As a result, without additional energy or work from the outside, every closed system, from a car engine to a nuclear reactor, will eventually stop being productive.

Data science has its own version of entropy, known as Shannon entropy. It is named after Claude Shannon, a Bell Labs alum who founded the mathematical discipline of information theory. Information theory deals with questions such as whether there exists an encryption algorithm that is truly unbreakable (yes: the one-time pad), and how often you need to sample an audio recording to reproduce it perfectly for the human ear (44.1 kHz is sufficient, and is used by CDs, iTunes, and Spotify).

Shannon entropy measures how much information a block of data contains. To see how it works, imagine encoding the value of a coin toss. There are two outcomes (heads and tails), so you just need to store a “0” if the outcome is heads, and a “1” if the outcome is tails. In computer science, this type of variable is called a bit. You can use a series of bits to store a sequence of coin tosses: for example, (heads, heads, tails, heads) could be stored as “0010”.
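
In code, that encoding is a one-liner (a sketch, with the toss outcomes hard-coded):

    # One bit per toss: "0" for heads, "1" for tails.
    tosses = ["heads", "heads", "tails", "heads"]
    encoded = "".join("0" if t == "heads" else "1" for t in tosses)
    print(encoded)  # "0010"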

To encode an event with more than two outcomes, you need more than just one bit. Note that two bits can store four distinct values: the encodings are “00”, “01”, “10”, and “11”. Likewise, three bits can store eight distinct values: “000”, “001”, “010”, “011”, “100”, “101”, “110”, “111”. In the general case, the number of bits needed to encode x distinct values is log2(x). The logarithm is taken with respect to the base 2 because each individual bit can encode exactly two values.
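
The same relationship, in Python (rounding up to a whole bit when x is not a power of two):

    import math

    for x in (2, 4, 8, 26, 32):
        print(f"{x} distinct values need {math.ceil(math.log2(x))} bit(s)")
    # 2 -> 1, 4 -> 2, 8 -> 3, 26 -> 5, 32 -> 5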

Using multiple bits lets us store more complex information: for example, 5 bits can store 32 distinct values, which is enough for the English alphabet and a few punctuation marks. A series of 5-bit chunks can store English sentences, such as “HELLO, WORLD”, and “WE HOLD THESE TRUTHS TO BE SELF EVIDENT: THAT ALL MEN ARE CREATED EQUAL”. This last sentence has 59 characters (57 letters, and 2 punctuation marks). If each character is encoded with 5 bits, the total sentence requires 295 bits. Or does it?
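
Here is a sketch of such a fixed-width scheme, using a hypothetical 5-bit code built from just the characters we happen to need (the alphabet, the space, and a couple of punctuation marks):

    # Assign each character in our small alphabet a fixed 5-bit code.
    alphabet = sorted(set("ABCDEFGHIJKLMNOPQRSTUVWXYZ ,:"))
    assert len(alphabet) <= 32          # 5 bits can cover at most 32 values
    code = {char: format(i, "05b") for i, char in enumerate(alphabet)}

    message = "HELLO, WORLD"
    encoded = "".join(code[c] for c in message)
    print(len(message), "characters ->", len(encoded), "bits")  # 12 characters -> 60 bits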

Not necessarily. This is important because, in practice, these bits need to be stored somewhere. And while hard drives keep getting larger, they can’t begin to keep up with the sheer volume of available data. Using the smallest possible encoding is also crucial for web browsers and streaming services (such as YouTube) that must transfer large amounts of information over a fixed bandwidth.

So, can we do better? Yes. In fact, the values calculated above are an upper bound: a sort of worst-case scenario. Shannon entropy improves on this estimate, often by quite a bit. It does so by assuming that, contrary to Jefferson’s dictum, all characters are not created equal. We’ll see what that means, and how the calculation works, in Thursday’s post.

Cloudera’s Virtual Hadoop Cluster, Part II: Creating a Data Lake

On Monday, I wrote about how to install and set up Cloudera’s Virtual Hadoop Cluster, which is named the Cloudera Quickstart VM. After the cluster is up and running, the next step is to build a data lake so that you can import data to analyze. You can find guidance on how to do that at Part II: Building a Basic Data Lake for the Cloudera Quickstart VM. If you haven’t finished installing the Quickstart VM, you can find a step-by-step guide here.

Both Part I and Part II have a contact form at the bottom: if you get stuck, feel free to reach out, and I’ll do my best to help.

Try Out a Virtual Hadoop Cluster at Home for Free

One of the great things about Hadoop is that it’s open source: anyone can download the packages from the official Hadoop website and install them (brave souls can even look at the source code). That said, getting the different Hadoop services, such as the filesystem (HDFS), resource manager (YARN), and data processor (MapReduce), to work together often takes some work. And that work only increases if you also want Hive, HBase, and the dozen or so other services that form the core processing engine of most real-world clusters.

Because of this, a handful of companies now bundle these services as a single, proven package. As I write this, Cloudera and HortonWorks share most of the market, although MapR is gaining ground fast (no doubt because their own proprietary file system has significant advantages over Hadoop’s native file system). Cloudera, in particular, also has a freely available virtual Hadoop cluster that uses their Cloudera Manager platform, which means that those interested in Hadoop can learn how it works without any upfront cost.

To that end, I’ve written a two-part series on getting started with the Cloudera Manager Virtual Cluster (named the Cloudera Quickstart VM). Both posts use Talend Open Studio, a highly popular data management tool that is also free to download. In the first post, I show how to connect to the Cloudera Quickstart VM using Talend Open Studio, while in the second post, I show how to begin creating a Hadoop data lake, and populating it with data. You can find the first part of the series at the link below; I’ll post the second part on Thursday.

Happy programming!

Part I: Downloading Cloudera Quickstart and Getting Connected

Big Data’s File Drawer Problem

In social science, the File Drawer Problem refers to the fact that negative results are very rarely published. The problem is named for the hypothetical “file drawer” that every researcher has, where she keeps the failed experiments and research that never panned out. Having worked in the academic world, I can confirm these file drawers, both literal and figurative, do exist.

The reason this is problematic is that, for an article to be published, it has to show its results are statistically significant. This is because statistical models can’t be proven right: they can only be shown to be not wrong with a given probability. Typically, the threshold to get published is significance at the 95% level, which leaves a 1 in 20 chance that a model appears significant but, in fact, isn’t.

The File Drawer Problem comes in because this threshold assumes that researchers are equally likely to publish positive results that meet the threshold, and negative results that do not. However, a researcher who conducts 20 experiments will, on average (and all else equal), find one is significant just from chance. If she only sends that one article off for publication, and puts the other 19 in her file drawer, then her findings aren’t actually valid: other researchers can’t rely on them. And if she’s not the only one (and there’s good evidence that she isn’t), then a sizeable part of the accepted knowledge in her discipline simply isn’t true.
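
The arithmetic behind that “one in twenty” claim is easy to check (a back-of-the-envelope sketch that assumes none of the 20 experiments has a real effect):

    # With a 5% false-positive rate and 20 independent experiments
    # where no real effect exists:
    alpha, experiments = 0.05, 20

    expected_false_positives = alpha * experiments        # 1.0 on average
    prob_at_least_one = 1 - (1 - alpha) ** experiments    # about 0.64

    print(expected_false_positives, round(prob_at_least_one, 2))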

Several methods have been proposed to mitigate this problem, from requiring a higher level of significance, such as 99%, to doing away with significance levels altogether and using other forms of validation. However, no one has proposed updating a model’s findings after it has been accepted for publication. That’s where Big Data comes in.

In her recent book, Weapons of Math Destruction, Cathy O’Neil argues that machine learning models can become destructive when they don’t follow up their conclusions to see if they were, in fact, correct. Admittedly, this can be a challenge: a model that reads resumes to select the best candidates can’t follow the people who weren’t selected and evaluate their performance at their next job. Often, however, people just don’t take the time. The result, O’Neil argues, is these models create their own self-reinforcing version of the truth. In the terms I’ve used here, Big Data also has a File Drawer Problem.

One solution O’Neil offers is to build this type of validation into the model’s design. She notes that models like the Fair Isaac credit score (FICO) are continually updated with new information. If the FICO model predicts someone is a good credit risk, and they end up defaulting, it can update its algorithm. This isn’t perfect: someone rejected as a poor credit risk likely won’t have the chance to disprove the model, because no one will lend to him. Still, updating the model where possible helps keep its predictions in line with what actually happens: it mitigates, to some extent, the File Drawer Problem.

This modeling framework could also work in social science. In theory, researchers are supposed to test each other’s models with new data to see how well they perform. In practice, this rarely happens, both because this type of research is difficult to publish, and because it can earn a researcher some powerful enemies. One alternative is for the researchers themselves to post a brief follow-up study comparing the model’s predictions to what actually happened. This would show the model’s strengths and weaknesses, and could also inform future research (though care would be needed here to avoid building a model solely to predict the observed data).

Will this happen? Probably not: “publish or perish” is more true than ever, and no one is going to risk tenure by publicizing flaws in their own work. However, something like this needs to be done if social science is going to compete in a marketplace of ideas that includes Big Data. Although the two have very different cultures, they share the same goal of making sense of the world. And it goes without saying that sense is something today’s world dearly needs.