Big Data’s File Drawer Problem

In social science, the File Drawer Problem refers to the fact that negative results are very rarely published. The problem is named for the hypothetical “file drawer” that every researcher has, where she keeps the failed experiments and research that never panned out. Having worked in the academic world, I can confirm these file drawers, both literal and figurative, do exist.

The reason this is problematic is that, for an article to be published, it has to show statistically significant results. This is because statistical models can’t be proven right: they can only be shown, to a given probability, not to be wrong. Typically, the threshold to get published is significance at the 95% level, which leaves a 1 in 20 chance that a model appears significant but in fact isn’t.

The File Drawer Problem comes in because this threshold assumes that researchers are equally likely to publish positive results that meet the threshold, and negative results that do not. However, a researcher who conducts 20 experiments will, on average (and all else equal), find one is significant just from chance. If she only sends that one article off for publication, and puts the other 19 in her file drawer, then her findings aren’t actually valid: other researchers can’t rely on them. And if she’s not the only one (and there’s good evidence that she isn’t), then a sizeable part of the accepted knowledge in her discipline simply isn’t true.
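To see how fast chance findings accumulate, here’s a quick simulation in Python (the numbers are illustrative, not drawn from any real study): 100,000 hypothetical researchers each run 20 experiments in which the null hypothesis is true, and each tests at the standard 95% level.

```python
import random

random.seed(42)

ALPHA = 0.05        # the 1-in-20 chance of a spurious "significant" result
EXPERIMENTS = 20    # experiments per researcher
RESEARCHERS = 100_000

# Under the null hypothesis, p-values are uniform on [0, 1], so each
# experiment clears the significance bar by chance with probability ALPHA.
hits = [
    sum(random.random() < ALPHA for _ in range(EXPERIMENTS))
    for _ in range(RESEARCHERS)
]

avg = sum(hits) / RESEARCHERS
at_least_one = sum(n > 0 for n in hits) / RESEARCHERS
print(f"average spurious findings per researcher: {avg:.2f}")  # ~1.00
print(f"share with at least one: {at_least_one:.2f}")          # ~0.64
```

On average, each researcher gets one “finding” purely by chance, and nearly two-thirds get at least one: a well-stocked file drawer indeed.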

Several methods have been proposed to mitigate this problem, from requiring a higher level of significance, such as 99%, to doing away with significance levels altogether and using other forms of validation. However, no one has proposed updating a model’s findings after it has been accepted for publication. That’s where Big Data comes in.

In her recent book, Weapons of Math Destruction, Cathy O’Neil argues that machine learning models can become destructive when they don’t follow up their conclusions to see if they were, in fact, correct. Admittedly, this can be a challenge: a model that reads resumes to select the best candidates can’t follow the people who weren’t selected and evaluate their performance at their next job. Often, however, people just don’t take the time. The result, O’Neil argues, is these models create their own self-reinforcing version of the truth. In the terms I’ve used here, Big Data also has a File Drawer Problem.

One solution O’Neil offers is to build this type of validation into the model’s design. She notes that models like the Fair-Isaac credit score (FICO) are continually updated with new information. If the FICO model predicts someone is a good credit risk, and they end up defaulting, it can update its algorithm. This isn’t perfect: someone rejected as a poor credit risk likely won’t have the chance to disprove the model, because no one will lend to him. Still, updating the model where possible helps keep its predictions in line with what actually happens: it mitigates, to some extent, the File Drawer Problem.

This modeling framework could also work in social science. In theory, researchers are supposed to test each other’s models with new data to see how well they perform. In practice, this rarely happens, both because this type of research is difficult to publish, and because it can potentially make someone powerful enemies. One alternative is for the researchers themselves to post a brief follow-up study comparing the model’s predictions to what actually happened. This would show the model’s strengths and weaknesses, and could also inform future research (though care would be needed here to avoid building a model solely to predict the observed data).

Will this happen? Probably not: “publish or perish” is more true than ever, and no one is going to risk tenure by publicizing flaws in their own work. However, something like this needs to be done if social science is going to compete in a marketplace of ideas that includes Big Data. Although the two have very different cultures, they share the same goal of making sense of the world. And it goes without saying that sense is something today’s world dearly needs.

Happy Thanksgiving!

Happy Thanksgiving to you and yours! Here’s a short clip of a gentleman who was more than ready for the big meal:

Turkey! (Funny Game Show Answers)

Hadoop Basics: What is It, and How Does it Work?


Today, organizations can centralize their data with Hadoop using familiar tools such as SQL and Python. However, it’s hard to take the first step unless you understand how Hadoop works. In this post, I’ll recount Hadoop’s origin story and the problem it was designed to solve.  After that, I’ll discuss its two main parts:

  • The Hadoop Distributed File System, which stores large data sets
  • The MapReduce algorithm, which quickly analyzes them

After reading this post, you’ll have a better understanding of how Hadoop works, and the types of problems it is designed to solve. You can read more about these problems (and how they can damage your business) on this page. You can also arrange a free consultation to discuss your specific needs, and design a roadmap to meet them.

Why It’s Called Hadoop

Hadoop is the name of a stuffed elephant that was much beloved by the son of Doug Cutting, one of Hadoop’s earliest designers. You can see both Mr. Cutting and the toy elephant here.

Programmers like in-jokes, and animal references are a recurrent theme. Because of this, Hadoop’s symbol is an elephant, and its components often refer to animals:

  • Hive, which lets you query Hadoop like a traditional database, using SQL
  • Pig, which analyses unstructured data, such as comments on a web forum
  • Mahout, which builds predictive models based on Hadoop

Pig gets its name because “pigs will eat anything”. Mahout, which builds machine-learning models, is the Indian term for “elephant trainer”: hence, Mahout trains the Hadoop elephant.

Why Hadoop Was Invented

Hadoop began with a Google white paper on how to index the World Wide Web to provide better search results. Today, Google is so dominant that we forget Yahoo!, AltaVista, ChaCha (!), and AskJeeves (!!!) were once genuine competition.

To get there, though, they had to catalogue millions of websites and direct users to them in real time. Their programmers quickly learned that databases like Oracle and SQL Server were too slow.

To see why, imagine opening a very large Excel spreadsheet: it takes so long you may question your decision to open it at all. Large files simply take a long time to load and process. Hadoop was invented to load and analyze large data sets (like an index of all web sites) that traditional databases can’t handle.

There are two parts to Hadoop: the Hadoop file system, and the MapReduce algorithm. We’ll review each in turn.

How Hadoop Works, Part I: The File System

One solution to long loading times is to store data as many smaller files: for example, keeping each year of customer data in a separate file. This can work, but it also breaks up your data set. Imagine if we could store the data as many small files, but work with it as one large file.

This is what Hadoop’s Distributed File System (HDFS) does. The user sees one file, but “behind the scenes” HDFS stores the data as hundreds of smaller files. It also does all the bookkeeping, so that the only difference to the user is that the entire data set loads much faster. See the following graphic:

this level of happiness is typical of the Hadoop user

In technical terms, HDFS overlays the operating system’s own file system. Since Hadoop runs on Linux, the actual data is stored as a group of Linux files. We call HDFS a logical file system, because it looks like the data set is stored in one large file even when it is actually spread out among many different files.

This isn’t a new idea: computers already store larger files as a sequence of multiple blocks. This is done to use as much of the disk as possible, by filling in any “holes” left after files are deleted. As with HDFS, the user doesn’t see any of this, because the operating system (e.g. Windows) keeps track of the blocks belonging to each file.
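The idea of one logical file backed by many physical files can be sketched in a few lines of Python. This is purely conceptual (HDFS is vastly more sophisticated, distributing chunks across machines), but it shows the abstraction HDFS presents to the user:

```python
import os
import tempfile

class LogicalFile:
    """Toy 'logical file' that presents many chunk files as one stream."""

    def __init__(self, chunk_paths):
        self.chunk_paths = chunk_paths

    def read(self):
        # The caller sees one big file; behind the scenes, we stitch
        # together the contents of each chunk in order.
        return "".join(open(path).read() for path in self.chunk_paths)

# Store a "data set" as three separate chunk files...
tmp = tempfile.mkdtemp()
chunks = []
for i, text in enumerate(["2019 data\n", "2020 data\n", "2021 data\n"]):
    path = os.path.join(tmp, f"chunk_{i}.txt")
    with open(path, "w") as f:
        f.write(text)
    chunks.append(path)

# ...then read it back as if it were one file.
print(LogicalFile(chunks).read())
```

The caller never sees the individual chunks: all the bookkeeping lives inside the class, just as HDFS hides its block management from the user.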

How Hadoop Works, Part II: The MapReduce Algorithm

After loading the data, we want to do things with it. For example, Google’s search needs to find the best websites for each user: how is this possible when the data is split into hundreds of small files?

The answer is Hadoop’s MapReduce algorithm. Imagine your data is spread out in hundreds of small files. You don’t know which file stores which data. Because of this, you must analyze all the files: for example, searching every file for the right web site.

However, searching the files one by one would take too long. Instead, MapReduce searches them all simultaneously. See the following graphic:

it took longer to find a good icon for MapReduce than to write this post

Technically, this type of search is called a map: the user defines a function and applies it to a data set to get a new data set. Here, the function just returns TRUE if the record contains the name the user is searching for, and FALSE otherwise.

Fortunately, that’s not all a map can do: for example, it might return a 1 for every word in a document. Then, the program has a data set with one column listing every word, and another containing ones:

apple 1
avocado 1
banana 1
banana 1
date 1

The program could pass this file to a reducer that totals the second column for each word. The reducer’s output would count how many times each word appeared in the document:

apple 1
avocado 1
banana 2
date 1
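In Python terms, the word-count mapper and reducer above can be sketched as follows. This is a conceptual toy running on one machine, not how Hadoop actually distributes the work:

```python
from collections import defaultdict

def mapper(document):
    """Map step: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def reducer(pairs):
    """Reduce step: total the counts for each word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

pairs = mapper("apple avocado banana banana date")
print(reducer(pairs))  # {'apple': 1, 'avocado': 1, 'banana': 2, 'date': 1}
```

In a real cluster, Hadoop runs the mapper on every file simultaneously, then shuffles the intermediate pairs so that all the counts for a given word arrive at the same reducer.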

By using both a mapper and a reducer, Hadoop’s MapReduce algorithm can do almost any type of analysis or calculation. However, this type of program can be challenging to write. For example, assume we have two files of customer names. We want to sort all six names in alphabetical order:

File 01:

File 02:

We can sort each file by itself, but we won’t end up with all six names in the correct order. To do that requires some understanding of the mathematics of how mapper and reducer functions interact.

This poses a challenge, because most programmers aren’t mathematicians. The good news is that there are tools to generate MapReduce code from more familiar languages, such as SQL: I’ll explore these in a future post. In the meantime, if you have any questions or comments on this post, please fill out the contact form below, and I’ll do my best to answer them.

Thoughts on this post? Tell me all about it.

Cloudera Quickstart: Building a Basic Data Lake and Importing Files


In an earlier post, I described how to install Cloudera’s Quickstart virtual Hadoop cluster and connect to it with Talend Open Studio:

Connecting Talend Open Studio to Cloudera Quickstart

Now, I’ll describe how to configure your cluster so you can easily load data into it. In a future post, we’ll use Talend Open Studio to import stock market data and prepare it for later analysis.

First, we need to transfer the data from your host computer into the guest machine running inside Virtual Box: we’ll enable File Sharing in Virtual Box to do this. Once that’s done, we can import the data into the virtual cluster itself. Along the way, we’ll create directories within the Hadoop file system to store the data, and set the correct file ownership and permissions.

Before You Begin

Before following these instructions, you need to have a working Cloudera Quickstart cluster. This post shows how to set this up (connecting to Talend Open Studio isn’t required for this post, but will be necessary later on). Also, we’ll be using the Linux terminal quite a bit, because it makes the job significantly easier. This page gives a comprehensive overview of how the terminal works, and is a good reference if you get stuck.

A Brief Overview of the Hadoop File System

Hadoop manages files with the Hadoop Distributed File System (commonly abbreviated as HDFS). This is a logical file system that overlays the actual file system used by the operating system. Since Hadoop runs on Linux, all files in HDFS are ultimately stored by Linux as regular Linux files. However, instead of storing a huge data set as just one (Linux) file, HDFS stores it as many smaller files, even though to the user it still looks like one big file. Since it’s much quicker to load and process many smaller files than one big one, HDFS lets you load and analyze large data sets much more quickly than a traditional database, without your needing to deal with all the smaller files yourself.

The drawback is that you can’t deal with the data set directly, either through the Linux terminal or a GUI such as KDE or Gnome. Instead, you must use the specialized commands provided by Hadoop. These are modeled after Linux terminal commands, and will be familiar if you have used the terminal before: for example, the ls command lists the files in a directory within HDFS. Instead of running the commands directly, however, you pass them as parameters to Hadoop’s hdfs command. The syntax is:

hdfs dfs -command [arguments]

Here, hdfs tells Linux that we’re working within Hadoop’s file system, while dfs means that we are issuing a disk file system command (technically, the command hdfs invokes the hdfs service, which lets you perform a wide range of file system and administration tasks). The -command parameter is a file system command such as ls, and the [arguments] are things like the directory to list, or the files to copy. For example, to list the files in the root directory on a Hadoop cluster, you can type:

hdfs dfs -ls /

At least, that’s how it works in theory: in practice, you usually need to run the hdfs command as another user who has the appropriate permissions on the Hadoop file system. By default, directories in HDFS are owned by the hdfs user: this is a Linux service account that Hadoop grants access to its file system. You can change directory ownership to another user, but during initial setup, you’ll need to run your Hadoop file system commands as the hdfs user. You do this with the Linux command sudo: for example, to list the files in the root directory on a Hadoop cluster with default permissions, you’ll need to type:

sudo -u hdfs hdfs dfs -ls /

Let’s step through this: the sudo -u hdfs part tells Linux to run the following command as the user hdfs. In this case, that command is also called hdfs, and we pass it the parameters dfs -ls / to tell it to list the root directory. This is admittedly confusing: just remember that hdfs is in the command string twice because it is both the name of the command and the name of the user running the command. Contrast this with listing a directory owned by the user cloudera:

sudo -u cloudera hdfs dfs -ls /

We’ll discuss how Hadoop permissions and Linux permissions interact in more detail later in this post. For now, I’ll just say that whenever anything goes wrong on a Hadoop cluster under my administration, the first thing I check is whether the executing process has the correct permissions. Also, you can find documentation on the Hadoop file system commands on this page:

Apache Hadoop File System Command Documentation

Step One: Moving Files From the Host Computer to VirtualBox

Before we can move files to the Hadoop cluster, we need to get them from the host computer into the guest machine running on Virtual Box. The most straightforward way to do this is to set up File Sharing in VirtualBox. To do this, first start the Quickstart VM, then open Machines | Settings | Shared Folders, and you’ll see this dialogue box:

Virtual Box File Sharing Dialogue Box 1

Click the folder icon that has a “+” on it (on the right border of the dialogue box) and browse to the folder that you want to share with the guest machine. I like to create a folder just for this purpose called VirtualBoxShare. After you’ve selected the folder, check the Auto-mount and Make Permanent boxes, then click OK. Here is what the dialogue box should look like.

Virtual Box File Sharing Dialogue Box 2

Now, even though you selected auto-mount, you still need to mount the shared folder yourself the first time. To do this, open a Terminal window and type the following command:

sudo VBoxControl sharedfolder list

This tells VirtualBox to list the folders on the host machine that are available for sharing; you should see the name of the folder you selected. This is the name that VirtualBox has assigned to this folder: you’ll use this name with the mount command, rather than the full Windows path (which is what I, at least, expected). To mount the folder, type the following commands:

mkdir ~/share
sudo mount -t vboxsf VirtualBoxShare ~/share

If all goes as planned, this should mount your selected folder (VirtualBoxShare in this example) in the directory ~/share. You should now be able to access files stored in the VirtualBoxShare folder on the host machine through the ~/share directory on the guest machine. You can test this by moving a file into the VirtualBoxShare folder, then listing the ~/share directory. Here is what this looks like on my machine.

Virtual Box Terminal Window Listing Shared File

Step Two: Moving Files from the Guest Machine to the Hadoop Cluster

Once file sharing is set up, you can access files from the guest machine within Virtual Box through the ~/share folder. The next step is to load them into the Hadoop cluster itself (recall that Hadoop uses its own file system that overlays the Linux file system). Hadoop provides the copyFromLocal command to do this: however, since we have access to Cloudera Manager, we can use it instead of the Terminal. First, though, we need to create a directory structure within the Hadoop file system and assign the correct file permissions, and that is best done from the terminal.

There are not just entire books, but entire collections of books on data warehouse design (Kimball is my personal favorite). Here, I’m going to present a simple architecture that can be easily extended to accommodate a wide variety of data. We will do all our work as the user cloudera, and store all our data in the /user/cloudera/datalake directory. This directory will have three subdirectories called ingest, stage, and final; in turn, each of these will have one subdirectory for each data set we are working with. Here is how this looks for the stockdata data set that we will work with later on:

Cloudera Quickstart Data Lake Directory Structure

As can be seen, the contents of the /user/cloudera/datalake/ingest/stockdata directory are the actual stock data files themselves. In this case, these files are named after the stock ticker symbol, and are *.csv files that contain daily market data on each stock. Cloudera Manager lets you view the files just by clicking on them: here is what AAPL (Apple) looks like:

Cloudera Quickstart Data Lake AAPL File Contents

From now on, we’ll refer to the directories in the cluster where we store our data as the data lake. The first step in building the data lake is to create the directory structure: for that, we’ll use the mkdir command provided by HDFS. Open a Terminal window and type the following:

sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/ingest
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/stage
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/final

This will create the basic architecture. We will copy the data from the guest machine (i.e. the Linux file system) into the ingest folder: this will be a straight copy, with no validation or transformations. From there, we’ll copy the data from the ingest directory to the stage directory, applying any data validation and transformations in the process. Finally, we will copy the data from the stage directory to the final directory: this will be the data used in production work, for reporting and analysis. Of course, we can also analyze the raw data itself: for example, we can build a machine learning model that uses features of the data set that we don’t need for reporting.

The data set we’ll be working with is called stockdata, so go ahead and create subdirectories for it under ingest, stage, and final.

sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/ingest/stockdata
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/stage/stockdata
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/final/stockdata

At this point, we have created the directories we need to load and transform the stock data. You can verify this using Cloudera Manager by opening the Hue service from the main screen, then clicking on Hue Web UI in the menu bar, followed by File Browser. This will open a GUI browser showing the folders in /user/cloudera. Navigation is as you’d expect: just click on a folder to see its contents.

There’s one more step before we can upload data into the Hadoop cluster. If you look at the directories you created, you’ll see that they are all owned by the user hdfs: this makes sense, because we created them as that user. However, if you try and upload data through the Hue File Browser, you’ll get a permissions error, because you are logged in as cloudera, and don’t have write access to the folder. There are a few different ways to resolve this, but first, let’s talk a bit about rights management in Hadoop and Linux.

A Brief Overview of Hadoop (and Linux) Permissions

Since Hadoop is a set of services that run on Linux, Hadoop permissions are ultimately resolved into Linux permissions; that said, for practical purposes, Hadoop and Linux are two distinct domains. You can have permission to access a file on Linux, but not be able to access it through HDFS (and vice versa). Furthermore, the Linux root user has no special privileges within the Hadoop domain: HDFS treats it just like any other user. This means that even logging in as root is not guaranteed to give you access to a file within HDFS.

Both Linux and HDFS set permissions on individual files (and directories, which are considered files within both domains). File permissions determine whether the file can be read, modified, or executed by the file’s owner, users that belong to the file’s group, and everyone else. In Linux, file permissions are set by the chmod command, and file ownership by the chown command; the corresponding commands on HDFS have the same names, but are passed as parameters to the hdfs command (just like other HDFS commands). To change the permissions of the file myFile so that it can be read by everyone, you type:

sudo -u hdfs hdfs dfs -chmod +r myFile

Similarly, to change the owner and group of the file to cloudera, you type:

sudo -u hdfs hdfs dfs -chown cloudera:cloudera myFile

For the exact syntax, you can refer to this page on Understanding Linux File Permissions.

The challenge with Hadoop permissions is that most file system commands you execute through the console need to run under the hdfs user: however, you should run reporting and analysis jobs as another user (such as cloudera) that has a login shell and password. One solution is to give cloudera access to the supergroup group, which is the group that files owned by HDFS belong to by default. To do this, you need to create supergroup as a Linux group, then add cloudera to it. This will work, but I prefer to change the ownership of files that will be used by cloudera to that user. For our purposes, this includes every file stored in the /user/cloudera/datalake directory (i.e. the data lake proper). You can do this with the command:

sudo -u hdfs hdfs dfs -chown -R cloudera:cloudera /user/cloudera/datalake

This is just like the chown command shown earlier, with one crucial difference: the -R switch will change the ownership for all files in the directory and its subdirectories. After you run it, you will be able to access and analyze all files in the data lake as the user cloudera from both the Terminal and the HUE File Manager.

Hadoop permissions don’t stop there: Cloudera Manager has its own permissions that are separate from both Linux and HDFS. Rather than being set on individual files, these are user-based, similar to database permissions that restrict which users can create and modify the data. By default, cloudera has Full Administrator permissions, so you shouldn’t need to modify these for this series of posts: however, you can learn more about them in Cloudera’s documentation if you are interested.

Loading Data into the Cluster with the HUE File Manager

Finally, you can load data into the data lake using the HUE File Manager. To do this, open the Hue service from the Cloudera Manager main screen, then click on Hue Web UI in the menu bar, followed by File Browser. After you’ve done that, click the Upload button in the upper-right corner to open the following dialogue box:

Cloudera Quickstart HUE File Upload Dialogue Box

Navigate to the VirtualBoxShare folder, select the file, and voila!


At this point, you should have a working data lake and the means to load data into it. However, the data isn’t in a form that’s easy to work with. Also, it would be nice to have an established workflow to load large data sets consisting of several hundred files into the data lake, rather than having to load each file with HUE.

We can solve the first problem by creating a Hive table that will let us query the data using a variant of SQL; for the second, we can use Talend Open Studio’s excellent data ingestion components. The next post will show how to do this with data on stock closing prices, and provide a framework that you can adapt to load, transform, and analyze other data sets.

If you don’t have a working data lake–that is, if these steps aren’t working for you, or you’re running into a problem you can’t resolve–feel free to use the form below to let me know, and I’ll do my best to help if I can.

Things not working as they should? Let me know about it!

Connecting Talend Open Studio to Cloudera Quickstart


Cloudera’s Quickstart virtual machine is an excellent way to try out Hadoop without going through the hassle of configuring the packages yourself: it’s like the “live CD” of the Hadoop world. Even in real-world scenarios, the freedom to develop within a virtual environment streamlines the development process and makes it safe to experiment: if something goes wrong, you can just revert to the last snapshot where things actually worked.

Talend’s products are well-suited for designing ETL workflows between Hadoop and related technologies, and Talend provides a free version of its flagship product, Talend Open Studio. In this post, I’ll show you how to set up Talend Open Studio to access the Cloudera Quickstart virtual Hadoop cluster.

Before You Begin

For this project, you’ll need enough memory to run Talend Open Studio, VirtualBox, and the Cloudera VM simultaneously. You’ll need to assign the VM at least 8 GB of RAM to do anything useful with it: Open Studio and VirtualBox will take another 1 GB simply to load, and on my Windows 10 laptop, Windows itself seems to take another 1 GB on average. Realistically, you’ll need 16 GB or more of memory to make this work.

Download and Install the Software

You’ll need to download and install the Cloudera Quickstart VM, Talend Open Studio, and VirtualBox. Open Studio and Virtual box are Windows applications, while the Quickstart VM is a virtual machine definition that you will import into VirtualBox. The following links are current as of this post:

VirtualBox has an installer, so just run through it, keeping the defaults unless you have good reason to change them.

The Talend Open Studio download is actually an archive of the entire program: you “install” it by unzipping the archive into whatever directory you want the program to run from. I recommend making a c:\Talend directory and extracting the files there. Do not put it in c:\Program Files or another one of Windows’s privileged folders, because Open Studio relies heavily on third-party libraries, which it downloads and installs as needed. Also, it’s difficult to move the Talend workspace directory after starting your first project, so make sure wherever you extract the program is where you want it to stay.

The Cloudera Quickstart VM is a fully-functional version of CentOS with standard Hadoop packages installed. It also has a version of Cloudera Manager, which you can use to manage the cluster. Cloudera provides the VM for VirtualBox as an *.ovf file, so you don’t need to create a new virtual machine in VirtualBox. Instead, double-click the file, and VirtualBox will open it and present you with a list of configuration settings: change the number of processor cores to 2, and increase the RAM to 8 GB (8192 MB); leave everything else the same.

Connecting Open Studio to the Quickstart VM

Now, open VirtualBox, select the Quickstart VM, and power it on. Then open Talend Open Studio and navigate to Metadata | Hadoop Cluster, then right-click and select Create Hadoop Cluster. You will see the following dialogue box: enter a name for your cluster, fight down the rush of guilt you feel from leaving the Purpose blank, then click Next.

Figure 1: New Hadoop Cluster Connection on repository – Step 1/2

Next, Talend will present you with different ways to enter the cluster’s configuration information. We will import them from the cluster itself, so select “Cloudera” for the Distribution and leave the Version unchanged. Then, set the Option radio button to “Retrieve configuration from Ambari or Cloudera”: this tells Open Studio to connect to the cluster, instead of you needing to enter the configuration settings by hand. Here is what the dialogue box should look like:

Figure 2: Hadoop Configuration Import Wizard with Correct Selections

Click Next and enter the following settings as Manager credentials:

  • Manager URI (with port): http://quickstart.cloudera:7180/
  • Username: cloudera
  • Password: cloudera

Leave the other fields unchanged, then click Connect, and Open Studio will try to connect to the Quickstart VM: if it is successful, it will show “Cloudera Quickstart” in the Discovered clusters list box. Click Fetch and Open Studio will retrieve the configuration settings for HDFS, YARN, and other Hadoop services. Here is what the dialogue box will look like if this process is successful:

Figure 3: Configuration settings for Cloudera Quickstart VM.

Click Finish to see the settings that Talend Open Studio has imported in its metadata for the Cloudera Quickstart VM. These should all be correct. Under Authentication, be sure the User name is set to cloudera, so the dialogue box looks like this.

Figure 4. Imported Quickstart VM Configuration Settings

This last part is a bit tricky: click the Check Services button, and Open Studio will try to connect to the Quickstart cluster using the configuration settings it just imported. However, it won’t be able to connect. This is because, while the settings are correct, the host machine (i.e. Windows) can’t communicate directly with the guest virtual machine (i.e. the Quickstart VM): instead, the host communicates over a network protocol, just as if the guest were a physically separate machine. VirtualBox mediates this communication by listening on certain IP addresses on the host and routing traffic between the host and guest as needed.

By default, it listens on localhost (i.e., so we need to modify the Windows hosts file to associate the name quickstart.cloudera with the address This tells Windows to send all networking requests for quickstart.cloudera to the localhost address, which is the address VirtualBox uses to communicate between running virtual machines and the host computer. Here are instructions on how to edit the hosts file on different versions of Windows:

Modify Your Hosts File

After you save the modified file, click on Check Services again. This time, you should see the following dialogue box, indicating that Open Studio can communicate with the Quickstart VM.

Figure 5: Open Studio able to Communicate with Quickstart VM


At this point, you should be able to write jobs in Open Studio using the metadata for the Cloudera Quickstart VM. If you run into errors, follow the usual troubleshooting sequence (e.g. check basic connectivity, firewall rules, etc.); you can also contact me using the form below, and I’ll do my best to help.

Once things are up and running, you need to create a data lake and start ingesting files. The next tutorial shows just how to do that.

Things not working as they should? Let me know about it!

The Naive Bayes Algorithm: Step by Step


[mathjax]The naive Bayes algorithm is fundamental to machine learning, combining good predictive power with a compact representation of the predictors. In many cases, naive Bayes is a good initial choice to estimate a baseline model, which can then be refined or compared to more sophisticated models.

In this post, I’m going to step through the naive Bayes algorithm using standard R functions. The goal is not to come up with production-level code (R already has that), but to use R for the “grunt work” needed to implement the algorithm. This lets us focus on how naive Bayes makes predictions, rather than the mechanics of probability calculations.

Files Used in This Post

You can download the complete script at the following address.

Files for post “The Naive Bayes Algorithm: Step by Step”

The Idea Behind the Algorithm

We start with a standard machine learning problem: given a data set of features and targets, how can we predict the target of a new query based on its features? Assume we have:

  1. A list of customers
  2. Measures of their engagement with our website (each with the value Low, Moderate, or High)
  3. Whether each customer signed up for Premium service

In classical statistics, we treat the values of the measurements as evidence, and ask whether there are more customers with specific values who signed up for Premium service, or who did not subscribe. For example, assume the measures of engagement are (Low, Low, Moderate), and that 4 customers with those values signed up for Premium service, while 10 did not. We would predict a customer with that level of engagement would not subscribe.

One complication is that relatively few customers sign up compared to those who don’t: given the same measurements, we’re less likely to predict a customer will sign up simply because our data set has fewer cases where one did. Bayes’ Theorem mitigates this by weighting the measurements with the unconditional probability that the customer did or did not sign up: that is, the probability that a customer subscribed, regardless of their engagement.
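
Formally, this weighting is just Bayes’ Theorem. Writing \(E\) for the observed engagement measures:

$$ P(\text{subscribe} \mid E) = \frac{P(E \mid \text{subscribe})\,P(\text{subscribe})}{P(E)} $$

Since \(P(E)\) is the same whether or not the customer subscribes, we only need to compare the numerators for the two outcomes.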

This revision leads to better predictions; however, it requires us to calculate the relationships among all the measures, as well as between each measure and the targets. In real data sets, this isn’t feasible: instead, we assume the measures are independent of one another, and just calculate the relationship between the measures and the target. This is the naive Bayes model, so called because the assumption of independent measures is made whether or not it is literally true.
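
In symbols, with \(t\) ranging over the target levels and \(q_1, \ldots, q_n\) the feature values of a query, the naive Bayes prediction is:

$$ \hat{t} = \underset{t}{\operatorname{arg\,max}} \; P(t) \prod_{i=1}^{n} P(q_i \mid t) $$

The independence assumption is what lets the joint conditional probability collapse into the product on the right.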

A Step-By-Step Example

For this example, we assume that an accounting firm wants to predict the audit risk of a client based upon the characteristics of their tax return. There are four characteristics of interest, listed below with their possible values.

  1. Percentage of Business that is Cash-Based: “Less than 25%”, “Between 25% & 50%”, “More than 50%”
  2. Number of Wire Transfers: “Zero”, “Less than 5”, “5 or More”
  3. Home Office: “Yes”, “No”
  4. Amount of Tax Owed: “Less than $1,000”, “Between $1,000 and $5,000”, “More than $5,000”

The target (Audit Risk) can be Low, Moderate, or High. We want to develop a naive Bayesian model that predicts the Audit Risk for a generic client based upon these characteristics.


We start by removing all objects from the workspace, then creating the data set and assigning factor levels to make it easier to interpret.

# NaiveBayesStepByStep.R
# Illustrates the calculations used by the naive Bayes algorithm to predict a query target level given a training data set
# c. 2017 David Schwab

# Preliminaries

rm(list=ls())        # Remove all objects from the workspace
options(digits=3)    # This makes the numbers display better; feel free to change it.

# Construct the data set <- factor(c(1,1,1,2,2,1,2,2,1,3,3,3,3,2,3,3,1,1,2))
wire.transfers <- factor(c(1,1,2,1,2,1,2,2,1,3,3,3,3,2,3,3,1,1,2)) <- factor(c(1,1,1,1,2,2,1,2,2,2,2,1,1,1,2,1,2,1,2))
tax.owed <- factor(c(1,1,1,2,2,2,1,1,2,3,3,3,3,1,3,3,2,1,1))
audit.risk <- factor(c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,1,2,3,2,1))

# Add the factor levels and make the data frame

levels( <- c("Less than 25%", "Between 25% & 50%", "More than 50%")
levels(wire.transfers) <- c("Zero", "Less than 5", "5 or More")
levels( <- c("Yes", "No")
levels(tax.owed) <- c("Less than $1,000", "Between $1,000 and $5,000", "More than $5,000")
levels(audit.risk) <- c("Low", "Moderate", "High") <- data.frame(,wire.transfers,,tax.owed,audit.risk)

Calculate Conditional Probabilities for Each Feature

Next, we calculate the conditional probability that each feature is associated with each target. There are 9+9+6+9=33 separate probabilities, so we will let R do the calculation using table() and apply().

To start, we count the number of times each target level occurs and store it as audit.risk.targets. We use table() to get the counts, then cast them to numeric.

Next, we call apply() with an in-line user-defined function: this function creates a contingency table between one of the features and the target. We transpose the table and divide it by audit.risk.targets, which R recycles for each row. The result, audit.risk.cond.prob, is a list of the conditional probabilities that each feature takes each target level (for housekeeping, we delete the final element, which has audit.risk in both rows and columns).

# Next, count the instances of each target level

audit.risk.targets <- as.numeric(table($audit.risk))

# Now, calculate the conditional probabilities needed to predict a target with the naive Bayes model

audit.risk.cond.prob <- apply(, 2, function(x){
  t(table(x,$audit.risk)) / audit.risk.targets

audit.risk.cond.prob$audit.risk <- NULL       # Remove extraneous data

Calculate Prior Probabilities for Each Target Level

All that’s left is to calculate the prior probabilities for each target level: this is just the count of each level divided by the total number of data points. We also create a separate display variable for easier interpretation.

# Calculate the target priors

audit.risk.priors <- audit.risk.targets / nrow(
audit.risk.priors.display <- data.frame(target=c("Low", "Moderate", "High"),priors=audit.risk.priors)

Using the Model to Make a Prediction

To make a prediction, we display both the conditional and prior probabilities. Here is how they will look: note that we display each feature separately to arrange the columns in the correct order, using the levels defined earlier.

> audit.risk.priors.display

    target priors
1      Low  0.316
2 Moderate  0.368
3     High  0.316

> audit.risk.cond.prob$[,levels(]
           Less than 25% Between 25% & 50% More than 50%
  Low              0.500             0.333         0.167
  Moderate         0.429             0.429         0.143
  High             0.167             0.167         0.667

> audit.risk.cond.prob$wire.transfers[,levels(wire.transfers)]
            Zero Less than 5 5 or More
  Low      0.500       0.333     0.167
  Moderate 0.429       0.429     0.143
  High     0.167       0.167     0.667

> audit.risk.cond.prob$[,levels(]
             Yes    No
  Low      0.667 0.333
  Moderate 0.429 0.571
  High     0.500 0.500

> audit.risk.cond.prob$tax.owed[,levels(tax.owed)]
           Less than $1,000 Between $1,000 and $5,000 More than $5,000
  Low                 0.667                     0.167            0.167
  Moderate            0.429                     0.429            0.143
  High                0.167                     0.167            0.667

To make a prediction, consider the query q=(Between 25% & 50%, Less than 5, No, Less than $1,000). We need to calculate the naive Bayes estimate for each target level; the prediction is the target level with the greatest value.

By inspection, we can see that when the target level is Low, the conditional probabilities are .333, .333, .333, and .667: the joint conditional probability is their product, which is 0.025. The prior probability of Low is 0.316, so multiplying the product by this prior gives a naive Bayes estimate of 0.008. The corresponding estimates for Moderate and High are 0.017 and 0.001, respectively. Moderate has the greatest value, so that is the naive Bayes prediction for this query.
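
Written out for the Low target level, the arithmetic is:

$$ 0.333 \times 0.333 \times 0.333 \times 0.667 \approx 0.025, \qquad 0.025 \times 0.316 \approx 0.008 $$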

NOTE: In this example, the naive Bayes estimates are not the actual posterior probabilities of each target level; however, their relative ranking is identical to that of the actual probabilities, so we can be confident that Moderate is the correct prediction.


As we can see, the naive Bayes algorithm allows a complex data set to be represented by relatively few predictors. It also performs well in many different applications, and has the intuitive appeal of predicting the most probable target level based on the relative frequency of the different target levels, as well as the set of features. For production work, the e1071 package provides the naiveBayes() function: you can find a good overview of its use (as well as more detail on the theory behind the algorithm) here.

Troubleshooting at a standstill? Feel free to reach out!

Using R to Select an Optimum Stock Portfolio, Part I: Preliminaries



[mathjax]The portfolio selection problem is to decide which combination of securities you should hold, given your investment objectives and tolerance for risk. As with investing in general, the trade-off is between risk and return: the more diversified your portfolio (i.e. the more securities you hold), the more insulated you are from large decreases in any particular security. On the other hand, the more securities you hold, the closer your average return will be to the market as a whole: substantial increases in some securities will be offset by decreases in most others.

Script Description

The following R script uses Modern Portfolio Theory to provide one answer to this question. Here’s how it works:

  • First, the script loads daily stock price data for all stocks in the S&P 500 index, and the index itself. It trims extreme values and verifies that the stock returns are normally distributed, which is required by the model.
  • Next, it calculates the relationship between each stock and the market as a whole by regressing daily stock returns on daily index values. The coefficient Beta (\beta) is a measure of how much the return on each stock varies with movements in the overall market.
  • Third, it calculates the excess return to Beta: this measures how much additional return the stock provides relative to its risk. The greater the excess return to Beta, the greater the potential reward of holding that stock, compared to the overall risk of the market as a whole.
  • Finally, the script estimates a cutoff value based on these excess returns: in this model, the optimum portfolio contains all stocks with an excess return to Beta greater than this cutoff value.
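
In symbols (this is the standard single-index-model formulation; the script’s exact estimators may differ in detail), each stock’s daily return \(R_i\) is regressed on the market return \(R_m\), and the excess return to Beta then compares the stock’s mean return \(\bar{R}_i\) to the risk-free rate \(R_f\):

$$ R_i = \alpha_i + \beta_i R_m + \epsilon_i, \qquad \text{excess return to Beta} = \frac{\bar{R}_i - R_f}{\beta_i} $$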

Files Used in This Post

You can download all files referenced here at the following address. This includes the full R script, both shell scripts, and the list of stocks; I’ve also included an example of the risk-free rate calculation.

Files for post “Using R to Select an Optimum Stock Portfolio with Modern Portfolio Theory”

Now, let’s get started!

Before You Begin

Before you can run the script, you need to do the following:

  1. Download daily prices for all stocks in the S&P 500 index
  2. Download daily S&P 500 index values
  3. Calculate the risk-free rate

Download Daily Stock Prices

The R script uses the daily closing price of each stock in the S&P 500 index, so we need to provide it with that data. To do this, run the shell script, which reads as follows:


#!/bin/bash

filename="$1"
directory="$2"

if [ -z "$2" ]
    then directory="."
fi

# Download each symbol's price history from Yahoo's historical-data CSV endpoint
while read -r symbol; do
    wget "$symbol" -P "$directory"
done < "$filename"

You run this from the terminal as follows:

./ StockList StockData

Here, StockList is just a list of the stock symbols that make up the S&P 500, and StockData is the download directory (or current directory, if not specified). This will download daily closing prices for each stock to a separate file in the download directory. The files will look like this:


The upper-case letters are the stock symbols: for example, “A” is the symbol for Agilent Technologies, Inc. The “table.csv?s=” prefix is a leftover from how Yahoo stores their stock data; we can remove it with the RenameStockFiles script, which looks like this:

#!/bin/bash

directory="$1"

if [ -z "$1" ]
    then directory="."
fi

for file in "$directory"/table.csv?s=*; do
    mv "$file" "${file/table.csv?s=/}"
done
You run this from the terminal as follows:

./ StockData

This will strip out the table prefixes so that your directory contains one file for each stock.
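The rename relies on Bash’s pattern-substitution expansion, ${var/pattern/replacement}, which replaces the first match of pattern; with an empty replacement it simply deletes the prefix. (Note that the ? in the pattern is a glob wildcard, which here happens to match the literal ? in the filename.) A quick sketch of the behavior:

```shell
# ${var/pattern/} deletes the first occurrence of pattern in var
file="StockData/table.csv?s=AAPL"
echo "${file/table.csv?s=/}"    # prints StockData/AAPL
```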

Download S&P 500 Index Values

The main part of the R script estimates the correlation between the closing price of each stock and the closing value of the S&P 500 index on the same day. You can download the index values directly from Yahoo! Finance at S&P 500 Index Historical Values. Set the date range from January 1, 2016 to today’s date, and make sure you select Daily Returns. Download the file to your stock data directory and rename it “SPX”.

Calculate the Risk-Free Rate

Since we are considering the relative risk of owning stocks compared to other investments, we need to know what return we could expect from an asset without any risk. Of course, there’s no such thing, but the yield on the 10-year Treasury bond is often used as a good approximation, since it’s backed by the full faith and credit of the United States government.

You can find the daily yield rates on the U.S. Treasury Department’s website at Daily Treasury Yield Curve Rates. You’ll need to highlight the rates with your mouse and copy them to a spreadsheet, and you’ll need to do this separately for both 2016 and 2017 data (as of this post, the XML download feature was broken). In the spreadsheet, calculate the geometric mean of the 10-year rate using the GEOMEAN() function: this will be the risk-free rate.
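
For reference, the geometric mean that GEOMEAN() computes for \(n\) daily yields \(r_1, \ldots, r_n\) is:

$$ \bar{r}_{\text{geom}} = \left( \prod_{i=1}^{n} r_i \right)^{1/n} $$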

Next Steps

Now that the preliminaries are done, we can move on to selecting an optimum portfolio, as described on this page:

Selecting the Stocks

If something isn’t working, please fill out the contact form below, and I’ll do my best to help.

Troubleshooting at a Standstill? Feel Free to Reach Out: