The Ethical Minefields of Technology

I whole-heartedly agree with the author that we need to seriously consider the ethical implications of all this new technology, especially when the window in which someone can “opt-out” of adopting it keeps shrinking. Personally, I’m not interested in owning a self-driving car–I actually like the experience of driving–but eventually, I’m not going to have a choice.

https://blogs.scientificamerican.com/observations/the-ethical-minefields-of-technology/

What Statistics Can Learn From Data Science (and Vice Versa)

Since its inception around the turn of the 20th century, classical statistics has been the standard tool for analyzing data sets. In general, the focus has been on analyzing a sample of the data, then generalizing the findings to the entire population. This was born of necessity: until very recently, the technology didn't exist to allow people to analyze entire data sets containing millions of data points.

Because only a sample of the data is analyzed, statisticians spend a great deal of time ensuring that the assumptions allowing the sample results to be generalized are met. At least, they try to do so: in practice, statistical modelling is an exercise in how many assumptions you can violate without compromising your analysis; as a result, the methods that get used are not necessarily the most powerful, but the ones that are most robust to these violations.

On the other hand, machine learning models are typically validated by testing model performance against a holdout sample of data points that aren't used to estimate the model's parameters. Models that don't perform well on the holdout sample are discarded. This helps guard against overfitting, because the model will only perform well if it contains those features that have predictive value for the entire data population.

To the extent possible, researchers using classical statistics should also use this type of external validation. This is true even when model assumptions seem to be met, as the data set itself represents just one point in time. When the number of data points is small, cross-validation methods, which test a series of models against a very small holdout sample (often a single case), can be used. Since these types of models are often parsimonious, they might also be validated against similar data sets, with the aim of including only those features that maintain their predictive power across each data set tested.
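As a rough illustration of both approaches (using simulated data and scikit-learn, not any particular study), a holdout check and leave-one-out cross-validation might look like this:

    # A minimal sketch of holdout and leave-one-out validation on simulated data.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 3))                      # a small sample: 40 cases, 3 features
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=40)

    # Holdout validation: fit on the training split, score on the unseen 25%.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    holdout_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
    print("holdout R^2:", holdout_r2)

    # Leave-one-out cross-validation: every case serves once as a one-point holdout.
    scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    print("leave-one-out MSE:", -scores.mean())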

In addition, researchers using machine learning should verify their model assumptions are met. This can be easy to overlook, because the assumptions are often less stringent than in classical statistics. It’s also tempting to focus on the model’s performance against the holdout sample as “proof” that it is correct: this is a potentially serious error, as it may not reveal systematic deviations from the assumptions that exist in both the training and validation data.

The foundation of classical statistics is largely due to Sir Ronald Fisher's early agricultural experiments, conducted in the early 20th century. While his methods continue to be invaluable, it's important to remember that they were designed for a world where the analysis was conducted by hand on a small set of data points. That's not usually the case today: as a result, it's critical to draw from the best insights of both classical statistics and contemporary analytics to develop accurate and reliable models for today's world.

Should We Let the Government Leverage Its Data to Enact Better Policy?

When I was in graduate school, we debated the merits of a centralized data store that policy makers could use to make better decisions; ultimately, we decided the risks to privacy outweighed the benefits.

Data collected by (ethical) businesses is de-identified, typically by assigning each case an arbitrary number. Government data isn't, although as the article below points out, it could be. The bigger concern is that, unlike private organizations, the government can detain, arrest, and even execute people. On the one hand, none of these things happen without due process; on the other, power corrupts, and–what is far more worrying–people make mistakes. Are we willing to accept that?

We might be, if it actually does lead to better policy: having worked for the government, I can tell you that we routinely made decisions on what I’ll call sparse information. Several times, I had to request data from another state agency, and each time we had to draft an agreement specifying precisely what my agency could do with it. And there’s no data standardization across agencies, so sometimes after going through all this, I wasn’t able to merge the two data sets.

Which raises another issue: to make this work, each agency would have to use the same data semantics, file structure, and database application. Even in the ideal case, where everyone can agree on a common data dictionary, each agency’s ability to contribute data will be limited by its own architecture. And, as the second link makes clear, things are usually not ideal.

Let’s Use Government Data to Make Better Policy

Investigation Reveals a Military Payroll Rife With Glitches

Artificial Intelligence and Organizational Change

Here is an interesting take on the evolving distribution of work between humans and artificial intelligence. We need to begin dealing with AI as an actual form of intelligence, of a different nature than ours, and with its own strengths and weaknesses. True, machines may not actually think (yet), but for specialized tasks such as medical diagnosis, they are beginning to outperform the experts who do. Yet, unlike human experts, we have no way to judge the machine's credibility: as the article notes, that will take fundamental changes in how businesses organize and complete their work.

Artificial Intelligence: The Gap Between Promise and Practice

Why We Will Continue to See Catastrophic Software Failures

This article does a good job summarizing why catastrophic software failures are inevitable given our current development practices. There’s an old engineering maxim that once a system becomes sufficiently complex, it will eventually fail no matter how much testing you do. This is why critical systems, such as medical equipment, are designed to “fail safe”.

Unfortunately, almost every application today depends on libraries developed by programmers who may be halfway across the world, which makes designing fail-safe software extremely challenging. Until we address this coordination problem, catastrophic failures like those mentioned in the article will continue to be a real danger.

The Coming Software Apocalypse

Machine Learning: Is “Off-the-Shelf” Software Enough?

As more and more businesses use machine learning to inform their decisions, I'm often asked whether "off-the-shelf" software can really take the place of custom solutions.

To answer this, it’s important to distinguish between software that analyzes the data (e.g. estimates a decision tree, or other machine learning model) and software that attempts to interpret the data. In most cases, you don’t need a custom solution to implement the most common algorithms: software such as Talend’s enterprise product can do that with minimal hassle. For those committed to open-source solutions, the R statistical program does more than you’ll ever need it to, albeit with a rather high learning curve (you can see what R makes possible, and the technical skill needed to use it, on this page).

However, while there’s no need to program your own Bayesian classifier from scratch, you’re still likely going to need an experienced data scientist to interpret the data. The reason is that computer programs are ultimately just lists of instructions: they can’t actually “think”. Because of this, they don’t cope well with the unusual, or the unexpected.

For example, I built a model to estimate house prices from property data such as square footage, age, and condition. Overall, it was a good fit, explaining about 85% of the variation in the data: however, there were certain properties that were assigned negative prices, so something was clearly wrong.

After drilling down into the data, I found that these properties were listed as having zero square footage. This was because they were condominiums, and in this data set, the square footage for all units was assigned to the overall building, not the individual units. There wasn’t any way to fix this, so I ended up excluding those cases and building a separate model for condominiums. If I hadn’t understood how the model worked, I wouldn’t have known where to look to find the source of the problem, and I wouldn’t have known to use a different type of model for the other cases.
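As a sketch of what that kind of drill-down can look like (with a hypothetical file name and column names, not the original assessment data):

    # Hypothetical example: separate suspect records before modelling house prices.
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    homes = pd.read_csv("property_data.csv")          # assumed columns used below
    features = ["square_feet", "age", "condition_score"]

    # Records with zero square footage (the condo units) will distort the fit.
    condos = homes[homes["square_feet"] == 0]
    houses = homes[homes["square_feet"] > 0]

    model = LinearRegression().fit(houses[features], houses["sale_price"])
    print("R^2 on single-family homes:", model.score(houses[features], houses["sale_price"]))
    print(len(condos), "records set aside for a separate condominium model")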

To summarize, you’ll get the most return on your investment by paying for an experienced analyst to work with one of the off-the-shelf software packages. This gives you the best of both worlds: a proven modelling engine to crunch the numbers, and a skilled professional to interpret them.

Hadoop Basics: What is It, and How Does it Work?

Today, organizations can centralize their data with Hadoop using familiar tools such as SQL and Python. However, it’s hard to take the first step unless you understand how Hadoop works. In this post, I’ll recount Hadoop’s origin story and the problem it was designed to solve.  After that, I’ll discuss its two main parts:

  • The Hadoop Distributed File System, which stores large data sets
  • The MapReduce algorithm, which quickly analyzes them

After reading this post, you’ll have a better understanding of how Hadoop works, and the types of problems it is designed to solve. You can read more about these problems (and how they can damage your business) on this page. You can also arrange a free consultation to discuss your specific needs, and design a roadmap to meet them.

Why It’s Called Hadoop (and Hive, and Pig, and So On)

Hadoop is the name of a stuffed elephant that was much beloved by the son of Doug Cutting, one of Hadoop's earliest designers. You can see both Mr. Cutting and the toy elephant here. Programmers like in-jokes, and animal references have been a recurrent theme since the early days of IBM mainframes. Because of this, Hadoop's symbol is an elephant, and its components often refer to animals:

  • Hive, which lets you query Hadoop like a traditional database, using SQL
  • Pig, which analyzes unstructured data, such as comments on a web forum (so named because "pigs eat anything").
  • Mahout, which builds predictive models based on Hadoop (named for the Hindi word for "elephant trainer").

If this seems too cute, think of it in the spirit of the dozens of newly discovered species named after Far Side cartoonist Gary Larson (and if it's not too cute, be sure to check this out).

Why Hadoop Was Invented

Hadoop began with a Google white paper on how to index the World Wide Web to provide better search results. Today, Google is so dominant that we forget Yahoo!, AltaVista, ChaCha (!), and AskJeeves (!!!) were once genuine competition. To get there, though, Google had to catalogue millions of websites and direct users to them in real time.

In doing so, Google quickly learned that databases like Oracle and SQL Server were too slow. To see why, imagine opening a very large Excel spreadsheet: it takes so long you may question your decision to open it at all. This is because large files have long loading times. Hadoop was invented to load and analyze large data sets (like an index of all web sites) that traditional databases can't handle.

There are two parts to Hadoop: the Hadoop file system, and the MapReduce algorithm. We’ll review each in turn.

How Hadoop Works, Part I: The File System

One solution to long loading times is to store data as many smaller files: for example, keeping each year of customer data in a separate file. This can work, but it also breaks up your data set. Imagine if we could store the data as many small files, but work with it as one large file.

This is what Hadoop’s Distributed File System (HDFS) does. The user sees one file, but “behind the scenes” HDFS stores the data as hundreds of smaller files. It also does all the bookkeeping, so that the only difference to the user is that the entire data set loads much faster. See the following graphic:

this level of happiness is typical of the Hadoop user

In technical terms, HDFS overlays the operating system’s own file system. Since Hadoop runs on Linux, the actual data is stored as a group of Linux files. We call HDFS a logical file system, because it looks like the data set is stored in one large file even when it is actually spread out among many different files. This lets us treat the smaller files as a single logical unit, even though the operating system doesn’t “see” them this way.

Note that this isn't a new idea: computers already store larger files as a sequence of multiple blocks. Sometimes these blocks are next to each other, but often they are split into several groups, with the last block in one group pointing to the first block in another. This is done to use as much of the disk as possible, by filling in the "holes" left after files are deleted with newly created data. As with HDFS, the user doesn't see any of this, because the operating system (e.g. Windows) keeps track of the blocks belonging to each file, and presents them to the user as a single file. So, HDFS is really just a proven idea applied at a much larger scale.

How Hadoop Works, Part II: The MapReduce Algorithm

After loading the data, we want to do things with it. For example, Google’s search needs to find the best websites for each user: how is this possible when the data is split into hundreds of small files?

The answer is Hadoop’s MapReduce algorithm. Imagine your data is spread out in hundreds of small files; furthermore, you don’t know which file stores which data. Because of this, you must analyze all the files: for example, Google might search every file for the right web site and stop when the site is found.

You can’t just search one file at a time, however: that takes too long. Instead, you load every file into memory, then search them all simultaneously: the files are small, so they load quickly, and you’re searching them all at once, so you should find the answer quickly. That’s how MapReduce works. See the following graphic:

it took longer to find a good icon for MapReduce than to write this post

Here, the user wants to search a file in HDFS for a customer's order history. There are billions of orders, so HDFS stores them as thousands of smaller files. Since the order could be in any one of the files, MapReduce searches them all simultaneously: this ensures that the order is found while capping the total search time at the time it takes to search one of the smaller files. As the figure shows, the data can also be partitioned so that only some of the smaller files are actually searched, which frees up processor cycles for other tasks.

Technically, this type of search is called a map: the user defines a function and applies it to a data set to get a new data set. Here, the function just returns TRUE if the record contains the name the user is searching for, and FALSE otherwise. However, it could also do other things, such as returning a 1 for every word in a document: then, the new data set would have two columns, one containing every word in the document, and another containing a series of ones:

apple 1
avocado 1
banana 1
banana 1
date 1

We could pass this file to another function, called a reducer, that totals the second column for each word. The output would count how many times each word appeared in the document:

apple 1
avocado 1
banana 2
date 1
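To make this concrete, here is a rough Python sketch of the word-count mapper and reducer described above (an illustration of the idea only; a real Hadoop job would run this logic through Hadoop Streaming or a similar tool rather than as plain functions):

    # Word count as a map step followed by a reduce step (illustration only).
    from collections import defaultdict

    def mapper(line):
        """Emit (word, 1) for every word in a line of text."""
        return [(word.lower(), 1) for word in line.split()]

    def reducer(pairs):
        """Total the counts for each word."""
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    document = ["apple avocado banana", "banana date"]
    mapped = [pair for line in document for pair in mapper(line)]
    print(reducer(mapped))   # {'apple': 1, 'avocado': 1, 'banana': 2, 'date': 1}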

By using both a mapper and a reducer, Hadoop’s MapReduce algorithm can do almost any type of analysis or calculation. However, this type of program can be challenging to write. For example, assume we have two files of customer names. We want to sort all six names in alphabetical order:

We can sort each file by itself, but we can’t just append one to the other and end up with a correct sort of all six names. There’s a similar problem with averaging values stored in multiple files:

We can find the average for each file, then the average of the averages, but it won’t be the right answer:

    • File 01 Average: 8
    • File 02 Average: 20
    • Average of Averages: 14
    • Average of All 5 Values: 15.2

(The two results differ because the files hold different numbers of values: the overall average is a weighted average of the per-file averages, weighted by how many values each file contains.)
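Here is a small Python sketch of the same problem, using made-up values chosen to match the averages above; the trick is for each "mapper" to emit a (sum, count) pair and for the "reducer" to combine those pairs:

    # Averaging across files: combine (sum, count) pairs, not per-file averages.
    file_01 = [6, 10]            # hypothetical values, average 8
    file_02 = [18, 20, 22]       # hypothetical values, average 20

    # Wrong: the unweighted average of the per-file averages.
    naive = (sum(file_01) / len(file_01) + sum(file_02) / len(file_02)) / 2
    print(naive)                 # 14.0

    # Right: each "mapper" emits (sum, count); the "reducer" adds them up.
    partials = [(sum(f), len(f)) for f in (file_01, file_02)]
    total, count = map(sum, zip(*partials))
    print(total / count)         # 15.2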

This poses a challenge, because most programmers aren’t mathematicians. The good news is that there are tools to generate MapReduce code from more familiar languages, such as SQL: I’ll explore these in a future post. In the meantime, if you have any questions or comments on this post, please fill out the contact form below, and I’ll do my best to answer them.

Thoughts on this post? Tell me all about it.

Cloudera Quickstart: Building a Basic Data Lake and Importing Files

In an earlier post, I described how to install Cloudera’s Quickstart virtual Hadoop cluster and connect to it with Talend Open Studio:

Connecting Talend Open Studio to Cloudera Quickstart

Now, I’ll describe how to configure your cluster so you can easily load data into it. In a future post, we’ll use Talend Open Studio to import stock market data and prepare it for later analysis.

First, we need to transfer the data from your host computer into the guest machine running inside Virtual Box: we’ll enable File Sharing in Virtual Box to do this. Once that’s done, we can import the data into the virtual cluster itself. Along the way, we’ll create directories within the Hadoop file system to store the data, and set the correct file ownership and permissions.

Before You Begin

Before following these instructions, you need to have a working Cloudera Quickstart cluster. This post shows how to set this up (connecting to Talend Open Studio isn’t required for this post, but will be necessary later on). Also, we’ll be using the Linux terminal quite a bit, because it makes the job significantly easier. This page gives a comprehensive overview of how the terminal works, and is a good reference if you get stuck.

A Brief Overview of the Hadoop File System

Hadoop manages files with the Hadoop Distributed File System (commonly abbreviated as HDFS). This is a logical file system that overlays the actual file system used by the operating system. Since Hadoop runs on Linux, all files in HDFS are ultimately stored by Linux as regular Linux files. However, instead of storing a huge data set as just one (Linux) file, HDFS stores it as many smaller files, even though to the user it still looks like one big file. Since it's much quicker to load and process many smaller files than one big one, HDFS lets you load and analyze large data sets much more quickly than a traditional database, without needing to deal with all the smaller files yourself.

The drawback is that you can't deal with the data set directly, either through the Linux terminal, or a GUI such as KDE or Gnome. Instead, you must use the specialized commands provided by Hadoop. These are modeled after Linux terminal commands, and will be familiar if you have used the terminal before: for example, ls lists the files in a directory within HDFS. Instead of running the commands directly, however, you pass them as parameters to Hadoop's hdfs command. The syntax is:
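    hdfs dfs -command [arguments]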

Here, hdfs tells Linux that we're working within Hadoop's file system, while dfs means that we are issuing a distributed file system command (technically, the command hdfs invokes the hdfs service, which lets you perform a wide range of file system and administration tasks). The -command parameter is a file system command such as ls, and the [arguments] are things like the directory to list, or the files to copy. For example, to list the files in the root directory on a Hadoop cluster, you can type:
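    hdfs dfs -ls /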

At least, that's how it works in theory: in practice, you usually need to run the hdfs command as another user who has the appropriate permissions on the Hadoop file system. By default, directories in HDFS are owned by the hdfs user: this is a Linux service account that Hadoop grants access to its file system. You can change directory ownership to another user, but during initial setup, you'll need to run your Hadoop file system commands as the hdfs user. You do this with the Linux command sudo: for example, to list the files in the root directory on a Hadoop cluster with default permissions, you'll need to type:
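    sudo -u hdfs hdfs dfs -ls /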

Let’s step through this: the sudo -u hdfs part tells Linux to run the following command as the user hdfs. In this case, that command is also called hdfs, and we pass it the parameters dfs -ls / to tell it to list the root directory. This is admittedly confusing: just remember that hdfs is in the command string twice because it is both the name of the command and the name of the user running the command. Contrast this with listing a directory owned by the user cloudera:
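    # no sudo needed here, assuming you are logged in as cloudera
    hdfs dfs -ls /user/cloudera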

We’ll discuss how Hadoop permissions and Linux permissions interact in more detail later in this post. For now, I’ll just say that whenever anything goes wrong on a Hadoop cluster under my administration, the first thing I check is whether the executing process has the correct permissions. Also, you can find documentation on the Hadoop file system commands on this page:

Apache Hadoop File System Command Documentation

Step One: Moving Files From the Host Computer to VirtualBox

Before we can move files to the Hadoop cluster, we need to get them from the host computer into the guest machine running on Virtual Box. The most straightforward way to do this is to set up File Sharing in VirtualBox. To do this, first start the Quickstart VM, then open Machines | Settings | Shared Folders, and you’ll see this dialogue box:

Virtual Box File Sharing Dialogue Box 1

Click the folder icon that has a “+” on it (on the right border of the dialogue box) and browse to the folder that you want to share with the guest machine. I like to create a folder just for this purpose called VirtualBoxShare. After you’ve selected the folder, check the Auto-mount and Make Permanent boxes, then click OK. Here is what the dialogue box should look like.

Virtual Box File Sharing Dialogue Box 2

Now, even though you selected auto-mount, you still need to mount the shared folder yourself the first time. To do this, open a Terminal window and type the following commands:
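    # List the shared folders visible to the guest (requires the VirtualBox Guest
    # Additions; the exact command may vary with your VirtualBox version).
    sudo VBoxControl sharedfolder list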

This tells VirtualBox to list the folders on the host machine that are available for sharing; you should see the name of the folder you selected. This is the name that VirtualBox has assigned to this folder: you’ll use this name with the mount command, rather than the full Windows path (which is what I, at least, expected). To mount the folder, type the following commands:
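    # Create a mount point, then mount the shared folder (names from this example).
    mkdir -p ~/share
    sudo mount -t vboxsf VirtualBoxShare ~/share
    # If you hit permission errors, retry with: -o uid=cloudera,gid=cloudera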

If all goes as planned, this should mount your selected folder (VirtualBoxShare in this example) in the directory ~/share. You should now be able to access files stored in the VirtualBoxShare folder on the host machine through the ~/share directory on the guest machine. You can test this by moving a file into the VirtualBoxShare folder, then listing the ~/share directory. Here is what this looks like on my machine.

Virtual Box Terminal Window Listing Shared File

Step Two: Moving Files from the Guest Machine to the Hadoop Cluster

Once file sharing is set up, you can access files from the guest machine within Virtual Box through the ~/share folder. The next step is to load them into the Hadoop cluster itself (recall that Hadoop uses its own file system that overlays the Linux file system). Hadoop provides the copyFromLocal command to do this: however, since we have access to Cloudera Manager, we can use it instead of the Terminal. First, though, we need to create a directory structure within the Hadoop file system and assign the correct file permissions, and that is best done from the terminal.

There are not just entire books, but entire collections of books on data warehouse design (Kimball is my personal favorite). Here, I'm going to present a simple architecture that can be easily extended to accommodate a wide variety of data. We will do all our work as the user cloudera, and store all our data in the /user/cloudera/datalake directory. This directory will have three subdirectories called ingest, stage, and final; in turn, each of these will have one subdirectory for each data set we are working with. Here is how this looks for the stockdata data set that we will work with later on:

Cloudera Quickstart Data Lake Directory Structure

As can be seen, the contents of the /user/cloudera/datalake/ingest/stockdata directory are the actual stock data files themselves. In this case, these files are named after the stock ticker symbol, and are *.csv files that contain daily market data on each stock. Cloudera Manager lets you view the files just by clicking on them: here is what AAPL (Apple) looks like:

Cloudera Quickstart Data Lake AAPL File Contents

From now on, we’ll refer to the directories in the cluster where we store our data as the data lake. The first step in building the data lake is to create the directory structure: for that, we’ll use the mkdir command provided by HDFS. Open a Terminal window and type the following:
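    # Create the data lake directories (paths follow the layout described above).
    sudo -u hdfs hdfs dfs -mkdir -p /user/cloudera/datalake/ingest
    sudo -u hdfs hdfs dfs -mkdir -p /user/cloudera/datalake/stage
    sudo -u hdfs hdfs dfs -mkdir -p /user/cloudera/datalake/final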

This will create the basic architecture. We will copy the data from the guest machine (i.e. the Linux file system) into the ingest folder: this will be a straight copy, with no validation or transformations. From there, we’ll copy the data from the ingest directory to the stage directory, applying any data validation and transformations in the process. Finally, we will copy the data from the stage directory to the final directory: this will be the data used in production work, for reporting and analysis. Of course, we can also analyze the raw data itself: for example, we can build a machine learning model that uses features of the data set that we don’t need for reporting.

The data set we'll be working with is called stockdata, so go ahead and create subdirectories for it under ingest, stage, and final:
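    sudo -u hdfs hdfs dfs -mkdir -p /user/cloudera/datalake/ingest/stockdata
    sudo -u hdfs hdfs dfs -mkdir -p /user/cloudera/datalake/stage/stockdata
    sudo -u hdfs hdfs dfs -mkdir -p /user/cloudera/datalake/final/stockdata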

At this point, we have created the directories we need to load and transform the stock data. You can verify this using Cloudera Manager by opening the Hue service from the main screen, then clicking on Hue Web UI in the menu bar, followed by File Browser. This will open a GUI browser showing the folders in /user/cloudera. Navigation is as you’d expect: just click on a folder to see its contents.

There’s one more step before we can upload data into the Hadoop cluster. If you look at the directories you created, you’ll see that they are all owned by the user hdfs: this makes sense, because we created them as that user. However, if you try and upload data through the Hue File Browser, you’ll get a permissions error, because you are logged in as cloudera, and don’t have write access to the folder. There are a few different ways to resolve this, but first, let’s talk a bit about rights management in Hadoop and Linux.

A Brief Overview of Hadoop (and Linux) Permissions

Since Hadoop is a set of services that run on Linux, Hadoop permissions are ultimately resolved into Linux permissions; that said, for practical purposes, Hadoop and Linux are two distinct domains. You can have permission to access a file on Linux, but not be able to access it through HDFS (and vice versa). Furthermore, the Linux root user has no special privileges within the Hadoop domain: HDFS treats it just like any other user. This means that even logging in as root is not guaranteed to give you access to a file within HDFS.

Both Linux and HDFS set permissions on individual files (and directories, which are considered files within both domains). File permissions determine whether the file can be read, modified, or executed by the file’s owner, users that belong to the file’s group, and everyone else. In Linux, file permissions are set by the chmod command, and file ownership by the chown command; the corresponding commands on HDFS have the same names, but are passed as parameters to the hdfs command (just like other HDFS commands). To change the permissions of the file myFile so that it can be read by everyone, you type:
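    # HDFS version (run as the hdfs superuser while the file is still owned by hdfs):
    sudo -u hdfs hdfs dfs -chmod a+r myFile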

Similarly, to change the owner and group of the file to cloudera, you type:
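    sudo -u hdfs hdfs dfs -chown cloudera:cloudera myFile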

For the exact syntax, you can refer to this page on Understanding Linux File Permissions.

The challenge with Hadoop permissions is that most file system commands you execute through the console need to run under the hdfs user: however, you should run reporting and analysis jobs as another user (such as cloudera) that has a login shell and password. One solution is to give cloudera access to the supergroup group, which is the group that files owned by HDFS belong to by default. To do this, you need to create supergroup as a Linux group, then add cloudera to it. This will work, but I prefer to change the ownership of files that will be used by cloudera to that user. For our purposes, this includes every file stored in the /user/cloudera/datalake directory (i.e. the data lake proper). You can do this with the command:
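    sudo -u hdfs hdfs dfs -chown -R cloudera:cloudera /user/cloudera/datalake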

This is just like the chown command shown earlier, with one crucial difference: the -R switch will change the ownership for all files in the directory and its subdirectories. After you run it, you will be able to access and analyze all files in the data lake as the user cloudera from both the Terminal and the HUE File Manager.

Hadoop permissions don't stop there: Cloudera Manager has its own permissions that are separate from both Linux and HDFS. Rather than being set on individual files, these are user-based, similar to database permissions that restrict which users can create and modify the data. By default, cloudera has Full Administrator permissions, so you shouldn't need to modify these for this series of posts: however, you can learn more about them at Cloudera's documentation if you are interested.

Loading Data into the Cluster with the HUE File Manager

Finally, you can load data into the data lake using the HUE File Manager. To do this, open the Hue service from the Cloudera Manager main screen, then click on Hue Web UI in the menu bar, followed by File Browser. After you’ve done that, click the Upload button in the upper-right corner to open the following dialogue box:

Cloudera Quickstart HUE File Upload Dialogue Box

Navigate to the VirtualBoxShare folder, select the file, and voila!

Conclusion

At this point, you should have a working data lake and the means to load data into it. However, the data isn’t in a form that’s easy to work with. Also, it would be nice to have an established workflow to load large data sets consisting of several hundred files into the data lake, rather than having to load each file with HUE.

We can solve the first problem by creating a Hive table that will let us query the data using a variant of SQL; for the second, we can use Talend Open Studio’s excellent data ingestion components. The next post will show how to do this with data on stock closing prices, and provide a framework that you can adapt to load, transform, and analyze other data sets.

If you don’t have a working data lake–that is, if these steps aren’t working for you, or you’re running into a problem you can’t resolve–feel free to use the form below to let me know, and I’ll do my best to help if I can.

Things not working as they should? Let me know about it!!

Connecting Talend Open Studio to Cloudera Quickstart

Cloudera's Quickstart virtual machine is an excellent way to try out Hadoop without going through the hassle of configuring the packages yourself: it's like the "live CD" of the Hadoop world. Even in real-world scenarios, the freedom to develop within a virtual environment streamlines the development process, and makes it safe to experiment: if something goes wrong, you can just revert to the last snapshot where things actually worked.

Talend's products are well suited for designing ETL workflows between Hadoop and related technologies, and Talend provides a free version of its flagship product, Talend Open Studio. In this post, I'll show you how to set up Talend Open Studio to access the Cloudera Quickstart virtual Hadoop cluster.

Before You Begin

For this project, you'll need enough memory to run Talend Open Studio, VirtualBox, and the Cloudera VM simultaneously. You'll need to assign the VM at least 8 GB of RAM to do anything useful with it: Open Studio and VirtualBox will take another 1 GB simply to load, and on my Windows 10 laptop, Windows itself seems to take another 1 GB on average. Realistically, you'll need 16 GB or more of memory to make this work.

Download and Install the Software

You'll need to download and install the Cloudera Quickstart VM, Talend Open Studio, and VirtualBox. Open Studio and VirtualBox are Windows applications, while the Quickstart VM is a virtual machine definition that you will import into VirtualBox. The following links are current as of this post:

VirtualBox has an installer, so just run through it, keeping the defaults unless you have a good reason to change them.

The Talend Open Studio download is actually an archive of the entire program: you “install” it by unzipping the archive into whatever directory you want the program to run from. I recommend making a c:\Talend directory and extracting the files there. Do not put it in c:\Program Files or another one of Windows’s privileged folders, because Open Studio relies heavily on third-party libraries, which it downloads and installs as needed. Also, it’s difficult to move the Talend workspace directory after starting your first project, so make sure wherever you extract the program is where you want it to stay.

The Cloudera Quickstart VM is a fully-functional version of CentOS with standard Hadoop packages installed. It also has a version of Cloudera Manager, which you can use to manage the cluster. Cloudera provides the VM for VirtualBox as an *.ovf file, so you don’t need to create a new virtual machine in VirtualBox. Instead, double-click the file, and VirtualBox will open it and present you with a list of configuration settings: change the number of processor cores to 2, and increase the RAM to 8 GB (8192 MB); leave everything else the same.

Connecting Open Studio to the Quickstart VM

Now, open VirtualBox, select the Quickstart VM, and power it on. Then open Talend Open Studio and navigate to Metadata | Hadoop Cluster, then right-click and select Create Hadoop Cluster. You will see the following dialogue box: enter a name for your cluster, fight down the rush of guilt you feel from leaving the Purpose blank, then click Next.

Figure 1: New Hadoop Cluster Connection on repository – Step 1/2

Next, Talend will present you with different ways to enter the cluster's configuration information. We will import them from the cluster itself, so select "Cloudera" for the Distribution and leave the Version unchanged. Then, set the Option radio button to "Retrieve configuration from Ambari or Cloudera": this will tell Open Studio to connect to the cluster, instead of you needing to enter the configuration settings by hand. Here is what the dialogue box should look like:

Figure 2: Hadoop Configuration Import Wizard with Correct Selections

Click Next and enter the following settings as Manager credentials:

  • Manager URI (with port): http://quickstart.cloudera:7180/
  • Username: cloudera
  • Password: cloudera

Leave the other fields unchanged, then click Connect, and Open Studio will try to connect to the Quickstart VM: if it is successful, it will show “Cloudera Quickstart” in the Discovered clusters list box. Click Fetch and Open Studio will retrieve the configuration settings for HDFS, YARN, and other Hadoop services. Here is what the dialogue box will look like if this process is successful:

Figure 3: Configuration settings for Cloudera Quickstart VM.

Click Finish to see the settings that Talend Open Studio has imported in its metadata for the Cloudera Quickstart VM. These should all be correct. Under Authentication, be sure the User name is set to cloudera, so the dialogue box looks like this.

Figure 4. Imported Quickstart VM Configuration Settings

This last part is a bit tricky: click the Check Services button, and Open Studio will try to connect to the Quickstart cluster using the configuration settings it just imported. However, it won't be able to connect. This is because, while the settings are correct, the host machine (i.e. Windows) can't communicate directly with the guest virtual machine (i.e. the Quickstart VM): instead, the host communicates over a network protocol, just as if the guest was a physically separate machine. VirtualBox mediates this communication by listening to certain IP addresses on the host and routing communication between the host and guest as needed.

By default, it listens on localhost (i.e. 127.0.0.1), so we need to modify the Windows hosts file to associate the name quickstart.cloudera with the address 127.0.0.1. This tells Windows to send all networking requests for quickstart.cloudera to the localhost address 127.0.0.1, which is the address that VirtualBox uses to communicate between running virtual machines and the host computer. Here are instructions on how to edit the hosts file on different versions of Windows:

Modify Your Hosts File
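The entry itself is a single line (on most versions of Windows, the hosts file lives at C:\Windows\System32\drivers\etc\hosts):

    127.0.0.1    quickstart.cloudera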

After you save the modified file, click on Check Services again. This time, you should see the following dialogue box, indicating that Open Studio can communicate with the Quickstart VM.

Figure 5: Open Studio able to Communicate with Quickstart VM

Conclusion

At this point, you should be able to write jobs in Open Studio using the metadata for the Cloudera Quickstart VM. If you run into errors, follow the usual troubleshooting sequence (e.g. check basic connectivity, firewall rules, etc.); you can also contact me using the form below, and I’ll do my best to help.

Once things are up and running, you need to create a data lake and start ingesting files. The next tutorial shows just how to do that.

Things not working as they should? Let me know about it!