Cloudera Quickstart: Building a Basic Data Lake and Importing Files
In an earlier post, I described how to install Cloudera’s Quickstart virtual Hadoop cluster and connect to it with Talend Open Studio.
Now, I’ll describe how to configure your cluster so you can easily load data into it. In a future post, we’ll use Talend Open Studio to import stock market data and prepare it for later analysis.
First, we need to transfer the data from your host computer into the guest machine running inside VirtualBox: we’ll enable File Sharing in VirtualBox to do this. Once that’s done, we can import the data into the virtual cluster itself. Along the way, we’ll create directories within the Hadoop file system to store the data, and set the correct file ownership and permissions.
Before You Begin
Before following these instructions, you need to have a working Cloudera Quickstart cluster. This post shows how to set this up (connecting to Talend Open Studio isn’t required for this post, but will be necessary later on). Also, we’ll be using the Linux terminal quite a bit, because it makes the job significantly easier. This page gives a comprehensive overview of how the terminal works, and is a good reference if you get stuck.
A Brief Overview of the Hadoop File System
Hadoop manages files with the Hadoop Distributed File System (commonly abbreviated as HDFS). This is a logical file system that overlays the actual file system used by the operating system. Since Hadoop runs on Linux, all files in HDFS are ultimately stored by Linux as regular Linux files. However, instead of storing a huge data set as just one (Linux) file, HDFS stores it as many smaller files, even though to the user it still looks like one big file. Since it’s much quicker to load and process many smaller files than one big one, HDFS lets you load and analyze large data sets much more quickly than a traditional database, without needing to deal with all the smaller files yourself.
The drawback is that you can’t work with the data set directly, either through the Linux terminal or a GUI such as KDE or Gnome. Instead, you must use the specialized commands provided by Hadoop. These are modeled after Linux terminal commands, and will be familiar if you have used the terminal before: for example, ls lists the files in a directory within HDFS. Instead of running the commands directly, however, you pass them as parameters to Hadoop’s hdfs command. The syntax is:
hdfs dfs -command [arguments]
hdfs tells Linux that we’re working within Hadoop’s file system, while dfs means that we are issuing a distributed file system command (technically, hdfs invokes the hdfs service, which lets you perform a wide range of file system and administration tasks). The -command parameter is a file system command such as ls, and the [arguments] are things like the directory to list, or the files to copy. For example, to list the files in the root directory on a Hadoop cluster, you can type:
hdfs dfs -ls /
At least, that’s how it works in theory: in practice, you usually need to run the hdfs command as another user who has the appropriate permissions on the Hadoop file system. By default, directories in HDFS are owned by the hdfs user: this is a Linux service account that Hadoop allows to access its file system. You can change directory ownership to another user, but during initial setup, you’ll need to run your Hadoop file system commands as the hdfs user. You do this with the Linux command sudo: for example, to list the files in the root directory on a Hadoop cluster with default permissions, you’ll need to type:
sudo -u hdfs hdfs dfs -ls /
Let’s step through this: the sudo -u hdfs part tells Linux to run the following command as the user hdfs. In this case, that command is also called hdfs, and we pass it the parameters dfs -ls / to tell it to list the root directory. This is admittedly confusing: just remember that hdfs appears in the command string twice because it is both the name of the command and the name of the user running the command. Contrast this with listing a directory owned by the user cloudera:
sudo -u cloudera hdfs dfs -ls /
We’ll discuss how Hadoop permissions and Linux permissions interact in more detail later in this post. For now, I’ll just say that whenever anything goes wrong on a Hadoop cluster under my administration, the first thing I check is whether the executing process has the correct permissions. Also, you can find documentation on the Hadoop file system commands in Apache’s File System Shell guide.
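For quick reference, here are the handful of HDFS file system commands this post relies on, in one place (a sketch: the paths and file names are placeholders, and each command assumes a running Quickstart cluster):

```shell
# List a directory in HDFS (run as the hdfs user until ownership is changed)
sudo -u hdfs hdfs dfs -ls /

# Create a directory in HDFS
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/example

# Copy a file from the Linux file system into HDFS
sudo -u hdfs hdfs dfs -put /tmp/data.csv /user/cloudera/example

# Print the contents of a file stored in HDFS
sudo -u hdfs hdfs dfs -cat /user/cloudera/example/data.csv
```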
Step One: Moving Files From the Host Computer to VirtualBox
Before we can move files to the Hadoop cluster, we need to get them from the host computer into the guest machine running on VirtualBox. The most straightforward way to do this is to set up File Sharing in VirtualBox. To do this, first start the Quickstart VM, then open Machine | Settings | Shared Folders, and you’ll see this dialogue box:
Click the folder icon that has a “+” on it (on the right border of the dialogue box) and browse to the folder that you want to share with the guest machine. I like to create a folder just for this purpose called VirtualBoxShare. After you’ve selected the folder, check the Auto-mount and Make Permanent boxes, then click OK. Here is what the dialogue box should look like.
Now, even though you selected auto-mount, you still need to mount the shared folder yourself the first time. To do this, open a Terminal window and type the following commands:
sudo VBoxControl sharedfolder list
This tells VirtualBox to list the folders on the host machine that are available for sharing; you should see the name of the folder you selected. This is the name that VirtualBox has assigned to the folder: you’ll use this name with the mount command, rather than the full Windows path (which is what I, at least, expected). To mount the folder, type the following commands:
mkdir ~/share
sudo mount -t vboxsf VirtualBoxShare ~/share
If all goes as planned, this should mount your selected folder (VirtualBoxShare in this example) at the directory ~/share. You should now be able to access files stored in the VirtualBoxShare folder on the host machine through the ~/share directory on the guest machine. You can test this by moving a file into the VirtualBoxShare folder, then listing the ~/share directory. Here is what this looks like on my machine.
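The test above can be sketched in two commands (test.txt is a hypothetical file you have dropped into the shared folder on the host):

```shell
# The file placed in the host's VirtualBoxShare folder should appear here
ls -l ~/share

# You can also confirm the vboxsf mount itself is active
mount | grep vboxsf
```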
Step Two: Moving Files from the Guest Machine to the Hadoop Cluster
Once file sharing is set up, the guest machine running in VirtualBox can access files from the host through the ~/share folder. The next step is to load them into the Hadoop cluster itself (recall that Hadoop uses its own file system that overlays the Linux file system). Hadoop provides the copyFromLocal command to do this; however, since we have access to Cloudera Manager, we can use it instead of the Terminal. First, though, we need to create a directory structure within the Hadoop file system and assign the correct file permissions, and that is best done from the terminal.
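For reference, the terminal equivalent is a one-line copy (a sketch: AAPL.csv is a hypothetical file name, and it assumes the ingest directory already exists with ownership set to cloudera, both of which we handle below):

```shell
# Copy one file from the Linux file system into the data lake's ingest zone
hdfs dfs -copyFromLocal ~/share/AAPL.csv /user/cloudera/datalake/ingest/stockdata/
```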
There are not just entire books, but entire collections of books on data warehouse design (Kimball is my personal favorite). Here, I’m going to present a simple architecture that can be easily extended to accommodate a wide variety of data. We will do all our work as the user cloudera, and store all our data in the /user/cloudera/datalake directory. This directory will have three subdirectories called ingest, stage, and final; in turn, each of these will have one subdirectory for each data set we are working with. Here is how this looks for the stockdata data set that we will work with later on:
As you can see, the contents of the /user/cloudera/datalake/ingest/stockdata directory are the actual stock data files themselves. In this case, the files are named after stock ticker symbols, and are *.csv files that contain daily market data on each stock. Cloudera Manager lets you view the files just by clicking on them: here is what AAPL (Apple) looks like:
From now on, we’ll refer to the directories in the cluster where we store our data as the data lake. The first step in building the data lake is to create the directory structure: for that, we’ll use the mkdir command provided by HDFS. Open a Terminal window and type the following:
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/ingest
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/stage
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/final
This will create the basic architecture. We will copy the data from the guest machine (i.e. the Linux file system) into the ingest folder: this will be a straight copy, with no validation or transformations. From there, we’ll copy the data from the ingest directory to the stage directory, applying any data validation and transformations in the process. Finally, we will copy the data from the stage directory to the final directory: this will be the data used in production work, for reporting and analysis. Of course, we can also analyze the raw data itself: for example, we can build a machine learning model that uses features of the data set that we don’t need for reporting.
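Once data lands in ingest, the promotions between zones can be sketched as plain HDFS copies (AAPL.csv is a hypothetical file name; in real use, the middle step would apply validation and transformations rather than copying verbatim):

```shell
# Stage: in practice, validate and transform here instead of a straight copy
hdfs dfs -cp /user/cloudera/datalake/ingest/stockdata/AAPL.csv \
    /user/cloudera/datalake/stage/stockdata/

# Final: promote the staged data for reporting and analysis
hdfs dfs -cp /user/cloudera/datalake/stage/stockdata/AAPL.csv \
    /user/cloudera/datalake/final/stockdata/
```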
The data set we’ll be working with is called stockdata, so go ahead and create subdirectories for it under ingest, stage, and final.
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/ingest/stockdata
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/stage/stockdata
sudo -u hdfs hdfs dfs -mkdir /user/cloudera/datalake/final/stockdata
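You can also verify the whole directory tree from the terminal with a recursive listing:

```shell
# Recursively list everything under the data lake root
sudo -u hdfs hdfs dfs -ls -R /user/cloudera/datalake
```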
At this point, we have created the directories we need to load and transform the stock data. You can verify this using Cloudera Manager by opening the Hue service from the main screen, then clicking on Hue Web UI in the menu bar, followed by File Browser. This will open a GUI browser showing the folders in /user/cloudera. Navigation is as you’d expect: just click on a folder to see its contents.
There’s one more step before we can upload data into the Hadoop cluster. If you look at the directories you created, you’ll see that they are all owned by the user hdfs: this makes sense, because we created them as that user. However, if you try and upload data through the Hue File Browser, you’ll get a permissions error, because you are logged in as cloudera, and don’t have write access to the folder. There are a few different ways to resolve this, but first, let’s talk a bit about rights management in Hadoop and Linux.
A Brief Overview of Hadoop (and Linux) Permissions
Since Hadoop is a set of services that run on Linux, Hadoop permissions are ultimately resolved into Linux permissions; that said, for practical purposes, Hadoop and Linux are two distinct domains. You can have permission to access a file on Linux, but not be able to access it through HDFS (and vice versa). Furthermore, the Linux root user has no special privileges within the Hadoop domain: HDFS treats it just like any other user. This means that even logging in as root is not guaranteed to give you access to a file within HDFS.
Both Linux and HDFS set permissions on individual files (and directories, which are treated as files in both domains). File permissions determine whether a file can be read, modified, or executed by the file’s owner, by users that belong to the file’s group, and by everyone else. In Linux, file permissions are set with the chmod command, and file ownership with the chown command; the corresponding HDFS commands have the same names, but are passed as parameters to the hdfs command (just like other HDFS commands). To change the permissions of the file myFile so that it can be read by everyone, you type:
sudo -u hdfs hdfs dfs -chmod +r myFile
Similarly, to change the owner and group of the file to cloudera, you type:
sudo -u hdfs hdfs dfs -chown cloudera:cloudera myFile
For the exact syntax, you can refer to this page on Understanding Linux File Permissions.
The challenge with Hadoop permissions is that most file system commands you execute through the console need to run as the hdfs user; however, you should run reporting and analysis jobs as another user (such as cloudera) that has a login shell and password. One solution is to add cloudera to the supergroup group, which is the group that files owned by hdfs belong to by default. To do this, you need to create supergroup as a Linux group, then add cloudera to it. This will work, but I prefer to change the ownership of the files that cloudera will use to that user. For our purposes, this means every file stored in the /user/cloudera/datalake directory (i.e. the data lake proper). You can do this with the command:
sudo -u hdfs hdfs dfs -chown -R cloudera:cloudera /user/cloudera/datalake
This is just like the chown command shown earlier, with one crucial difference: the -R switch changes the ownership of all files in the directory and its subdirectories. After you run it, you will be able to access and analyze all files in the data lake as the user cloudera, from both the Terminal and the HUE File Manager.
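For comparison, the supergroup alternative mentioned above would look something like this (a sketch using standard Linux user-management commands, run on the guest machine):

```shell
# Create supergroup as a Linux group, then add cloudera to it
sudo groupadd supergroup
sudo usermod -a -G supergroup cloudera
```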
Hadoop permissions don’t stop there: Cloudera Manager has its own permissions that are separate from both Linux and HDFS. Rather than being set on individual files, these are user-based, similar to database permissions that restrict which users can create and modify data. By default, cloudera has Full Administrator permissions, so you shouldn’t need to modify these for this series of posts; however, you can learn more about them in Cloudera’s documentation if you are interested.
Loading Data into the Cluster with the HUE File Manager
Finally, you can load data into the data lake using the HUE File Manager. To do this, open the Hue service from the Cloudera Manager main screen, then click on Hue Web UI in the menu bar, followed by File Browser. After you’ve done that, click the Upload button in the upper-right corner to open the following dialogue box:
Navigate to the VirtualBoxShare folder, select the file, and voila!
At this point, you should have a working data lake and the means to load data into it. However, the data isn’t in a form that’s easy to work with. Also, it would be nice to have an established workflow to load large data sets consisting of several hundred files into the data lake, rather than having to load each file with HUE.
We can solve the first problem by creating a Hive table that will let us query the data using a variant of SQL; for the second, we can use Talend Open Studio’s excellent data ingestion components. The next post will show how to do this with data on stock closing prices, and provide a framework that you can adapt to load, transform, and analyze other data sets.
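In the meantime, if you need to load many files without clicking through HUE, a simple shell loop over the shared folder will do (a sketch: it assumes the data lake directories exist and their ownership was changed to cloudera as described above):

```shell
# Upload every CSV in the shared folder to the ingest directory
for f in ~/share/*.csv; do
    hdfs dfs -copyFromLocal "$f" /user/cloudera/datalake/ingest/stockdata/
done
```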
If you don’t have a working data lake–that is, if these steps aren’t working for you, or you’re running into a problem you can’t resolve–feel free to use the form below to let me know, and I’ll do my best to help if I can.