Cloudera’s Virtual Hadoop Cluster, Part II: Creating a Data Lake

On Monday, I wrote about how to install and set up Cloudera’s virtual Hadoop cluster, the Cloudera Quickstart VM. Once the cluster is up and running, the next step is to build a data lake so that you can import data and analyze it. You can find guidance on how to do that at Part II: Building a Basic Data Lake for the Cloudera Quickstart VM. If you haven’t finished installing the Quickstart VM, you can find a step-by-step guide here.

Both Part I and Part II have a contact form at the bottom: if you get stuck, feel free to reach out, and I’ll do my best to help.

Try Out a Virtual Hadoop Cluster at Home for Free

One of the great things about Hadoop is that it’s open source: anyone can download the packages from the official Hadoop website and install them (brave souls can even look at the source code). That said, getting the different Hadoop services, such as the filesystem (HDFS), resource manager (YARN), and data processor (MapReduce), to work together often takes some effort. That effort grows if you also want Hive, HBase, and the other dozen or so services that most real-world clusters rely on.

Because of this, a handful of companies now bundle these services as a single, tested package. As I write this, Cloudera and Hortonworks share most of the market, although MapR is gaining ground fast (no doubt because its proprietary file system has significant advantages over Hadoop’s native file system). Cloudera, in particular, also offers a freely available virtual Hadoop cluster that uses its Cloudera Manager platform, which means that those interested in Hadoop can learn how it works without any upfront cost.

To that end, I’ve written a two-part series on getting started with the Cloudera Manager virtual cluster (named the Cloudera Quickstart VM). Both posts use Talend Open Studio, a popular data management tool that is also free to download. In the first post, I show how to connect to the Cloudera Quickstart VM using Talend Open Studio; in the second, I show how to begin creating a Hadoop data lake and populating it with data. You can find the first part of the series at the link below; I’ll post the second part on Thursday.

Happy programming!

Part I: Downloading Cloudera Quickstart and Getting Connected

Hadoop: A Brief Overview

If you work with technology, you’ve probably heard of Hadoop: it is an open-source implementation of ideas from white papers that Google published in 2003 and 2004, describing the technology Google used to index the Internet. Today, it is freely available through the Apache Software Foundation, and is used by companies from technology giants like Amazon and Facebook to more traditional businesses, such as Home Depot, Angie’s List, and Verizon.

These companies use Hadoop because it lets them store very large data sets and analyze them quickly. Consider that Google, whose technology inspired Hadoop, must provide accurate search results within a few seconds to users all over the world, and you can see the value that capability provides.

I’ve written a brief overview of this revolutionary technology, explaining the history of Hadoop and how its two main components (the Hadoop Distributed File System and the MapReduce programming model) work. After reading this, you’ll have a better understanding of one of the most important technologies in the Big Data landscape.

You can find the overview here:

Hadoop Basics: What is It, and How Does it Work?