One of the great things about Hadoop is that it’s open source: anyone can download the packages from the official Hadoop website and install them (brave souls can even look at the source code). That said, getting the different Hadoop services, such as the filesystem (HDFS), resource manager (YARN), and data processor (MapReduce), to work together often takes some effort. That effort grows if you also want Hive, HBase, and the other dozen or so services that form the core processing engine of most real-world clusters.
Because of this, a handful of companies now bundle these services as a single, proven package. As I write this, Cloudera and Hortonworks share most of the market, although MapR is gaining ground fast (no doubt because its proprietary file system has significant advantages over Hadoop’s native file system). Cloudera, in particular, also offers a freely available virtual Hadoop cluster that uses its Cloudera Manager platform, which means that anyone interested in Hadoop can learn how it works without any upfront cost.
To that end, I’ve written a two-part series on getting started with the Cloudera Manager virtual cluster (named the Cloudera Quickstart VM). Both posts use Talend Open Studio, a popular data management tool that is also free to download. In the first post, I show how to connect to the Cloudera Quickstart VM using Talend Open Studio; in the second, I show how to begin creating a Hadoop data lake and populating it with data. You can find the first part of the series at the link below; I’ll post the second part on Thursday.