Connecting Talend Open Studio to Cloudera Quickstart

Cloudera’s Quickstart virtual machine is an excellent way to try out Hadoop without going through the hassle of configuring the packages yourself: it’s like the “live CD” of the Hadoop world. Even in real-world scenarios, the freedom to develop within a virtual environment streamlines the development process and makes it safe to experiment: if something goes wrong, you can just revert to the last snapshot where things actually worked.

Talend’s set of products is well-suited for designing ETL workflows between Hadoop and related technologies, and Talend provides a free version of its flagship product, Talend Open Studio. In this post, I’ll show you how to set up Talend Open Studio to access the Cloudera Quickstart virtual Hadoop cluster.

Before You Begin

For this project, you’ll need enough memory to run Talend Open Studio, VirtualBox, and the Cloudera VM simultaneously. You’ll need to assign the VM at least 8 GB of RAM to do anything useful with it. Open Studio and VirtualBox will take another 1 GB simply to load, and on my Windows 10 laptop, Windows itself seems to take another 1 GB on average. That is roughly 10 GB before you run a single job, so realistically you’ll need 16 GB or more of memory to make this work.

Download and Install the Software

You’ll need to download and install the Cloudera Quickstart VM, Talend Open Studio, and VirtualBox. Open Studio and VirtualBox are Windows applications, while the Quickstart VM is a virtual machine definition that you will import into VirtualBox. The following links are current as of this post:

VirtualBox has an installer, so just run through it, keeping the defaults unless you have good reason to change them.

The Talend Open Studio download is actually an archive of the entire program: you “install” it by unzipping the archive into whatever directory you want the program to run from. I recommend making a c:\Talend directory and extracting the files there. Do not put it in c:\Program Files or another one of Windows’s privileged folders, because Open Studio relies heavily on third-party libraries, which it downloads and installs as needed. Also, it’s difficult to move the Talend workspace directory after starting your first project, so make sure wherever you extract the program is where you want it to stay.

The Cloudera Quickstart VM is a fully-functional version of CentOS with standard Hadoop packages installed. It also has a version of Cloudera Manager, which you can use to manage the cluster. Cloudera provides the VM for VirtualBox as an *.ovf file, so you don’t need to create a new virtual machine in VirtualBox. Instead, double-click the file, and VirtualBox will open it and present you with a list of configuration settings: change the number of processor cores to 2, and increase the RAM to 8 GB (8192 MB); leave everything else the same.
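If you would rather script the import than click through the dialog, VirtualBox also ships a command-line tool, VBoxManage, whose import subcommand accepts the same two settings. The sketch below only builds and prints the command; the appliance file name is a hypothetical placeholder, and the use of --vsys 0 assumes the appliance contains a single virtual system.

```python
import subprocess


def build_import_command(ovf_path, cpus=2, memory_mb=8192):
    """Return the VBoxManage argument list for importing an appliance."""
    return [
        "VBoxManage", "import", ovf_path,
        "--vsys", "0",               # assumption: the VM is the first virtual system in the appliance
        "--cpus", str(cpus),         # processor cores, as recommended above
        "--memory", str(memory_mb),  # RAM in MB (8192 MB = 8 GB)
    ]


def import_appliance(ovf_path):
    """Run the import; raises CalledProcessError if VBoxManage reports failure."""
    subprocess.run(build_import_command(ovf_path), check=True)


# Hypothetical file name: substitute the name of the .ovf file you downloaded.
print(" ".join(build_import_command("cloudera-quickstart-vm.ovf")))
```

Calling import_appliance with the real path runs the import itself, provided VBoxManage is on your PATH.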

Connecting Open Studio to the Quickstart VM

Now, open VirtualBox, select the Quickstart VM, and power it on. Then open Talend Open Studio and navigate to Metadata | Hadoop Cluster, then right-click and select Create Hadoop Cluster. You will see the following dialogue box: enter a name for your cluster, fight down the rush of guilt you feel from leaving the Purpose blank, then click Next.

Figure 1: New Hadoop Cluster Connection on repository – Step 1/2

Next, Talend will present you with different ways to enter the cluster’s configuration information. We will import it from the cluster itself, so select “Cloudera” for the Distribution and leave the Version unchanged. Then, set the Option radio button to “Retrieve configuration for Ambari or Cloudera”: this tells Open Studio to connect to the cluster and read the settings, so you don’t have to enter them by hand. Here is what the dialogue box should look like:

Figure 2: Hadoop Configuration Import Wizard with Correct Selections

Click Next and enter the following settings as Manager credentials:

  • Manager URI (with port): http://quickstart.cloudera:7180/
  • Username: cloudera
  • Password: cloudera

Leave the other fields unchanged, then click Connect, and Open Studio will try to connect to the Quickstart VM: if it is successful, it will show “Cloudera Quickstart” in the Discovered clusters list box. Click Fetch and Open Studio will retrieve the configuration settings for HDFS, YARN, and other Hadoop services. Here is what the dialogue box will look like if this process is successful:

Figure 3: Configuration settings for Cloudera Quickstart VM.
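If the Connect step fails, it can help to confirm outside of Open Studio that Cloudera Manager answers at the Manager URI with the same credentials. The sketch below is one way to do that in Python. It assumes that Cloudera Manager exposes its REST API under /api/version on the same port and accepts HTTP basic authentication (that is my assumption about the API, not something Open Studio shows you), and that the name quickstart.cloudera resolves from the machine running the script.

```python
import base64
import urllib.request


def build_version_request(manager_uri, username, password):
    """Build a basic-auth GET request for the manager's API version endpoint."""
    # Assumption: the REST API lives under /api/version on the manager port.
    url = manager_uri.rstrip("/") + "/api/version"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": "Basic " + token})


def fetch_api_version(manager_uri, username, password, timeout=10):
    """Return the response body as text; raises URLError if the manager is unreachable."""
    request = build_version_request(manager_uri, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

With the VM running, fetch_api_version("http://quickstart.cloudera:7180/", "cloudera", "cloudera") should return a short version string; a URLError instead points to a networking problem rather than a Talend one.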

Click Finish to see the settings that Talend Open Studio has imported in its metadata for the Cloudera Quickstart VM. These should all be correct. Under Authentication, be sure the User name is set to cloudera, so the dialogue box looks like this.

Figure 4. Imported Quickstart VM Configuration Settings

This last part is a bit tricky: click on the Check Services button, and Open Studio will try to connect to the Quickstart cluster using the configuration settings it just imported. However, it won’t be able to connect. This is because, while the settings are correct, the host machine (i.e. Windows) can’t communicate directly with the guest virtual machine (i.e. the Quickstart VM): instead, the host communicates over a network protocol, just as if the guest were a physically separate machine. VirtualBox mediates this communication by listening to certain IP addresses on the host and routing communication between the host and guest as needed.

By default, it listens on localhost (i.e. 127.0.0.1), so we need to modify the Windows hosts file to associate the name quickstart.cloudera with the address 127.0.0.1. This tells Windows to send all networking requests for quickstart.cloudera to the localhost address 127.0.0.1, which is the address that VirtualBox uses to communicate between running virtual machines and the host computer. Here are instructions on how to edit the hosts file on different versions of Windows:

Modify Your Hosts File
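The line to add is simply the address followed by the name: 127.0.0.1 quickstart.cloudera. To double-check that the mapping is present after editing, a small Python sketch like the one below can scan the file. The path shown is the usual location of the hosts file on Windows; the function names are my own, and the script only reads the file, so it does not need administrator rights.

```python
from pathlib import Path

# Usual location of the hosts file on Windows.
HOSTS_PATH = Path(r"C:\Windows\System32\drivers\etc\hosts")


def has_host_entry(hosts_text, hostname, address):
    """Return True if hosts_text maps hostname to address on an uncommented line."""
    wanted = hostname.lower()
    for line in hosts_text.splitlines():
        # Drop anything after '#', then split into the address and its names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == address:
            if wanted in (name.lower() for name in fields[1:]):
                return True
    return False


def hosts_file_is_ready(path=HOSTS_PATH):
    """Check the real hosts file for the quickstart.cloudera mapping."""
    return has_host_entry(path.read_text(), "quickstart.cloudera", "127.0.0.1")


sample = "# sample hosts file\n127.0.0.1 localhost\n127.0.0.1 quickstart.cloudera\n"
print(has_host_entry(sample, "quickstart.cloudera", "127.0.0.1"))  # prints True
```

Run hosts_file_is_ready() on the Windows host after saving your changes; False means the entry is missing or commented out.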

After you save the modified file, click on Check Services again. This time, you should see the following dialogue box, indicating that Open Studio can communicate with the Quickstart VM.

Figure 5: Open Studio able to Communicate with Quickstart VM

Conclusion

At this point, you should be able to write jobs in Open Studio using the metadata for the Cloudera Quickstart VM. If you run into errors, follow the usual troubleshooting sequence (e.g. check basic connectivity, firewall rules, etc.); you can also contact me using the form below, and I’ll do my best to help.
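For the basic connectivity check, a quick way to see which services the host can actually reach is to attempt a TCP connection to each port. The following is a minimal sketch: port 7180 is the Cloudera Manager port used earlier, while any other ports you pass in (for example 8020 for the HDFS NameNode, a common default) are assumptions you should compare against the settings Open Studio imported.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and names that do not resolve.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to whether it accepted a connection."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

For example, check_ports("quickstart.cloudera", [7180, 8020]) returns a dictionary of port numbers and booleans. If every port reports False, look at the hosts file and the VM's network settings first; if only some do, the corresponding service may not be running inside the VM.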

Once things are up and running, you need to create a data lake and start ingesting files. The next tutorial shows just how to do that.

Things not working as they should? Let me know about it!