Hadoop Basics: What is It, and How Does it Work?
Today, organizations can centralize their data with Hadoop using familiar tools such as SQL and Python. However, it’s hard to take the first step unless you understand how Hadoop works. In this post, I’ll recount Hadoop’s origin story and the problem it was designed to solve. After that, I’ll discuss its two main parts:
- The Hadoop Distributed File System, which stores large data sets
- The MapReduce algorithm, which quickly analyzes them
After reading this post, you’ll have a better understanding of how Hadoop works, and the types of problems it is designed to solve. You can read more about these problems (and how they can damage your business) on this page. You can also arrange a free consultation to discuss your specific needs, and design a roadmap to meet them.
Why It’s Called Hadoop
Hadoop is the name of a stuffed elephant that was much beloved by the son of Doug Cutting, one of Hadoop's earliest designers. You can see both Mr. Cutting and the toy elephant here.
Programmers like in-jokes, and animal references are a recurrent theme. Because of this, Hadoop’s symbol is an elephant, and its components often refer to animals:
- Hive, which lets you query Hadoop like a traditional database, using SQL
- Pig, which analyzes unstructured data, such as comments on a web forum
- Mahout, which builds predictive models based on Hadoop
Pig gets its name because “pigs will eat anything”. Mahout is the Hindi word for “elephant trainer”: hence, Mahout trains the Hadoop elephant.
Why Hadoop Was Invented
Hadoop began with white papers Google published describing how it stored and processed its index of the World Wide Web. Today, Google is so dominant that we forget Yahoo!, AltaVista, ChaCha (!), and AskJeeves (!!!) were once genuine competition.
To get there, though, Google had to catalogue millions of websites and direct users to them in real time. Its programmers quickly learned that traditional databases like Oracle and SQL Server were too slow for the job.
To see why, imagine opening a very large Excel spreadsheet: it takes so long that you may question your decision to open it at all. Traditional tools must load an entire file before they can work with any of it. Hadoop was invented to load and analyze large data sets (like an index of all web sites) that traditional databases can’t handle.
There are two parts to Hadoop: the Hadoop file system, and the MapReduce algorithm. We’ll review each in turn.
How Hadoop Works, Part I: The File System
One solution to long loading times is to store data as many smaller files: for example, keeping each year of customer data in a separate file. This can work, but it also breaks up your data set. Imagine if we could store the data as many small files, but work with it as one large file.
This is what Hadoop’s Distributed File System (HDFS) does. The user sees one file, but “behind the scenes” HDFS stores the data as hundreds of smaller files. It also does all the bookkeeping, so that the only difference to the user is that the entire data set loads much faster. See the following graphic:
In technical terms, HDFS overlays the operating system’s own file system. Since Hadoop runs on Linux, the actual data is stored as a group of Linux files. We call HDFS a logical file system, because it looks like the data set is stored in one large file even when it is actually spread out among many different files.
This isn’t a new idea: computers already store larger files as a sequence of multiple blocks. This is done to use as much of the disk as possible, by filling in any “holes” left after files are deleted. As with HDFS, the user doesn’t see any of this, because the operating system (e.g. Windows) keeps track of the blocks belonging to each file.
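The logical-file idea can be sketched in a few lines of Python. This is a toy illustration, not real HDFS: a small class does the bookkeeping over several chunk files on disk, so the caller reads one continuous data set. The class and file names are invented for the example.

```python
import os
import tempfile

class LogicalFile:
    """Presents a sequence of chunk files as if they were one file."""
    def __init__(self, chunk_paths):
        self.chunk_paths = chunk_paths   # bookkeeping: which chunks, in what order

    def read_all(self):
        # Read each chunk in order and join them into one data set.
        parts = []
        for path in self.chunk_paths:
            with open(path) as f:
                parts.append(f.read())
        return "".join(parts)

# Demo: split one data set across three chunk files, then read it back as one.
tmpdir = tempfile.mkdtemp()
chunks = []
for i, part in enumerate(["Ackerman\n", "Codd\n", "Shannon\n"]):
    path = os.path.join(tmpdir, f"chunk_{i:02d}")
    with open(path, "w") as f:
        f.write(part)
    chunks.append(path)

logical = LogicalFile(chunks)
print(logical.read_all())   # the three records appear as one stream
```

Real HDFS does far more (replication, fault tolerance, block placement), but the user-facing illusion is the same: many physical files, one logical data set.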
How Hadoop Works, Part II: The MapReduce Algorithm
After loading the data, we want to do things with it. For example, Google’s search needs to find the best websites for each user: how is this possible when the data is split into hundreds of small files?
The answer is Hadoop’s MapReduce algorithm. Imagine your data is spread out in hundreds of small files. You don’t know which file stores which data. Because of this, you must analyze all the files: for example, searching every file for the right web site.
However, searching the files one by one would take too long. Instead, MapReduce searches them all simultaneously. See the following graphic:
Technically, this type of search is called a map: the user defines a function and applies it to a data set to get a new data set. Here, the function just returns TRUE if the record contains the name the user is searching for, and FALSE otherwise.
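Here is a minimal sketch of that map step in plain Python (no Hadoop required): a user-defined function is applied to every record, producing a new data set of TRUE/FALSE flags. The records and search term are made up for the example.

```python
# Toy data set: one record per "file" entry.
records = ["hadoop.apache.org", "example.com", "hadoop-tutorial.net"]

def matches(record, term="hadoop"):
    # The user-defined function: True if the record contains the search term.
    return term in record

# The map: apply the function to every record, yielding a new data set.
mapped = [matches(r) for r in records]
print(mapped)   # [True, False, True]
```

Hadoop runs the same idea at scale: each worker applies the function to the records in its own chunk, all at the same time.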
Fortunately, that’s not all a map can do: for example, it might return a 1 for every word in a document. Then, the program has a data set with one column listing every word, and another containing ones:
The program could pass this file to a reducer that totals the second column for each word. The reducer’s output would count how many times each word appeared in the document:
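The word-count pipeline described above can be sketched in a few lines of Python. The mapper emits a 1 for every word; the reducer totals the ones for each word. The sample sentence is invented for the example.

```python
from collections import defaultdict

document = "to be or not to be"

# Map: emit a (word, 1) pair for every word in the document.
pairs = [(word, 1) for word in document.split()]

# Reduce: total the second column for each word.
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(dict(counts))   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real Hadoop job, the mappers and reducers run on many machines, and the framework groups each word's pairs together before the reducers total them.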
By using both a mapper and a reducer, Hadoop’s MapReduce algorithm can do almost any type of analysis or calculation. However, this type of program can be challenging to write. For example, assume we have two files of customer names. We want to sort all six names in alphabetical order:
File 01: Ackerman, Codd, Shannon
File 02: Kernighan, Ritchie, Stroustrup
We can sort each file by itself, but we won’t end up with all six names in the correct order. To do that requires some understanding of the mathematics of how mapper and reducer functions interact.
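One standard way to finish the job, sketched here in plain Python rather than MapReduce: sort each file independently, then merge the two sorted runs into one globally ordered list. The file contents are the example names from above.

```python
import heapq

# The two "files" of customer names from the example.
file_01 = ["Ackerman", "Codd", "Shannon"]
file_02 = ["Kernighan", "Ritchie", "Stroustrup"]

# Sorting each file alone keeps the names in two separate orderings;
# a merge step is still needed to interleave the sorted runs.
merged = list(heapq.merge(sorted(file_01), sorted(file_02)))
print(merged)
# ['Ackerman', 'Codd', 'Kernighan', 'Ritchie', 'Shannon', 'Stroustrup']
```

This sort-then-merge pattern is essentially what a distributed sort does: the hard part in MapReduce is arranging for the merge to happen correctly across machines, which is exactly where the mathematics of mapper/reducer interaction comes in.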
This poses a challenge, because most programmers aren’t mathematicians. The good news is that there are tools to generate MapReduce code from more familiar languages, such as SQL: I’ll explore these in a future post. In the meantime, if you have any questions or comments on this post, please fill out the contact form below, and I’ll do my best to answer them.