By Michael Burke | September 30, 2019
In my last post, we talked about Hadoop's secret sauce, clustering, which enables it to handle massive datasets, up to yottabytes (yeah, that's a real thing: a septillion bytes). In this post we'll talk about another main ingredient in Hadoop's secret sauce: MapReduce.
Hadoop uses a distinctive technique called MapReduce to process data and get it into usable form. Unlike many other things in computer science, MapReduce does what it says: it "maps" out the data you need, then "reduces" it to a form you can use. The 'map' step collects the data relevant to your query, which has been sliced apart and stored across the various nodes of your cluster (a fancy way of saying your data lives on a bunch of different computers, and Hadoop fetches it for you). The 'reduce' step then takes that mapped-out data and spits out exactly what you asked for.
Example: Mapping and Reducing all the Johns in San Jose
So if I wanted to find out how many people named "John" there are in the San Jose phone database, Hadoop would "map" all the Johns across all the nodes, then "reduce" that to a count of the total number of Johns. Importantly, the data itself doesn't change during this process; only our view of it changes. And that's part of the advantage of Hadoop in general: we get a very simple view of data that is actually dispersed in very complex ways.
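To make the idea concrete, here's a toy sketch of the Johns example in plain Python. This is not Hadoop's actual API; the record format, function names, and node lists are made up for illustration, and in a real cluster each "node" list would be data stored on a separate machine.

```python
# Pretend each list is the slice of the phone database stored on one node.
node_1 = ["John Smith", "Maria Garcia", "John Lee"]
node_2 = ["Wei Chen", "John Park"]
node_3 = ["Ana Silva", "Priya Patel"]

def map_phase(records):
    """Emit a (key, 1) pair for every record whose first name is John."""
    return [("John", 1) for name in records if name.split()[0] == "John"]

def reduce_phase(mapped_pairs):
    """Sum the counts for the 'John' key to get the total."""
    return sum(count for _key, count in mapped_pairs)

# Each node runs the map step on its own slice of the data...
mapped = []
for node in (node_1, node_2, node_3):
    mapped.extend(map_phase(node))

# ...and the reduce step combines the results into one answer.
total_johns = reduce_phase(mapped)
print(total_johns)  # 3
```

Notice that the map step never modifies the phone records themselves; it just emits a new view of them (key-value pairs) for the reduce step to combine.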
In old-school computing, work gets done sequentially: one cook in the kitchen (otherwise known as a CPU) working really fast. MapReduce is part of a newer generation of technologies that carry out many tasks in parallel, coordinating lots of cooks in the kitchen to get things done much faster than if they had to be done one after another.
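The "many cooks" idea can be sketched with Python's standard multiprocessing module. Hadoop coordinates work across whole machines rather than just CPU cores, and this example's data and function names are invented for illustration, but the principle is the same: each worker handles its own slice in parallel, and the partial results are combined at the end.

```python
from multiprocessing import Pool

def count_johns(records):
    """One 'cook' counts the Johns in its own slice of the data."""
    return sum(1 for name in records if name.split()[0] == "John")

if __name__ == "__main__":
    # Three slices of data, as if stored on three different nodes.
    slices = [
        ["John Smith", "Maria Garcia", "John Lee"],
        ["Wei Chen", "John Park"],
        ["Ana Silva", "Priya Patel"],
    ]
    # Three worker processes run the map step on all slices at once...
    with Pool(processes=3) as pool:
        per_slice_counts = pool.map(count_johns, slices)
    # ...and a final reduce step adds up the partial results.
    print(sum(per_slice_counts))  # 3
```

With three slices and three workers, each cook only touches a third of the data, which is exactly why this scales so much better than one fast cook.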
Will the Real MapReduce Please Stand Up?
One last point of clarification: there are two kinds of MapReduce! There's the processing paradigm we've been talking about, which you'll often see written as map+reduce, and then there's an actual software layer in Hadoop called MapReduce that carries out these functions. It's important to differentiate the two because other software can now do the map+reduce work, and some of it, such as Apache Tez, performs much better than the original MapReduce layer.
Related blog topics:
Making Sense of the Wild World of Hadoop
Databases rule the world–but what the heck are they, anyway?
What is Virtualization Anyway, and What Are Its “Virtues”?
What Exactly is Big Data, Anyway?
Removing “The Cloud” from the Mysterious Ether