Understanding MapReduce

As Google describes it, MapReduce is a programming model and an associated implementation for processing and generating large data sets. MapReduce is at the heart of Hadoop and Big Data analytics, and it scales to very large amounts of data on clusters of commodity PCs.

MapReduce consists of two major tasks, Map and Reduce, which together do all the data processing. The first is the Map task, which takes the given input, fed to it as key/value pairs, and generates intermediate key/value pairs according to the problem being solved. The second is the Reduce task, which takes the Map task’s key/value output and combines it into a smaller set of final output.

A MapReduce Example:

Say we have a collection of sports documents containing the scores each sportsperson earned in the various games of his or her career. We have to find the highest score of each sportsperson. Each document can contain more than one entry for a particular sportsperson. Scores are recorded in the documents as key/value pairs: the sportsperson's name is the key and the score is the value.

For simplicity we take only three sportspeople and their scores. Here is one sample document:

Mike, 100

Joseph, 150

Robert, 200

Mike, 130

Joseph, 90

Robert, 160

We take four more documents of the same form and process all five with MapReduce. First, the Map job is fed these five documents and generates intermediate key/value pairs as output.

Here are the sample results of the Map tasks for those five documents.

The result of the first document (given above) is:

(Mike, 130) (Joseph, 150) (Robert, 200)
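
To make this step concrete, here is a minimal Python sketch of what such a Map task could look like for the sample document. The function name map_document and the line-by-line parsing are assumptions made for illustration; they are not part of Hadoop or any other real MapReduce API.

```python
def map_document(text):
    """Parse 'name, score' lines and keep the highest score seen per name."""
    best = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        name, score = line.split(",")
        name, score = name.strip(), int(score)
        # Keep only the highest score for each person within this document.
        if name not in best or score > best[name]:
            best[name] = score
    return list(best.items())


# The sample document from the article.
sample_document = """Mike, 100
Joseph, 150
Robert, 200
Mike, 130
Joseph, 90
Robert, 160"""

first_result = map_document(sample_document)
print(first_result)  # [('Mike', 130), ('Joseph', 150), ('Robert', 200)]
```

Picking the per-document maximum inside the Map step keeps the intermediate output small; in Hadoop this kind of local pre-aggregation is typically the job of a combiner.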

Note that the Map task emitted only the highest score of each person found in its document.

In the same way, the other four Map tasks (their input documents are not shown, for simplicity) produced the following results:

(Joseph, 100) (Robert, 150) (Mike, 130)

(Joseph, 170) (Robert, 110)

(Mike, 120) (Joseph, 155) (Robert, 220)

(Mike, 170) (Joseph, 130) (Robert, 250)

The five Map tasks have now produced five sets of key/value pairs, which are fed to the Reduce job to generate the final result. There can be five Reduce tasks or fewer, but the framework coordinates them so that together they produce one final result. Once all the results from the Map tasks are processed, the Reduce job produces the following result, containing the highest score of each sportsperson:

(Mike, 170) (Joseph, 170) (Robert, 250)
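
The Reduce step can be sketched in Python as follows, using the five Map results listed in this example as input. The helper names shuffle and reduce_scores are hypothetical; a real framework performs the grouping by key internally.

```python
from collections import defaultdict

# The five Map results from the example.
map_outputs = [
    [("Mike", 130), ("Joseph", 150), ("Robert", 200)],
    [("Joseph", 100), ("Robert", 150), ("Mike", 130)],
    [("Joseph", 170), ("Robert", 110)],
    [("Mike", 120), ("Joseph", 155), ("Robert", 220)],
    [("Mike", 170), ("Joseph", 130), ("Robert", 250)],
]


def shuffle(outputs):
    """Group all intermediate values by key, across every Map result."""
    grouped = defaultdict(list)
    for pairs in outputs:
        for name, score in pairs:
            grouped[name].append(score)
    return grouped


def reduce_scores(name, scores):
    """Combine all scores for one person into the single highest score."""
    return name, max(scores)


final_result = [reduce_scores(n, s) for n, s in shuffle(map_outputs).items()]
print(final_result)  # [('Mike', 170), ('Joseph', 170), ('Robert', 250)]
```

Because each name is reduced independently of the others, the calls to reduce_scores could be spread over several Reduce tasks without changing the result.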

As the name MapReduce implies, the Reduce job always runs after the Map job. Most importantly, all of this work is done in parallel on a cluster of commodity machines. In our example, the five Map tasks first run in parallel to produce five results, and those results are fed into the Reduce job, which also runs in parallel to produce the final result. Because of this high degree of parallelism and scalability, very large amounts of data can be processed in a reasonable amount of time.

In a real MapReduce environment, MapReduce is a software framework with an interface for plugging Map and Reduce functions into it. Usually the user writes his or her own Map and Reduce logic and passes it to the framework, which uses that logic to process the input data and produce the output data. The framework hides most of the complexity of parallelization, fault tolerance, data distribution and load balancing in the library and lets the user focus only on the Map and Reduce logic.
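
The plug-in idea can be illustrated with a toy, single-machine sketch in Python. Everything here is an assumption for illustration: run_mapreduce, emit_scores and highest are hypothetical names, and a thread pool stands in for the cluster. A real framework such as Hadoop additionally handles data distribution and fault tolerance, which this sketch does not attempt.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(inputs, mapper, reducer, workers=4):
    """Toy framework: the caller supplies only the mapper and the reducer."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: every input chunk is independent, so chunks run in parallel.
        map_outputs = list(pool.map(mapper, inputs))

        # Shuffle: group the intermediate values by key.
        grouped = defaultdict(list)
        for pairs in map_outputs:
            for key, value in pairs:
                grouped[key].append(value)

        # Reduce phase: every key is independent, so keys run in parallel too.
        futures = {k: pool.submit(reducer, k, v) for k, v in grouped.items()}
        return {k: f.result() for k, f in futures.items()}


# User-supplied logic, plugged into the framework.
def emit_scores(chunk):
    """Map: emit one (name, score) pair per entry."""
    return [(name, score) for name, score in chunk]


def highest(name, scores):
    """Reduce: keep the highest score for one person."""
    return max(scores)


# The sample document, split into two input chunks for illustration.
chunks = [
    [("Mike", 100), ("Joseph", 150), ("Robert", 200)],
    [("Mike", 130), ("Joseph", 90), ("Robert", 160)],
]

result = run_mapreduce(chunks, emit_scores, highest)
print(result)  # {'Mike': 130, 'Joseph': 150, 'Robert': 200}
```

Swapping highest for a reducer that returns sum(scores) would compute total scores instead, without touching run_mapreduce; that separation is the point of the framework interface.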

