This post talks about the Map Reduce Framework. Google has better documentation about Map Reduce, but this post is for the n00bs by a n00b.
- Map Reduce Introduction
- Explaination of different phases in Map Reduce using Example : Word count problem (pseudo code implementation in Javascript)
Problem
Consider 1000 documents and you are asked to count the occurrences of every word in these document. In the end you should have a simple table giving you the following.
"word" => "number of times it has occurred in our 1000 documents"
Solution
So how would you solve the above problem ..
- Create a
Hash[Key] = Value.. where Key = word and Value = 1 (start off with 1) - Load contents of File in memory
- For every word,
Hash[word] = Existing_Value+1 - Goto Step 2 and continue for the next 999 documents
- And Voila ! We have our result !
What is a Hash Table ?
Hash Table (a.k.a Hash, Map, Key Value pair) is a Data Structure similar to an Array. The advantage of a Hash is that values are stored in “key” as index instead of traditional numerical index as in Array.
The above works fine .. but imagine the time taken to solve the problem if there were more than billion documents. Damn !
To optimize the problem, we can do processing over 10 computers (a.k.a workers, nodes), by transferring 100 documents to each computer and let them do the processing simultaneously.. once complete, we can retrieve the results from each and merge the results to get our output ! This approach reduces the time by a factor of 10 ! i.e if we add more computers .. we can reduce the time factor by n [number of workers]
Google indexes billions of documents each do (something similar to our problem). So they introduced a framework to aid in processing such large data intensive problems … aka Map Reduce Framework
Meet Map Reduce
Map Reduce is all about dividing a large data problem into small independent sub problems. The o`l Divide and Conquer technique.
Design Issues addressed by Map Reduce Framework [Referenced from Data-Intensive Text Processing with MapReduce]
- How do we break up a large problem into smaller tasks? More specifically, how do we decompose the problem so that the smaller tasks can be executed in parallel?
- How do we assign tasks to workers distributed across a potentially large number of machines (while keeping in mind that some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.) ?
- How do we ensure that the workers get the data they need?
- How do we coordinate synchronization among the different workers?
- How do we share partial results from one worker that is needed by another?
- How do we accomplish all of the above in the face of software errors and hardware faults?
Mapper -> Partioner -> Reducer ?
Now lets see, how we can solve our above problem using Map Reduce.
Map Reduce primary consists of 2 primary phases i.e
- Mapping Phase
- Reducing Phase
There is an intermediate phase between the Mapping and Reducing phase called the partition phase.
Map Reduce Architecture
Map Reduce Functions Pseudo code
Basic Data Structure
To be consistent across different phases, there must be a common way of representing data, and the basic data structure for the Map Reduce framework is Key => Value pair.
Map Phase
Mapping Phase consists of mapping the incoming data which is in a Key => Value pair into an intermediate Key => Value pair after applying some business logic. In the above example, the incoming data is a Key => docId, Value => Contents of document.
map: (k1,v1) -> [(k1,v1)]
In our example, we will employ the following business logic.
- Sanitize the incoming words, i.e remove all special characters.
- convert all the incoming words to lower case
- If word encountered for first time then value for that word = 1, if encountered many times, then increment value that many times. [This is an optimization]
At the end of the mapping phase, Intermediate Key => Value pairs are formed.
var Input = {};
inputHash["doc1"] = "The quick brown fox jumped in the well.";
inputHash["doc2"] = "The slow brown fox did nothing in the well";
inputHash["doc3"] = "The black fox ran near the well.";
// Map Function
function map(docId, docContents) {
var outputHash = {}
, docWords = docContents.split(" "); // split the document into array of words
for(var index in docWords) {
// get word from array
var word = docWords[index];
// Clean and sanitize word
// We don`t want ".", ",", ":" .. special chars and
// Convert all words to lower case
word = word.replace(/[^a-zA-Z 0-9]+/g,'').toLowerCase();
// This checks whether word already exists
// then increments the count
// else 1
outputHash[word] = (outputHash[word] || 0)+1;
}
return outputHash;
}
/*
OUTPUT for docId = doc1
"the" => 2;
"quick" => 1;
"brown" => 1;
"fox" => 1;
"jumped" => 1;
"in" => 1;
"well" => 1;
*/
According to our example we would have list of words with its corresponding occurrence values for that particular document passed to the mapping function. [Refer Output in above Code]
Note :
- Map Task is mandatory !
- Map Task is as fast as the slowest map task.
- Map Task across all workers need to be complete before moving on the next phase.
Partition
This phase kicks in once the Mapping phase is complete. The partitioner determines which reducer will be responsible for processing a particular key, and the execution framework uses this information to copy the data to the right location during the shuffle and sort phase. Values are grouped and sorted across workers for every Key.
Intermediate Keys are forwarded to Reduce phase.
Reduce
As the name suggests, a reducer function will reduce (based on function) from the values for a particular key.
// Reducer Function
function reduce (key, values)
// Reducer Function
for (var index in values) {
var value = values[index];
value += value; // add up values
}
return {key: value};
}
/*
OUTPUT for key fox
"fox" => 3
*/
In our example, the reducer will be given a key and its set of values, and the above function will add up all the set of values and output a value (total of all set of values) for that particular key. [Refer Output in above Code]
Note: The reduce computation cannot start until all the mappers have fin- ished emitting
Key=>Valuepairs and all of them are grouped and sorted.
Applications of Map Reduce
According to Google’s Slide
- distributed grep
- distributed sort
- web link-graph reversal
- term-vector per host
- web access log stats
- inverted index construction
- document clustering
- machine learning
- statistical machine translation
For more applications, check this post out Known applications of MapReduce
Conclusion
This was a very basic overview about Map Reduce Framework. I have not discussed any internals of the framework, but will be talking in more detail in the forthcoming posts.
References
The images were taken from Data-Intensive Text Processing with MapReduce