1. MapReduce Overview

Context: multi-hour computations on multi-terabyte data-sets
Overall goal: easy for non-specialist programmers

Programmers just define Map and Reduce functions; MR manages and hides all aspects of distribution

2. Abstract View

Input1 -> Map -> a,1 b,1
Input2 -> Map ->     b,1
Input3 -> Map -> a,1     c,1
                  |   |   |
                  |   |   -> Reduce -> c,1
                  |   -----> Reduce -> b,2
                  ---------> Reduce -> a,2

Input is (already) split into M files
MR calls Map() for each input file; each call produces a set of <k2,v2> pairs
- "intermediate" data
- each Map() call is a "task"
When Maps are done:
1. MR gathers all intermediate v2's for each k2
2. MR passes each key and its values to a Reduce call
Final output is set of <k2,v3> pairs from Reduce()s
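
The flow above can be sketched as a single-process word count that reproduces the diagram. This is only an illustration: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, and a real MR runs the tasks on many servers rather than in one loop.

```python
from collections import defaultdict

def map_fn(filename, contents):
    """Map(k1, v1) -> list of (k2, v2): emit (word, 1) for every word."""
    return [(word, 1) for word in contents.split()]

def reduce_fn(key, values):
    """Reduce(k2, list of v2) -> v3: total count for one word."""
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: one Map() call ("task") per input file.
    intermediate = []
    for filename, contents in inputs.items():
        intermediate.extend(map_fn(filename, contents))

    # After all Maps are done, gather every v2 for a given k2.
    grouped = defaultdict(list)
    for k2, v2 in intermediate:
        grouped[k2].append(v2)

    # Reduce phase: one Reduce() call per distinct key.
    return {k2: reduce_fn(k2, v2s) for k2, v2s in grouped.items()}

# Same three inputs as the diagram (file contents are made up to match it).
inputs = {"Input1": "a b", "Input2": "b", "Input3": "a c"}
result = run_mapreduce(inputs, map_fn, reduce_fn)
print(sorted(result.items()))  # [('a', 2), ('b', 2), ('c', 1)]
```

The programmer writes only map_fn and reduce_fn; everything inside run_mapreduce stands in for what MR does on the programmer's behalf.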

3. Details

3.1 Hidden Details

Sending app code to servers
Tracking which tasks have finished
"Shuffling" intermediate data from Maps to Reduces
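
As an illustration of the shuffle step, one common approach is to assign each intermediate key to a reduce task by hashing it, so that every value for a given key ends up at the same Reduce. The sketch below assumes R reduce tasks and uses CRC32 as the hash; both are choices made for this example, and partition and shuffle are hypothetical names, not part of MR's interface.

```python
import zlib
from collections import defaultdict

def partition(key, num_reduces):
    """Pick the reduce task for a key (assumed scheme: hash(key) mod R)."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reduces

def shuffle(map_outputs, num_reduces):
    """Route each Map task's (k2, v2) pairs into per-reduce buckets."""
    buckets = [defaultdict(list) for _ in range(num_reduces)]
    for pairs in map_outputs:  # one list of pairs per Map task
        for k2, v2 in pairs:
            buckets[partition(k2, num_reduces)][k2].append(v2)
    return buckets

# Intermediate data from the three Map tasks in the diagram.
map_outputs = [[("a", 1), ("b", 1)], [("b", 1)], [("a", 1), ("c", 1)]]
buckets = shuffle(map_outputs, num_reduces=2)
for r, bucket in enumerate(buckets):
    print(r, dict(bucket))
```

Whatever hash is used, the property that matters is that a given key always maps to the same bucket, so one Reduce call sees all of that key's values.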