Sunday, March 18, 2012

MapReduce For Dummies

I dumbed down the MapReduce functional programing model (for my required purposes) to a critically small set of resources and summary page.

Google Code University's MapReduce

Hadoop MapReduce Tutorial

Download the PDF here (much better)  that contains the diagrams, txt only portion follows.
MapReduce for Dummies

--------------------

Map Reduce for DUMMIES!

Map or f1(arg)->list: Takes single K,V pair and outputs a list with single K,V pair, multiple KV pairs, or none.  Input Key and Output Key need not be the same and are generally not!

Combine: If reduction (f2) is commutative and associative, i.e. doesn’t care which order it sees thing in. Generally this means result is same type as inputs which are also the same type (see f2 below) or a combine f(x) is built to handle differences. Do this if you have multiple identical Key here (can speed up reduction).

Sort:  Same K value are sent to the same reduce process, thus requiring K reduce processes running in parallel.  Done by mapper so this is being done distributed fashion.

Group: Key, Value lists! Keys from Sort with list of Values  (reduction is done per key)!

Reduce or f2(newValue, oldValue)->  nextOldValue, due to Group step, reduction is done per Key. Output is a Key and the accumulated values which are then passed back to be combined.


Credit (for everything correct) goes to UC Berkley’s Brian Harvey, PhD. for his excellent CS undergraduate lecture series.

http://www.cs.berkeley.edu/~bh/

------------------

You may require to logically join tables in Hadoop, how to do it?

MapReduce for Table Joins

A =Topic / B =DocEntityID / C = EntryDate / R = PLSA / S = MD

Join two relational sets with common key
R(A,B) x S(B,C)

Combine tuples from R & S that have same B value.

Map Process: 1
R tuple (a,b) -> (b,{a R})
(001,1234AX)   ->   (1234AX, {001,PLSA})
(002,1234AX)   ->   (1234AX, {002, PLSA})
(003,1234AX)   ->   (1234AX, {003,PLSA})
(001, 3279VC)  ->   (3279VC, {001,PLSA})
…        
Map Process: 2
S tuple (b,c)  -> (b,{c,S})
(1234AX, 20120316)  -> (1234AX, {20120316,MD})
(3279VC, 20120120)  -> (3279VC, {20120120,MD})
Sort Process for Map1:
1234AX,
            {001,PLSA}
            {002,PLSA}
            {003,PLSA}
              …
3279VC,
{001,PLSA}
               ...
Sort Process  for Map2:
1234AX, {20120316,MD}
3279VC,  {20120120,MD}
        … ,          …

Group1:
1234AX:
            {001,PLSA}
            {002,PLSA}
            {003,PLSA}
            {20120316,MD}

Reduce Process1:
 Store every value  under this key Key,List<Value>
   
Group2:
3279VC:
{001,PLSA}
{002,PLSA}
{003,PLSA}
{20120120,MD}

Reduce Process 2:
Store every value under this key Key,List<Value>


Credit (for anything correct) goes to Stanford’s Jeffery Ullman and Foto N. Afrati and their brilliant paper along with all credits and acknowledgements contained therein, “Optimzing Joins in a MapReduce Environment”.

No comments:

Post a Comment