MapReduce

MapReduce passes data between stages as <key, value> pairs. Because the shuffle phase sorts records by key, each change from the previous key to a new key marks the boundary of a group; at that point the reducer emits the cumulative value it has accumulated for the key.

inputFile ->
    map()     <k_origin, v_origin>
    combine() <k_next, v_next>
    reduce()  <k_final, v_final>
-> outputFile
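The map step of the flow above can be sketched as a small Python function. The exact schema of purchases.txt is not shown in this README, so the six-field tab-separated layout (date, time, store, category, cost, payment) and the field indices below are assumptions:

```python
import sys

def map_line(line):
    """Emit a (category, cost) pair for one tab-separated purchase record.

    Assumes the layout: date\ttime\tstore\tcategory\tcost\tpayment
    (a guess at purchases.txt's schema). Returns None for malformed rows.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return None  # skip records that don't match the expected layout
    category, cost = fields[3], fields[4]
    return category, cost

# In mapper.py this would be driven by the streaming input, e.g.:
#     for line in sys.stdin:
#         kv = map_line(line)
#         if kv:
#             print(f"{kv[0]}\t{kv[1]}")
```

Hadoop's shuffle then sorts these key-tab-value lines so that all records for one category arrive at the reducer consecutively.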

Unzip purchases.txt and use it as an input file for the mapper.

I ran into permission issues that took time to figure out and that would have required core security changes to the .ssh folder path permissions:



So I ran the scripts locally in the Hadoop directory. Instead of feeding the data file to mapper.py through sys.stdin, I had to open the input file within the script and write the results to a file as output.
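That file-based workaround can be sketched as below. The file names, and the assumption that the category sits in column 4 and the cost in column 5 of a six-field tab-separated record, are illustrative guesses, not the repo's actual mapper.py:

```python
def map_file(in_path, out_path):
    """File-based variant of mapper.py: read the input file directly and
    write "key\tvalue" lines, instead of using sys.stdin / sys.stdout.

    Assumes six tab-separated fields per record, with the category in
    column 4 and the cost in column 5 (hypothetical schema).
    """
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 6:  # skip malformed rows
                outfile.write(f"{fields[3]}\t{fields[4]}\n")

# Usage (paths are hypothetical):
#     map_file("purchases.txt", "mapper_output.txt")
```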

hadoop jar hadoop-streaming-2.3.0-cdh5.1.0.jar \
    -input myinput \
    -output joboutput \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py

OR

cat purchases.txt | python mapper.py | sort | python reducer.py > joboutput

This performs the same mapping, sorting, and reducing when running python mapper.py and python reducer.py; the difference is that instead of Hadoop using the key to route records to reducers, the scripts store the intermediate results separately and act on them directly.
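The reducer side of that pipeline relies only on the input being sorted by key: it sums values for the current key and emits a total each time the key changes. A minimal sketch (reduce_stream is a hypothetical name, not necessarily the repo's reducer.py):

```python
import sys

def reduce_stream(lines):
    """Sum values per key from sorted "key\tvalue" lines.

    Because the stream is sorted, a key change means the previous
    group is complete, so its total can be emitted immediately.
    """
    current_key, total = None, 0.0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield current_key, total  # flush the finished group
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        yield current_key, total  # flush the last group

# In reducer.py this would be driven by the streaming input, e.g.:
#     for key, total in reduce_stream(sys.stdin):
#         print(f"{key}\t{total:.2f}")
```

Note that the plain sort between the two scripts stands in for Hadoop's shuffle phase; without it, a key's records would not be contiguous and the totals would be wrong.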

  • Check out the output results in joboutput for this sample example, which finds the total sales for Toys and Consumer Electronics:

Toys Total = 57463477.11
Consumer Electronics Total = 57452374.13

