TF-IDF-Map-Reduce

Overview

This project computes the TF-IDF (Term Frequency-Inverse Document Frequency) score of every word in every document of a collection. It provides two MapReduce-style implementations of that computation:

MapReduce in Python with Hadoop Streaming
Spark
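
For reference, the score itself can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's code: it assumes the common definition tf(w, d) = (occurrences of w in d) / (number of words in d) and idf(w) = ln(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. The function name `tf_idf`, the whitespace tokenization, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return {doc_name: {word: score}} for a dict of doc_name -> text.

    Assumed variant: tf = count / words in document, idf = ln(N / df).
    """
    # Tokenize naively on whitespace after lowercasing (an assumption).
    tokens = {name: text.lower().split() for name, text in documents.items()}
    n_docs = len(tokens)

    # df[w] = number of documents that contain w at least once.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return scores


docs = {
    "doc1.txt": "the cat sat",
    "doc2.txt": "the dog sat down",
}
result = tf_idf(docs)
```

Under this definition, a word that appears in every document (here `the` and `sat`) scores 0, while a word found in only one document receives the largest weight.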

The aim is to compare the performance of the two implementations experimentally. To do this, we run a scalability test, which measures how well each implementation copes as the volume of input data grows by orders of magnitude.
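
A scalability test of this kind could be driven by a small timing harness such as the one below. It is only a sketch under stated assumptions: `measure_scalability` and `dummy_workload` are hypothetical names, and the dummy workload stands in for launching one of the two implementations on a given number of documents.

```python
import time


def measure_scalability(run, sizes):
    """Time run(size) for each size and return a list of (size, seconds)."""
    timings = []
    for size in sizes:
        start = time.perf_counter()
        run(size)
        timings.append((size, time.perf_counter() - start))
    return timings


def dummy_workload(n_documents):
    # Hypothetical stand-in for running one implementation on n_documents files.
    return sum(len(str(i)) for i in range(n_documents * 1000))


timings = measure_scalability(dummy_workload, sizes=[1, 2, 4, 8])
for size, seconds in timings:
    print("{} documents: {:.4f} s".format(size, seconds))
```

Running the same harness once per implementation, with the same list of sizes, gives two timing curves that can be compared directly.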

Prerequisites

Python 3.5
pyspark

Usage

The script file_generation.py automatically creates .txt files in the chosen language and of different sizes. From the folder that contains the script, run the following command in a terminal:

python file_generation.py language size number_documents

The language can be "Latin", "French" or "english"; the size can be "1" or "2". The study uses 5 documents, but you can create as many as you want.

To run the three MapReduce steps, enter the following commands in the terminal:

chmod +x StepOneMapper.py StepOneReducer.py
chmod +x StepTwoMapper.py StepTwoReducer.py
chmod +x StepThreeMapper.py StepThreeReducer.py
cat test1.txt | ./StepOneMapper.py | sort -k 1,1 | ./StepOneReducer.py > StepOneResult.txt
cat StepOneResult.txt | ./StepTwoMapper.py | sort -k 1,1 | ./StepTwoReducer.py > StepTwoResult.txt
cat StepTwoResult.txt | ./StepThreeMapper.py | sort -k 1,1 | ./StepThreeReducer.py
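
The mapper, sort, reducer pattern used in the commands above can be imitated in a single Python file. The sketch below is an assumption about what a first step might look like (counting the occurrences of each word per document); it is not the repository's StepOne code. The names `mapper` and `reducer` and the tab-separated `word@document` key format are hypothetical.

```python
from itertools import groupby


def mapper(doc_name, lines):
    """Emit one 'word@doc<TAB>1' record per word occurrence."""
    for line in lines:
        for word in line.lower().split():
            yield "{}@{}\t1".format(word, doc_name)


def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same key."""
    pairs = (record.split("\t") for record in sorted_records)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "{}\t{}".format(key, sum(int(count) for _, count in group))


# sorted() plays the role of `sort -k 1,1` between the two scripts.
records = sorted(mapper("test1.txt", ["the cat saw the dog", "the end"]))
output = list(reducer(records))
for line in output:
    print(line)
```

The reducer relies on its input being sorted by key, which is why the shell pipeline places `sort` between each mapper and reducer.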

To run the Spark implementation, enter the following command:

python  tfidf-spark.py

qzhao19/TF-IDF-Map-Reduce
