agesmundo/HadoopPerceptron

Perceptron training, prediction, and evaluation for Hadoop
reference for the implementation: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36266.pdf

===========================================

1) DOWNLOAD AND COMPILE:

$ git clone git://github.com/agesmundo/HadoopPerceptron.git
$ cd HadoopPerceptron
$ make

# optional:
# before invoking "$ make" you may want to edit the Makefile
# and set the HADOOP_CORE variable to point to the hadoop-core-VERSION.jar that matches your cluster version.


===========================================

2) RUN TRAIN:

# usage:
$ hadoop jar jars/HP.jar Train -i <input_folder> -o <output_folder_prefix> [options]
# to display the available options:
$ hadoop jar jars/HP.jar Train -help

# see details of INPUT FILES FORMAT in the section below
# the output of each train iteration can be found in "<output_folder_prefix>_<iteration_id>" where <iteration_id> starts from 1
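# the naming scheme above can be sketched in a few lines of Python.
# this is only an illustration: the helper name and its arguments are hypothetical
# and are not part of HP.jar; the iteration count is whatever the training run used.

```python
def iteration_output_folders(prefix, num_iterations):
    """Return the per-iteration output folder names, numbered from 1.

    Hypothetical helper: it only mirrors the "<output_folder_prefix>_<iteration_id>"
    naming described in this README.
    """
    return [f"{prefix}_{i}" for i in range(1, num_iterations + 1)]


if __name__ == "__main__":
    folders = iteration_output_folders("/test/train_out", 2)
    # The last entry is the folder holding the final parameters.
    print(folders[-1])
```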


RUN TRAIN TEST:

# load the test files into the Hadoop DFS with "$ make put" or:
$ hadoop fs -put test/ /

# run the training on the sample train set with "$ make train" or:
$ hadoop jar jars/HP.jar Train -i /test/train_folder -o /test/train_out -N 2 
# the final parameters of the model will be stored in /test/train_out_2
# to display the learnt parameters:
$ hadoop fs -cat /test/train_out_2/*
# for more details on the format of the parameter files see the WEIGHT FILE FORMAT section below

============================================


3) RUN PREDICTION:

# usage:
$ hadoop jar jars/HP.jar Predict -i <input_folder> -o <output_folder> -p <parameters_folder> [options]
# to display the available options:
$ hadoop jar jars/HP.jar Predict -help


RUN PREDICTION TEST:

# run the prediction on the sample test set using the parameters generated by the TRAIN TEST with "$ make predict" or:
$ hadoop jar jars/HP.jar Predict -i /test/test_folder -o /test/predict_out -p /test/train_out_2
# you can see the generated labelled text with:
$ hadoop fs -cat /test/predict_out/*
# for more details on the format of the generated labelled text files see the PREDICTION OUTPUT FILES FORMAT section below


===========================================


4) RUN EVALUATION:

# usage:
$ hadoop jar jars/HP.jar Evaluate -i <input_folder> -o <output_folder> -p <parameters_folder> [options]
# to display the available options:
$ hadoop jar jars/HP.jar Evaluate -help


RUN EVALUATION TEST:

# run the evaluation on the sample set using parameters generated by the TRAIN TEST with "$ make evaluate" or:
$ hadoop jar jars/HP.jar Evaluate -i /test/train_folder -o /test/evaluate_out -p /test/train_out_2
# you can see the generated accuracy counts with:
$ hadoop fs -cat /test/evaluate_out/*
# for more details on the format of the accuracy files see the EVALUATION OUTPUT FILES FORMAT section below

===========================================

INPUT FILES FORMAT
one sentence per line
words divided by spaces
input for Train and Evaluate needs gold labels
gold labels are appended to words
words and gold labels are separated by a special string: defaultLabelSeparator = "_" defined in Sentence.java
make sure this special string is not used inside words or labels
(likewise, defaultFeatureSeparator = "|" defined in Perceptron.java must not appear in either the text or the labels)
input file with gold labels example:
--------------------
he_PRON and_CONJ young_ADJ Mary_NOUN walk_VERB
John_NOUN is_VERB working_VERB
--------------------
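
a minimal Python sketch of how one gold-labelled line can be split into (word, label) pairs.
it is not the code used by Sentence.java; the function name is hypothetical, and the check
for exactly one separator per token is an assumption that follows from the rule above
(the separator must not occur inside words or labels).

```python
LABEL_SEPARATOR = "_"  # mirrors defaultLabelSeparator as described in this README


def parse_labelled_sentence(line, sep=LABEL_SEPARATOR):
    """Split one input line into a list of (word, gold_label) pairs."""
    pairs = []
    for token in line.split():
        # A valid token contains the separator exactly once: word + sep + label.
        if token.count(sep) != 1:
            raise ValueError(f"expected exactly one {sep!r} in token {token!r}")
        word, label = token.split(sep)
        pairs.append((word, label))
    return pairs


if __name__ == "__main__":
    for word, label in parse_labelled_sentence("John_NOUN is_VERB working_VERB"):
        print(word, label)
```

rejecting tokens with zero or several separators makes a badly formatted corpus fail early instead of producing silently wrong labels.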


===========================================

PREDICTION OUTPUT FILES FORMAT
output from prediction has one sentence per line :
<input_sentence>\t|||\t<label_sequence>
example:
--------------------
he and young Mary walk	|||	PRON CONJ ADJ NOUN VERB
John is working	|||	NOUN VERB VERB
--------------------
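
a small Python sketch for reading one prediction line back into (word, label) pairs.
the function name is hypothetical, and the check that words and labels have the same count
is an assumption based on the example above, not something stated by the tool.

```python
def parse_prediction_line(line):
    """Parse '<input_sentence>\\t|||\\t<label_sequence>' into (word, label) pairs."""
    # Exactly three tab-separated fields are expected; anything else raises ValueError.
    sentence, marker, labels = line.rstrip("\n").split("\t")
    if marker != "|||":
        raise ValueError(f"unexpected field separator {marker!r}")
    words = sentence.split()
    tags = labels.split()
    # Assumption: one predicted label per input word, as in the README example.
    if len(words) != len(tags):
        raise ValueError("number of words and labels differ")
    return list(zip(words, tags))


if __name__ == "__main__":
    print(parse_prediction_line("John is working\t|||\tNOUN VERB VERB"))
```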


===========================================

EVALUATION OUTPUT FILES FORMAT
evaluation outputs a file with two lines:
--------------------
true	<number_of_correct_classifications>
false	<number_of_wrong_classifications>
--------------------
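
the two counts are enough to compute the accuracy as true / (true + false).
the Python sketch below shows one way to do it; the function name is hypothetical
and the counts in the usage example are made up for illustration, not real results.

```python
def accuracy_from_evaluation(text):
    """Compute accuracy from the two-line 'true\\t<n>' / 'false\\t<n>' output."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        key, value = line.split("\t")
        counts[key] = int(value)
    total = counts["true"] + counts["false"]
    # Return 0.0 for an empty evaluation set rather than dividing by zero.
    return counts["true"] / total if total else 0.0


if __name__ == "__main__":
    # Illustrative counts only.
    print(accuracy_from_evaluation("true\t6\nfalse\t2\n"))
```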


===========================================

WEIGHT FILE FORMAT
the weight file stores the perceptron weight vector
one weight per line
format:
<candidate_label>|<feature_string>	<weight>

weight file example:
--------------------
VERB|previous_label=NOUN	0.25
CONJ|length_cur_5	-0.25
ADJ|pre_wa	-0.25
--------------------
see Features.java for more details
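
a minimal Python sketch for loading weight lines into a dictionary keyed by (candidate_label, feature_string).
the function names are hypothetical and this is not how the Java code reads the file;
splitting on the first "|" only is an assumption that relies on the label never containing "|".

```python
FEATURE_SEPARATOR = "|"  # mirrors defaultFeatureSeparator as described in this README


def parse_weight_line(line):
    """Parse '<candidate_label>|<feature_string>\\t<weight>' into ((label, feature), weight)."""
    key, weight = line.rstrip("\n").split("\t")
    # Split on the first separator only, so the feature string is kept whole.
    label, feature = key.split(FEATURE_SEPARATOR, 1)
    return (label, feature), float(weight)


def load_weights(lines):
    """Build a {(label, feature): weight} dictionary from an iterable of lines."""
    return dict(parse_weight_line(line) for line in lines if line.strip())


if __name__ == "__main__":
    sample = ["VERB|previous_label=NOUN\t0.25", "CONJ|length_cur_5\t-0.25"]
    weights = load_weights(sample)
    print(weights[("VERB", "previous_label=NOUN")])
```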