This is an open-source implementation of our solution to the PAN@CLEF 2020 Author Profiling competition.
Our approach uses a logistic regression model with character and word n-gram TF-IDF features.
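As a rough illustration of the idea (not the exact code in this repository), a pipeline of this kind can be sketched with scikit-learn. The n-gram ranges, the `max_iter` value, the `build_model` helper name, and the toy documents and labels are all assumptions made for the example; the actual settings are defined in train.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_model():
    """Build a TF-IDF (char + word n-grams) logistic regression pipeline."""
    features = FeatureUnion([
        # Character n-grams; the (2, 5) range is an illustrative assumption.
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 5))),
        # Word n-grams; the (1, 2) range is an illustrative assumption.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])
    return Pipeline([
        ("tfidf", features),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy documents and binary labels, invented purely for illustration.
docs = [
    "breaking shocking news you will not believe",
    "shocking secret they do not want you to know",
    "the council approved the budget on tuesday",
    "the ministry published the annual report today",
]
labels = [1, 1, 0, 0]

model = build_model()
model.fit(docs, labels)
predictions = model.predict(["shocking news they do not want you to believe"])
```

`FeatureUnion` concatenates the two sparse TF-IDF matrices column-wise, so the classifier sees character-level and word-level features side by side.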
- Python 3.7
- The following packages, installed with pip:
pip install hyperopt
pip install joblib
pip install scikit-learn
pip install nltk
pip install tweet-preprocessor
Our approach ranked third on the private test set. The results are shown in the table below:
Language | Accuracy |
---|---|
ES | 0.78 |
EN | 0.73 |
Our team is deborjavalero20. You can check the full ranking in this link.
The commands below show how to replicate the experiments.
The train.py script trains the Spanish and English models using the corpus located at DATA_DIR and stores the trained models in RESOURCES_DIR.
python3 train.py DATA_DIR RESOURCES_DIR
The test.py script generates the Spanish and English hypotheses. The DATA_DIR argument is the folder containing the input data, and the HYPOTHESIS_DIR argument is the directory where the generated hypotheses are stored.
python3 test.py -c DATA_DIR -o HYPOTHESIS_DIR
The MIT License (MIT)
Copyright (c) 2020 Francisco de Borja Valero