This repository has been archived by the owner on May 28, 2021. It is now read-only.

The third best solution for PAN@CLEF 2020 Author Profiling Competition.

franbvalero/clef-2020-author-profiling

Logistic regression with TF-IDF features

License: MIT

This is an open-source implementation of our solution to the PAN@CLEF 2020 Author Profiling competition.

Our approach uses a logistic regression model with TF-IDF features built from character and word n-grams.
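
To illustrate the kind of features described above, here is a minimal standard-library sketch of character and word n-gram extraction followed by TF-IDF weighting. It is not the code from this repository: the helper names (char_ngrams, word_ngrams, tfidf), the toy documents, the n-gram sizes, and the smoothed IDF formula are all assumptions chosen for the example. In practice, scikit-learn's TfidfVectorizer and LogisticRegression would typically handle these steps.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams of `text` (lowercased, whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs_terms):
    """Weight each term by raw count times a smoothed IDF.

    Assumed IDF (for illustration): log((1 + N) / (1 + df)) + 1.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs_terms)
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))  # count each term once per document
    vectors = []
    for terms in docs_terms:
        tf = Counter(terms)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


# Hypothetical toy documents, not taken from the competition corpus.
docs = ["breaking news you will not believe", "the weather is nice today"]

# Combine character trigrams and word unigrams; the prefixes keep the
# two feature spaces from colliding.
features = [
    ["c:" + g for g in char_ngrams(d, 3)] + ["w:" + g for g in word_ngrams(d, 1)]
    for d in docs
]
vectors = tfidf(features)
```

Each resulting dictionary is a sparse feature vector for one document; a logistic regression classifier would then be trained on such vectors.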

Dependencies

  • Python 3.7
  • The following packages, installed with pip:
pip install hyperopt
pip install joblib
pip install scikit-learn
pip install nltk
pip install tweet-preprocessor

Results

Our approach ranked third on the private test set. The accuracy for each language is shown in the table below:

LANG ACC
ES 0.78
EN 0.73

Our team name is deborjavalero20. The full ranking is available at this link.

Usage

The commands below show how to replicate the experiments.

The train.py script trains the Spanish and English models using the corpus located at DATA_DIR and stores the trained models in RESOURCES_DIR.

python3 train.py DATA_DIR RESOURCES_DIR

The test.py script generates the Spanish and English hypotheses. The argument DATA_DIR is the folder containing the input data, and the argument HYPOTHESIS_DIR is the directory where the hypotheses are stored.

python3 test.py -c DATA_DIR -o HYPOTHESIS_DIR
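
As a rough sketch of how the -c and -o flags in the command above could be parsed, the example below uses Python's argparse module. Only the two flags come from the usage line; the function name build_parser, the destination names data_dir and hypothesis_dir, and the help texts are assumptions, and the actual test.py may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser that accepts the -c and -o flags from the usage line."""
    parser = argparse.ArgumentParser(
        description="Generate hypotheses for the input data."
    )
    # The dest names below are illustrative, not taken from the repository.
    parser.add_argument("-c", dest="data_dir", required=True,
                        help="folder containing the input data")
    parser.add_argument("-o", dest="hypothesis_dir", required=True,
                        help="directory where the hypotheses are stored")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-c", "DATA_DIR", "-o", "HYPOTHESIS_DIR"])
print(args.data_dir, args.hypothesis_dir)
```

Marking both flags as required makes the script fail early with a usage message if either directory is missing.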

License

The MIT License (MIT)

Copyright (c) 2020 Francisco de Borja Valero
