Skip to content

hipe-eval/HIPE-2022-baseline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HIPE-2022-baseline

This repository contains the baseline models for HIPE-2022.

transformers_baseline

Presentation

transformers_baseline contains the functions and config files to run a transformer-based baseline on on HIPE-2022 datasets. Notable functionalities include:

  • Data preparation (data_preparation.py):

    • Import HIPE-compliant tsvs, using hipe_commons.
    • Tokenizing the data and aligning the labels
    • Creating custom datasets amenable to HuggingFace's transformers (see HipeDataset)
  • Training and fine-tuning (pipeline.py)

    • Single GPU support, based on HuggingFace run_ner.py.
    • Evaluation during training using seqeval
  • Evaluation (evaluation.py)

    • Predict functions
    • Reconstruct HIPE-compliant tsv from the model's predictions
    • Evaluate tsvs using HIPE-scorer.

Usage

Make sure you have the packages listed in requirements.txt installed in a virtual environment. Then please run :

python pipeline.py

With the following args:

usage: pipeline.py [-h] [--train_path TRAIN_PATH] [--train_url TRAIN_URL] [--eval_path EVAL_PATH] [--eval_url EVAL_URL] [--output_dir OUTPUT_DIR]
                   [--hipe_script_path HIPE_SCRIPT_PATH] [--config_path CONFIG_PATH] [--labels_column LABELS_COLUMN] [--model_name_or_path MODEL_NAME_OR_PATH] [--do_train] [--do_hipe_eval] [--do_debug] [--overwrite_output_dir]
                   [--device_name DEVICE_NAME] [--epochs EPOCHS] [--seed SEED] [--batch_size BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]

options:
  -h, --help            show this help message and exit
  --train_path TRAIN_PATH
                        Absolute path to the tsv data file to train on
  --train_url TRAIN_URL
                        url to the tsv data file to train on
  --eval_path EVAL_PATH
                        Absolute path to the tsv data file to evaluate on
  --eval_url EVAL_URL   url to the tsv data file to evaluate on
  --output_dir OUTPUT_DIR
                        Absolute path to the directory in which outputs are to be stored
  --hipe_script_path HIPE_SCRIPT_PATH
                        The path the CLEF-HIPE-evaluation script. This parameter is required if `do_hipe_eval`is True
  --config_path CONFIG_PATH
                        The path to a config json file from which to extract config. Overwrites other specified config
  --labels_column LABELS_COLUMN
                        Name of the tsv col to extract labels from
  --model_name_or_path MODEL_NAME_OR_PATH
                        Absolute path to model directory or HF model name (e.g. 'bert-base-cased')
  --do_train            whether to train. Leave to false if you just want to evaluate
  --do_hipe_eval        Performs CLEF-HIPE evaluation, alone or at the end of training if `do_train`.
  --do_debug            Breaks all loops after a single iteration for debugging
  --overwrite_output_dir
                        Whether to overwrite the output dir
  --device_name DEVICE_NAME
                        Device in the format 'cuda:1', 'cpu'
  --epochs EPOCHS       Total number of training epochs to perform.
  --seed SEED           Random seed
  --batch_size BATCH_SIZE
                        Batch size per device.
  --gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                        Number of steps to accumulate before performing backpropagation.

Please note that the config can also read from a json :

python pipeline.py --config_path '/abs/path/to/my_json_conf.json'

About

Baseline models for HIPE-2022

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages