Automatic Translation of Span-Prediction Datasets - official repo

This repo contains the datasets reported in the paper, the code required to reproduce them, and the code needed to generate a new dataset in another language.

Requirements

  • Python 3.7 or higher

Installation

You might want to start by creating a new conda environment for this repo.

conda create --name <env_name> python=3.7

Install the required packages:

pip install -r requirements.txt

Download datasets

To download the larger datasets (3.1 GB total) from S3, use the following bash scripts:

bash data/squad_translated/download.sh          # for the datasets translated by us (1.9 GB)
bash data/xquad_Translated_train/download.sh    # for the datasets translated by XQuAD (1.2 GB)

Note: all datasets are in HuggingFace format (not the original SQuAD format)
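
For example, once downloaded, a file can be loaded with the HuggingFace datasets library. This is a minimal sketch: the file path is a placeholder, and the field argument assumes the examples sit under a top-level "data" key; adjust both to the actual layout of the downloaded files.

from datasets import load_dataset

# Load one of the downloaded files as a HuggingFace dataset.
ds = load_dataset(
    "json",
    data_files={"train": "data/squad_translated/<downloaded_file>.json"},
    field="data",  # assumption: examples live under a top-level "data" key
)
print(ds["train"][0])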

Reproduce results

To reproduce our results, you can run training sessions on the compared datasets. Our training runs were carried out on 2 x NVIDIA GeForce RTX 3090 devices. Training with fewer devices or less memory might require some modifications to the training recipes.

If the scripts are not launched from the root directory of the project, change the BASEPATH parameter accordingly.
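
For example, to restrict a run to a single GPU you can set CUDA_VISIBLE_DEVICES (a standard CUDA environment variable, not something specific to this repo); you may also need to lower the batch size or raise gradient accumulation in the recipe:

CUDA_VISIBLE_DEVICES=0 bash scripts/train_xquad_eval_xquad.sh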

XQuAD Translated-train results

To reproduce our results on the XQuAD Translated-train datasets (evaluation on XQuAD):

bash scripts/train_xquad_eval_xquad.sh

Paper: On the Cross-lingual Transferability of Monolingual Representations
GitHub Repository: XQuAD

  • Important: this script will run 10 consecutive training sessions and might take some time

Our translation results

To reproduce our results on the datasets generated by us (evaluation on XQuAD):

bash scripts/train_ours_eval_xquad.sh

  • Important: this script will run 10 consecutive training sessions and might take some time

Hebrew results

To reproduce our results on the datasets generated by us and on the ParaShoot dataset (evaluation on ParaShoot):

bash scripts/train_parashoot_eval_parashoot.sh
bash scripts/train_ours_eval_parashoot.sh

Paper: ParaShoot: A Hebrew Question Answering Dataset
GitHub Repository: ParaShoot

Swedish results

To reproduce our results on the datasets generated by us (evaluation on swedish_squad_dev):

bash scripts/train_ours_eval_sv_dev_proj.sh

Paper: Building a Swedish Question-Answering Model
GitHub Repository: Building a Swedish Question-Answering Model -- Datasets

  • We did not manage to reproduce the results reported in the original paper

Czech results

To reproduce our results on the datasets generated by us (evaluation on SQuAD-cs v1.1):

bash scripts/train_ours_eval_squad_cs.sh

Paper: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer
GitHub Repository: Czech-Question-Answering

  • We did not manage to reproduce the results reported in the original paper

Translating to a New Language

To translate to a new language, start by implementing a class that inherits from languages.abstract_language. Make sure to set the symbol parameter to the language's symbol in Google Translate, as in the sketch below.
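
For example, a hypothetical class for Swedish might look like the following sketch (the AbstractLanguage class name is an assumption; check src/languages/abstract_language.py for the actual interface):

from languages.abstract_language import AbstractLanguage

class Swedish(AbstractLanguage):
    # 'sv' is the language symbol Google Translate uses for Swedish
    symbol = 'sv'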

Generate base translation

Start by generating the base translation. This will take a few hours.

python ./src/translate/translate_squad_to_base.py </path/to/train-v1.1.json> <language_symbol>
python ./src/translate/translate_squad_to_base.py </path/to/dev-v1.1.json> <language_symbol>

A new file will be generated next to your train-v1.1.json, named train-v1.1_<language_symbol>_base.json (and likewise for the dev file).

Generate the matcher dataset

Next, generate the dataset that will be used to train the alignment model. We generate both a train and a validation set:

python ./src/matcher/generate_matcher_dataset.py <path/to/train-v1.1_<language_symbol>_base.json> <language_symbol> --out_dir </path/to/output_dir> --enq --num_phrases_in_sentence=10 --translated --hf
python ./src/matcher/generate_matcher_dataset.py <path/to/dev-v1.1_<language_symbol>_base.json> <language_symbol> --out_dir </path/to/output_dir> --enq --num_phrases_in_sentence=10 --translated --hf

This will generate two files in your output directory:

  • train set file: train-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json
  • dev set file: dev-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json

The generated files will be in the HuggingFace QA dataset format (ready for training with the transformers library)
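
For reference, a single example in this format looks roughly as follows (a sketch of the standard SQuAD-style HuggingFace QA schema with illustrative values, not taken from the actual generated files):

example = {
    "id": "example-0001",
    "title": "Some_Article",
    "context": "A translated paragraph containing the answer span ...",
    "question": "A translated question?",
    "answers": {"text": ["answer span"], "answer_start": [38]},
}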

Train the Alignment model

Next, we will train the alignment model. Note that this phase should preferably be carried out on a machine with a GPU.

bash ./scripts/train_matcher.sh <path/to/train-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json> <path/to/dev-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json> <language_symbol>

The results of the training will be saved in ./matcher_exp/train_matcher_<language_symbol>

Translate and Align the final dataset

Finally, we will use the trained alignment model to align the results from the base files:

python ./src/translate/translate_from_base.py <path/to/train-v1.1_<language_symbol>_base.json> <language_symbol> <./matcher_exp/train_matcher_<language_symbol>> --from_en
python ./src/translate/translate_from_base.py <path/to/dev-v1.1_<language_symbol>_base.json> <language_symbol> <./matcher_exp/train_matcher_<language_symbol>> --from_en

The two files of your new dataset will be generated.
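
As a quick sanity check on the generated files, you can verify that every aligned answer span actually occurs at its recorded offset. This is a sketch that assumes the standard HuggingFace QA schema shown above; adapt the loading code to the actual file layout.

def count_misaligned(examples):
    """Count answers whose text does not appear at its answer_start offset."""
    bad = 0
    for ex in examples:
        answers = ex["answers"]
        for text, start in zip(answers["text"], answers["answer_start"]):
            if ex["context"][start:start + len(text)] != text:
                bad += 1
    return bad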
