Automatic Translation of Span-Prediction Datasets - official repo

This repo contains the datasets reported in the paper, the code required to reproduce them, and the code needed to generate a new dataset in another language.

Requirements

  • Python 3.7 or higher

Installation

You might want to start by creating a new conda environment for this repo.

conda create --name <env_name> python=3.7

Install the required packages:

pip install -r requirements.txt

Download datasets

To download the larger datasets (3.1 GB total) from S3, use the following bash scripts:

bash data/squad_translated/download.sh          # for the datasets translated by us (1.9 GB)
bash data/xquad_Translated_train/download.sh    # for the datasets translated by XQuAD (1.2 GB)

Note: all datasets are in HuggingFace format (not the original SQuAD format)
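
For example, once downloaded, a file can be loaded with the HuggingFace datasets library. This is a minimal sketch: the file path is a placeholder, and the field argument assumes the examples sit under a top-level "data" key; adjust both to the actual layout of the downloaded files.

from datasets import load_dataset

# Load one of the downloaded files as a HuggingFace dataset.
ds = load_dataset(
    "json",
    data_files={"train": "data/squad_translated/<downloaded_file>.json"},
    field="data",  # assumption: examples live under a top-level "data" key
)
print(ds["train"][0])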

Reproduce results

To reproduce our results, you can run training sessions on the compared datasets. Our training runs were carried out on 2 x NVIDIA GeForce RTX 3090 devices. Training with fewer devices or less memory might require some modifications to the training recipes.

If the scripts are not launched from the root directory of the project, change the BASEPATH parameter accordingly.
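
For example, to restrict a run to a single GPU you can set CUDA_VISIBLE_DEVICES (a standard CUDA environment variable, not something specific to this repo); you may also need to lower the batch size or raise gradient accumulation in the recipe:

CUDA_VISIBLE_DEVICES=0 bash scripts/train_xquad_eval_xquad.sh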

XQuAD Translated-train results

To reproduce our results on the XQuAD Translated-train datasets (evaluation on XQuAD):

bash scripts/train_xquad_eval_xquad.sh

Paper: On the Cross-lingual Transferability of Monolingual Representations
GitHub Repository: XQuAD

  • Important: this script will run 10 consecutive training sessions and might take some time

Our translation results

To reproduce our results on the datasets generated by us (evaluation on XQuAD):

bash scripts/train_ours_eval_xquad.sh

  • Important: this script will run 10 consecutive training sessions and might take some time

Hebrew results

To reproduce our results on the datasets generated by us and on the ParaShoot dataset (evaluation on ParaShoot):

bash scripts/train_parashoot_eval_parashoot.sh
bash scripts/train_ours_eval_parashoot.sh

Paper: ParaShoot: A Hebrew Question Answering Dataset
GitHub Repository: ParaShoot

Swedish results

To reproduce our results on the datasets generated by us (evaluation on swedish_squad_dev):

bash scripts/train_ours_eval_sv_dev_proj.sh

Paper: Building a Swedish Question-Answering Model
GitHub Repository: Building a Swedish Question-Answering Model -- Datasets

  • We did not manage to reproduce the results reported in the original paper

Czech results

To reproduce our results on the datasets generated by us (evaluation on SQuAD-cs v1.1):

bash scripts/train_ours_eval_squad_cs.sh

Paper: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer
GitHub Repository: Czech-Question-Answering

  • We did not manage to reproduce the results reported in the original paper

Translating to a New Language

To translate to a new language, start by implementing a class that inherits from languages.abstract_language. Make sure to set the symbol parameter to the language's symbol in Google Translate, as in the sketch below.
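
For example, a hypothetical class for Swedish might look like the following sketch (the AbstractLanguage class name is an assumption; check src/languages/abstract_language.py for the actual interface):

from languages.abstract_language import AbstractLanguage

class Swedish(AbstractLanguage):
    # 'sv' is the language symbol Google Translate uses for Swedish
    symbol = 'sv'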

Generate base translation

Start by generating the base translation. This will take a few hours.

python ./src/translate/translate_squad_to_base.py </path/to/train-v1.1.json> <language_symbol>
python ./src/translate/translate_squad_to_base.py </path/to/dev-v1.1.json> <language_symbol>

A new file will be generated next to your train-v1.1.json, named train-v1.1_<language_symbol>_base.json (and likewise for the dev file).

Generate the matcher dataset

Next, generate the dataset that will be used to train the alignment model. We generate both a train and a validation set:

python ./src/matcher/generate_matcher_dataset.py <path/to/train-v1.1_<language_symbol>_base.json> <language_symbol> --out_dir </path/to/output_dir> --enq --num_phrases_in_sentence=10 --translated --hf
python ./src/matcher/generate_matcher_dataset.py <path/to/dev-v1.1_<language_symbol>_base.json> <language_symbol> --out_dir </path/to/output_dir> --enq --num_phrases_in_sentence=10 --translated --hf

This will generate two files in your output directory:

  • train set file: train-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json
  • dev set file: dev-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json

The generated files will be in the HuggingFace QA dataset format (ready for training with the transformers library)
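
For reference, a single example in this format looks roughly as follows (a sketch of the standard SQuAD-style HuggingFace QA schema with illustrative values, not taken from the actual generated files):

example = {
    "id": "example-0001",
    "title": "Some_Article",
    "context": "A translated paragraph containing the answer span ...",
    "question": "A translated question?",
    "answers": {"text": ["answer span"], "answer_start": [38]},
}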

Train the Alignment model

Next, we will train the alignment model. Note that this phase should preferably be carried out on a machine with a GPU.

bash ./scripts/train_matcher.sh <path/to/train-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json> <path/to/dev-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json> <language_symbol>

The results of the training will be saved in ./matcher_exp/train_matcher_<language_symbol>

Translate and Align the final dataset

Finally, we will use the trained alignment model to align the results from the base files:

python ./src/translate/translate_from_base.py <path/to/train-v1.1_<language_symbol>_base.json> <language_symbol> <./matcher_exp/train_matcher_<language_symbol>> --from_en
python ./src/translate/translate_from_base.py <path/to/dev-v1.1_<language_symbol>_base.json> <language_symbol> <./matcher_exp/train_matcher_<language_symbol>> --from_en

The two files of your new dataset will be generated.
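
As a quick sanity check on the generated files, you can verify that every aligned answer span actually occurs at its recorded offset. This is a sketch that assumes the standard HuggingFace QA schema shown above; adapt the loading code to the actual file layout.

def count_misaligned(examples):
    """Count answers whose text does not appear at its answer_start offset."""
    bad = 0
    for ex in examples:
        answers = ex["answers"]
        for text, start in zip(answers["text"], answers["answer_start"]):
            if ex["context"][start:start + len(text)] != text:
                bad += 1
    return bad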
