This repo contains the code for the paper AfriTeVa: Extending “Small Data” Pretraining Approaches to Sequence-to-Sequence Models. AfriTeVa is a sequence-to-sequence transformer model pretrained on ten African languages:
Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)
Models:
We release the following pretrained models:
- AfriTeVa Small (64M params)
- AfriTeVa Base (229M params)
- AfriTeVa Large (745M params)
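The sketch below shows one way to load a released checkpoint with Hugging Face `transformers`. The `castorini/afriteva_base` model ID is an assumption (verify the exact identifiers on the Hub); this is a minimal illustration, not part of the released scripts.

```python
# Minimal sketch: loading an AfriTeVa checkpoint via Hugging Face transformers.
# The "castorini/afriteva_base" ID is an assumption; check the Hub for the
# exact names of the released Small/Base/Large checkpoints.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "castorini/afriteva_base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Generate from a simple input; real use would apply a task-specific
# prefix or fine-tuning before generation.
inputs = tokenizer("Eku aaro", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```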
Datasets:
- Language Modelling: the data for language modelling can be downloaded from this URL
- Machine Translation: the machine translation dataset can be downloaded from this repository
- Text Classification: the topic classification dataset can be downloaded from this repository
We trained a SentencePiece Unigram tokenizer for AfriTeVa, which can be downloaded here. To train a custom tokenizer instead, run the command below with the following arguments:
- data_path: Path to your training file(s)
- vocab_size: Size of your learned vocabulary (number of tokens)
- output_path: Path to store learned tokenizer files
```bash
(virtual_env)$ bash learn_subword.sh ${data_path} ${vocab_size} ${output_path}
```
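For reference, here is a hedged sketch of what `learn_subword.sh` presumably wraps, using the `sentencepiece` Python package directly. All paths and the vocabulary size are placeholders standing in for the script's arguments, not values from this repo.

```python
# Sketch of training a SentencePiece Unigram tokenizer, matching the
# tokenizer type the README describes. Paths and vocab_size below are
# placeholder values for the data_path/vocab_size/output_path arguments.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/train.txt",            # data_path: one sentence per line
    model_prefix="tokenizer/afriteva",  # output_path: writes .model and .vocab
    vocab_size=32000,                  # vocab_size: placeholder value
    model_type="unigram",              # AfriTeVa uses a Unigram tokenizer
    character_coverage=1.0,            # keep the full character set
)

# Load the learned tokenizer and encode a sample sentence.
sp = spm.SentencePieceProcessor(model_file="tokenizer/afriteva.model")
print(sp.encode("Bawo ni o se wa?", out_type=str))
```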
Citation:

```bibtex
@inproceedings{jude-ogundepo-etal-2022-afriteva,
    title = "{A}fri{T}e{VA}: Extending {``}Small Data{''} Pretraining Approaches to Sequence-to-Sequence Models",
    author = "Jude Ogundepo, Odunayo  and
      Oladipo, Akintunde  and
      Adeyemi, Mofetoluwa  and
      Ogueji, Kelechi  and
      Lin, Jimmy",
    booktitle = "Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing",
    month = jul,
    year = "2022",
    address = "Hybrid",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.deeplo-1.14",
    doi = "10.18653/v1/2022.deeplo-1.14",
    pages = "126--135",
}
```