This repo contains the code for the paper AfriTeVa: Extending “Small Data” Pretraining Approaches to Sequence-to-Sequence Models. AfriTeVa is a sequence-to-sequence transformer model pretrained on ten African languages:
Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)
Models:
We release the following pretrained models:
- AfriTeVa Small (64M params)
- AfriTeVa Base (229M params)
- AfriTeVa Large (745M params)
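The sketch below shows one way to load a released checkpoint with Hugging Face `transformers`. The `castorini/afriteva_base` model ID is an assumption (verify the exact identifiers on the Hub); this is a minimal illustration, not part of the released scripts.

```python
# Minimal sketch: loading an AfriTeVa checkpoint via Hugging Face transformers.
# The "castorini/afriteva_base" ID is an assumption; check the Hub for the
# exact names of the released Small/Base/Large checkpoints.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "castorini/afriteva_base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Generate from a simple input; real use would apply a task-specific
# prefix or fine-tuning before generation.
inputs = tokenizer("Eku aaro", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```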
Datasets:
- Language Modelling: the data for language modelling can be downloaded from this URL
- Machine Translation: the machine translation dataset can be downloaded from this repository
- Text Classification: the topic classification dataset can be downloaded from this repository
We trained a SentencePiece Unigram tokenizer for AfriTeVa, which can be downloaded here. To train a custom tokenizer instead, run the command below with the following arguments:
- data_path: Path to your training file(s)
- vocab_size: Size of your learned vocabulary (number of tokens)
- output_path: Path to store learned tokenizer files
```bash
(virtual_env)$ bash learn_subword.sh ${data_path} ${vocab_size} ${output_path}
```
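For reference, here is a hedged sketch of what `learn_subword.sh` presumably wraps, using the `sentencepiece` Python package directly. All paths and the vocabulary size are placeholders standing in for the script's arguments, not values from this repo.

```python
# Sketch of training a SentencePiece Unigram tokenizer, matching the
# tokenizer type the README describes. Paths and vocab_size below are
# placeholder values for the data_path/vocab_size/output_path arguments.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/train.txt",            # data_path: one sentence per line
    model_prefix="tokenizer/afriteva",  # output_path: writes .model and .vocab
    vocab_size=32000,                  # vocab_size: placeholder value
    model_type="unigram",              # AfriTeVa uses a Unigram tokenizer
    character_coverage=1.0,            # keep the full character set
)

# Load the learned tokenizer and encode a sample sentence.
sp = spm.SentencePieceProcessor(model_file="tokenizer/afriteva.model")
print(sp.encode("Bawo ni o se wa?", out_type=str))
```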
Citation:

```bibtex
@inproceedings{jude-ogundepo-etal-2022-afriteva,
    title = "{A}fri{T}e{VA}: Extending {``}Small Data{''} Pretraining Approaches to Sequence-to-Sequence Models",
    author = "Jude Ogundepo, Odunayo  and
      Oladipo, Akintunde  and
      Adeyemi, Mofetoluwa  and
      Ogueji, Kelechi  and
      Lin, Jimmy",
    booktitle = "Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing",
    month = jul,
    year = "2022",
    address = "Hybrid",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.deeplo-1.14",
    doi = "10.18653/v1/2022.deeplo-1.14",
    pages = "126--135",
}
```