Data Repository for LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons

BatsResearch/LexC-Gen-Data-Archive

LexC-Gen Generated Data Repository

This repository stores all intermediate and final data artifacts of LexC-Gen for both the NusaX and SIB-200 tasks.

HuggingFace 🤗

Researchers and practitioners who want to use the LexC-Gen generated data directly can find our datasets hosted on HuggingFace.

The datasets on HuggingFace have the structure {id, text, label}. For instance, an example from NusaX sentiment analysis is

{'id': '1',
 'text': "Anchorwoman : Hai , pubuet n't reuhung atra aneuk kumuen meulawan buli aneuk miet , ikat atra getnyan fingers ngeun saboh boh manok ngeun jangka gobnyan ho saboh pillar .",
 'label': 1}
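As a minimal sketch of working with records in this {id, text, label} structure, the snippet below checks that each record carries the expected keys and counts examples per label. It assumes the records have already been loaded into a list of dictionaries; the helper `label_counts` and the sample records are hypothetical and are not taken from the datasets.

```python
from collections import Counter

# Keys every record is expected to have, per the structure described above.
REQUIRED_KEYS = {"id", "text", "label"}


def label_counts(records):
    """Count examples per label, raising if a record lacks an expected key."""
    counts = Counter()
    for rec in records:
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(
                f"record {rec.get('id')!r} is missing keys: {sorted(missing)}"
            )
        counts[rec["label"]] += 1
    return counts


# Hypothetical placeholder records, for illustration only.
sample = [
    {"id": "1", "text": "placeholder text a", "label": 1},
    {"id": "2", "text": "placeholder text b", "label": 0},
    {"id": "3", "text": "placeholder text c", "label": 1},
]
print(label_counts(sample))
```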

Intermediate Data Artifacts

[Figure: LexC-Gen overview]

The intermediate data artifacts stored here for both the NusaX and SIB-200 tasks include:

  • raw generated English text after step (2) (.txt format in {task}-lexcgen-raw-data/)
  • raw text converted to CSV (.csv format in {task}-lexcgen-processed-data/)
  • data filtered by input-label consistency filtering, i.e. after step (3) (filtered-*.csv)
  • filtered English data tokenized with Stanza (tokenized_filtered-*.csv)
  • data translated into the respective low-resource languages using the Gatitos bilingual lexicon, i.e. after step (4) (translated-*.csv)
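The prefixes and extensions listed above can be used to tell which pipeline stage a file belongs to. The following is a minimal sketch under that assumption; the function `artifact_stage`, the stage labels it returns, and the example paths are hypothetical and only mirror the naming patterns described in the list.

```python
from pathlib import PurePosixPath


def artifact_stage(path):
    """Guess the pipeline stage of an artifact from its file name.

    Prefixed CSVs are checked first; files without a known prefix fall back
    to their extension (.txt for raw text, .csv for converted raw text).
    """
    p = PurePosixPath(path)
    name = p.name
    if name.startswith("tokenized_filtered-"):
        return "tokenized"
    if name.startswith("filtered-"):
        return "filtered"
    if name.startswith("translated-"):
        return "translated"
    if p.suffix == ".txt":
        return "raw"
    if p.suffix == ".csv":
        return "processed"
    return "unknown"


# Hypothetical paths, for illustration only.
for example in [
    "nusax-lexcgen-raw-data/example.txt",
    "nusax-lexcgen-processed-data/example.csv",
    "nusax-lexcgen-processed-data/filtered-example.csv",
    "nusax-lexcgen-processed-data/tokenized_filtered-example.csv",
    "nusax-lexcgen-processed-data/translated-example.csv",
]:
    print(example, "->", artifact_stage(example))
```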

File names follow the format {model_name}-{task_type}-en-{lang}-ctg-total{size}, where:

  • model_name: the LLM used to generate the lexicon-conditioned data
  • task_type: sa for sentiment analysis and tm for topic classification (tm because we originally called it topic modeling)
  • lang: the low-resource language code
  • size: the number of LexC-Gen generated examples before filtering (1K, 10K, or 100K)
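The naming format can be split back into its fields with a regular expression. This is a minimal sketch, assuming the file extension and any stage prefix (such as filtered-) have already been stripped, that language codes contain no hyphens, and that sizes are written as digits followed by K. The function `parse_artifact_name` and the example name are hypothetical.

```python
import re

# model_name may itself contain hyphens, so it is matched greedily and the
# fixed "-{task_type}-en-{lang}-ctg-total{size}" tail anchors the rest.
NAME_RE = re.compile(
    r"^(?P<model_name>.+)-(?P<task_type>sa|tm)-en-(?P<lang>[^-]+)"
    r"-ctg-total(?P<size>\d+K)$"
)


def parse_artifact_name(stem):
    """Split an artifact name into model_name, task_type, lang, and size."""
    match = NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unrecognized artifact name: {stem!r}")
    return match.groupdict()


# Hypothetical name, for illustration only.
print(parse_artifact_name("some-llm-sa-en-ace-ctg-total100K"))
```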

Bibtex

@inproceedings{yong2024lexcgen,
  title = {LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons},
  author = {Zheng-Xin Yong and Cristina Menghini and Stephen H. Bach},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
  year = {2024}
}
