ELMEval: a framework to evaluate large language models (LLMs) on Entity Linking (EL) tasks for Low-resource Languages (LrLs), specifically Indonesian

Ria Hari Gusmita, Asep Fajar Firmansyah, Hamada Zahera, Axel-Cyrille Ngonga Ngomo

Introduction

We present ELMEval, a framework designed to evaluate LLMs on EL tasks for LrLs and to assess their effectiveness in data annotation. Our goal is to offer a cost-effective approach to creating or expanding EL benchmark datasets for LrLs. We use IndEL to run evaluations in both zero-shot and fine-tuning settings with multilingual and monolingual Indonesian LLMs. The multilingual LLMs are GPT-3.5, GPT-4, and LLaMA-3; the monolingual Indonesian LLMs are Komodo and Merak. We then analyze the results along three dimensions: accuracy, generalization, and errors. The accuracy analysis assesses how accurately the LLMs identify entities and link them to the correct entries on Wikidata, measured by precision, recall, and F1-score. The generalization analysis evaluates how well the LLMs transfer from one domain to another (cross-domain evaluation) and how they perform in mixed-domain settings, where they are fine-tuned on combined data from the general and specific domains and then tested on the specific domain. Finally, a detailed error analysis identifies common types of mistakes made by the LLMs, such as misidentifying entities or failing to link them correctly.

IndEL

IndEL is the first Indonesian EL benchmark dataset, covering both a general and a specific domain. It uses Wikidata as its knowledge base and is manually annotated following meticulous guidelines. The entities in the general domain are sourced from the Indonesian NER benchmark dataset NER UI, while those in the specific domain come from IndQNER, an Indonesian NER benchmark dataset based on the Indonesian translation of the Quran. IndEL has been used to evaluate five multilingual EL systems, namely Babelfy, DBpedia Spotlight, MAG, OpenTapioca, and WAT, via the GERBIL framework. Details on the dataset and the experiment results can be found here.

Evaluation Process

Similar to human-based annotation, where annotation guidelines ensure standardized and correct results, we define dedicated prompts for the LLMs. Each prompt comprises two parts, a task description and the desired output format, as shown below.

Instruction Template

Task Description: Find entities and their corresponding entry links in Wikidata within the following sentence. Use the context of the sentence to determine the correct entries in Wikidata.

Output Format: The output should be formatted as: [entity1=link1, entity2=link2]. No explanations are needed.

Sample Sentence: Pria kelahiran Bogor, 16 Maret 60 tahun silam itu juga ditunjuk sebagai salah satu direktur Indofood dalam RUPS Juni 2008 silam. (A man born in Bogor, 60 years ago on March 16, was also appointed as one of the directors of Indofood in the General Meeting of Shareholders in June 2008.)
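As an illustration only, the sketch below shows one way to assemble such a prompt and parse the bracketed output format; the helper names (build_prompt, parse_output) are hypothetical and not taken from the repository's scripts.

```python
import re

# Task description and output format, copied from the instruction template above.
TASK_DESCRIPTION = (
    "Find entities and their corresponding entry links in Wikidata within the "
    "following sentence. Use the context of the sentence to determine the "
    "correct entries in Wikidata."
)
OUTPUT_FORMAT = (
    "The output should be formatted as: [entity1=link1, entity2=link2]. "
    "No explanations are needed."
)

def build_prompt(sentence: str) -> str:
    """Combine the task description, output format, and input sentence."""
    return f"{TASK_DESCRIPTION}\n{OUTPUT_FORMAT}\nSentence: {sentence}"

def parse_output(raw: str) -> dict[str, str]:
    """Parse '[entity1=link1, entity2=link2]' into {entity: link}.

    Naive parsing: assumes links contain no commas.
    """
    match = re.search(r"\[(.*?)\]", raw, re.DOTALL)
    if not match:
        return {}
    pairs = (item.split("=", 1) for item in match.group(1).split(",") if "=" in item)
    return {entity.strip(): link.strip() for entity, link in pairs}

print(parse_output("[Bogor=https://www.wikidata.org/wiki/Q100]"))  # placeholder ID
```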

In the zero-shot setting, we prompt the LLMs with an instruction that includes only the task description and the output format. In the fine-tuning setting, the LLMs are additionally provided with detailed prompts and example sentences from the dataset. To support both experiments, we split IndEL into training, validation, and test sets at an 8:1:1 ratio. The details of the split are given in the table below, followed by a minimal split sketch.

| Domain | Total Sentences | Train | Validation | Test |
|---|---|---|---|---|
| General Domain | 2114 | 1673 | 229 | 212 |
| Specific Domain | 2621 | 2075 | 283 | 263 |
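A minimal sketch of an 8:1:1 split, assuming each dataset is a list of annotated sentences; the repository's actual split may have been produced differently, as the published counts suggest.

```python
import random

def split_8_1_1(sentences, seed=42):
    """Shuffle and split a list of annotated sentences into 80/10/10."""
    data = list(sentences)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * 0.8)
    n_val = int(len(data) * 0.1)
    train = data[:n_train]
    validation = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, validation, test

# Example: the general domain has 2114 sentences, which this naive split would
# divide 1691/211/212; the published split (1673/229/212) was evidently
# produced with a different procedure.
```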

Steps in Zero-shot Experiment

GPT-4

1. Model: GPT-4
2. Dataset preparation: set the domain manually in preparing_dataset.py, then run it:

```python
domain = "general-domain"  # change to "specific-domain" to generate the test dataset for the specific domain, and vice versa
```

```bash
cd scripts/gpt
python preparing_dataset.py
```

3. Execute the zero-shot prediction (a minimal request sketch follows these steps):

```bash
cd scripts/gpt
python run_predictions.py
```
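run_predictions.py itself lives in the repository; as a rough sketch, a zero-shot request through the OpenAI Python SDK could look like the following. The temperature setting is an assumption, and the prompt is assembled as in the earlier template sketch.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_zero_shot(prompt: str) -> str:
    """Send the instruction prompt for one sentence and return the raw reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # assumed: deterministic decoding for evaluation
    )
    return response.choices[0].message.content
```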

LLaMA Family

1. Models: Komodo-7b-base, Llama-3-8B-Instruct, Merak-7B-v4
2. Dataset preparation: the datasets are available in the datasets directory.
3. Execute the zero-shot prediction; the values of domain and base_model_name are subject to change (see the sketch after this list):

```bash
cd scripts/llama
python run_predictions.py
```
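As with GPT-4, the actual script is in the repository; a minimal sketch of zero-shot generation with Hugging Face transformers might look as follows. The model identifier and generation parameters are assumptions, not values read from run_predictions.py.

```python
from transformers import pipeline

# base_model_name is subject to change, e.g. a Komodo, LLaMA-3, or Merak checkpoint.
base_model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed identifier

generator = pipeline("text-generation", model=base_model_name, device_map="auto")

def predict_zero_shot(prompt: str) -> str:
    """Generate a completion for the instruction prompt."""
    outputs = generator(prompt, max_new_tokens=128, do_sample=False)
    return outputs[0]["generated_text"]
```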

Steps in Fine-tuning Experiment

1. Dataset preparation: to prepare the datasets for fine-tuning the GPT model, refer to step 2 in the LLaMA Family section. Pay attention to the path of the source dataset, the domain, and the name of the file where the processed data will be stored.
2. Fine-tuning process:
   - GPT-3.5: We fine-tune GPT-3.5 rather than GPT-4 because it was the model available for fine-tuning. To perform this process, follow the procedure on OpenAI's fine-tuning platform. In this experiment, we set the hyperparameters as follows: number of epochs = 3, batch size = 8, and learning rate multiplier = 2 (a sketch of creating such a job via the OpenAI Python SDK follows this list).
   - LLaMA Family: the values of domain and base_model_name are subject to change:

```bash
cd scripts/llama
python llm-finetuning.py
```
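For reference, a fine-tuning job with the stated hyperparameters could be created through the OpenAI Python SDK roughly as follows; the training file name is hypothetical, and the experiment itself used OpenAI's platform procedure.

```python
from openai import OpenAI

client = OpenAI()

# Upload the prepared training file (JSONL in chat fine-tuning format).
training_file = client.files.create(
    file=open("general-domain-train.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

# Create the job with the hyperparameters reported above:
# 3 epochs, batch size 8, learning rate multiplier 2.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 8,
        "learning_rate_multiplier": 2,
    },
)
print(job.id)
```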

Evaluation Results

We evaluate four LLMs (GPT-4, Komodo, LLaMA-3, and Merak) on the EL task with the IndEL dataset in the zero-shot setting. In the fine-tuning setting, we evaluate GPT-3.5 (we did not have access to fine-tune GPT-4), Komodo, LLaMA-3, and Merak. The following are the results, measured in precision, recall, and F1-score.
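For reference, micro-averaged precision, recall, and F1 over (mention, Wikidata link) pairs can be computed as in the sketch below; the repository's evaluation script may aggregate differently, and the IDs in the example are placeholders.

```python
def el_scores(gold: set[tuple[str, str]], predicted: set[tuple[str, str]]):
    """Micro precision/recall/F1 over (mention, Wikidata link) pairs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Bogor", "Q100"), ("Indofood", "Q200")}  # placeholder Wikidata IDs
pred = {("Bogor", "Q100")}
print(el_scores(gold, pred))  # (1.0, 0.5, 0.666...)
```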

General Domain with Zero-shot

| Metrics | GPT-4 | Komodo | LLaMA-3 | Merak |
|---|---|---|---|---|
| Precision | 0.083 | 0.000 | 0.003 | 0.000 |
| Recall | 0.089 | 0.000 | 0.003 | 0.000 |
| F1 | 0.083 | 0.000 | 0.003 | 0.000 |

Specific Domain with Zero-shot

| Metrics | GPT-4 | Komodo | LLaMA-3 | Merak |
|---|---|---|---|---|
| Precision | 0.010 | 0.000 | 0.000 | 0.000 |
| Recall | 0.016 | 0.000 | 0.000 | 0.000 |
| F1 | 0.012 | 0.000 | 0.000 | 0.000 |

General Domain with Fine-tuning

| Metrics | GPT-3.5 | Komodo | LLaMA-3 | Merak |
|---|---|---|---|---|
| Precision | 0.385 | 0.018 | 0.084 | 0.045 |
| Recall | 0.373 | 0.026 | 0.117 | 0.039 |
| F1 | 0.373 | 0.021 | 0.093 | 0.041 |

Specific Domain with Fine-tuning

| Metrics | GPT-3.5 | Komodo | LLaMA-3 | Merak |
|---|---|---|---|---|
| Precision | 0.616 | 0.221 | 0.415 | 0.446 |
| Recall | 0.610 | 0.471 | 0.444 | 0.393 |
| F1 | 0.611 | 0.285 | 0.409 | 0.407 |

Contact

If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id
