IndEL is the first Indonesian Entity Linking (EL) benchmark dataset for general and specific domains. It was manually created, and the labeling process was performed using a semantic annotation platform, INCEpTION. There are two Indonesian NER benchmark datasets utilized for entity extraction, i.e., NER UI and IndQNER. NER UI contains entities in a general domain, and IndQNER provides entities in a specific domain, i.e., the Indonesian Translation of the Quran. IndEL is presented in NIF format. The dataset contains:
Domain | Sentences | Tokens | Linked Entities |
---|---|---|---|
General Domain | 2114 | 44833 | 4765 |
Specific Domain | 2621 | 61816 | 2453 |
The manual annotation of IndEL, in both its general and specific domains, was carried out by non-volunteer native speakers. Specifically for the specific domain (pertaining to the Indonesian translation of the Quran), the annotators comprised students in their fourth year of bachelor studies from the Quran and Tafseer department at State Islamic University Syarif Hidayatullah Jakarta.
To showcase the utility of IndEL as a benchmark dataset, it was used to evaluate cutting-edge EL systems in multilingual settings using the GERBIL framework. The EL systems included Babelfy, DBpedia Spotlight, MAG, OpenTapioca, and WAT. Steps to perform the evaluation can be found here. Evaluation results are as follows:
Metrics | Babelfy | DBpedia Spotlight | MAG | OpenTapioca | WAT |
---|---|---|---|---|---|
General Domain | |||||
Precision | 0.7278 | 0.6750 | 0.4265 | 0.7984 | 0.6118 |
Recall | 0.3719 | 0.3577 | 0.4166 | 0.4105 | 0.5549 |
F1 | 0.4923 | 0.4676 | 0.4215 | 0.5423 | 0.5820 |
Specific Domain | |||||
Precision | 0.8049 | 0.8471 | 0.1523 | 0.6179 | 0.7715 |
Recall | 0.4725 | 0.6731 | 0.1508 | 0.0310 | 0.7501 |
F1 | 0.5954 | 0.7501 | 0.1515 | 0.0590 | 0.7606 |
The details of evaluation results can be observed here, and especially for MAG, it can be accessed here and here. In comparison, the performance of the mentioned EL systems on other benchmark datasets can be seen here by selecting D2KB as the experiment type.
To further investigate the impact of how Indonesian entities are presented in Wikidata on the EL systems' performance, another evaluation was performed by employing MAG, targeting the identification of NIL entities within both the general and specific domains. The evaluation results are distinguished into linked and Not in Lexicon (NIL) entities as follows:
Domain | Total Entities | Linked Entities | NIL Entities |
---|---|---|---|
General Domain | 4765 | 3163 | 1604 |
Specific Domain | 2453 | 1930 | 523 |
We have integrated IndEL into the GERBIL framework. This integration facilitates the evaluation of both multilingual and Indonesian-specific EL systems using IndEL.
During the creation of IndEL, we identified that a significant number of entities are missing in the Indonesian Wikidata. This observation highlights the limited range of entries in Indonesian Wikidata, particularly in comparison with its counterparts in other languages. To address this, our improvement plan includes expanding the linkage of entities in IndEL to other knowledge bases, such as BabelNet, DBpedia and YAGO. Additionally, we recognize another challenge posed by the magnitude of IndEL. To address this, we plan to enrich the dataset by incorporating additional NEs from other Indonesian NER benchmark datasets.
If you have any questions or feedbacks, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id