This source code is based on Hugging Face's tutorial on extractive question answering with Transformer language models. The system takes a context and a question as input and extracts the answer from that context.
The model used is bhavikardeshna/xlm-roberta-base-vietnamese, a RoBERTa-based language model trained on Vietnamese data. The model is described in the paper Cascading Adaptors to Leverage English Data to Improve Performance of Question Answering for Low-Resource Languages.
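Under the hood, an extractive QA model scores every token as a possible answer start and answer end, and the answer is the span with the best combined score. A minimal sketch of that span-selection step (the tokens and logits below are toy values, not real model output):

```python
def extract_answer(tokens, start_logits, end_logits, max_answer_len=15):
    """Pick the (start, end) token span maximizing start_logits[s] + end_logits[e]."""
    best_score, best_span = float("-inf"), (0, 0)
    for s, s_logit in enumerate(start_logits):
        # Only consider spans that start at s and are not too long
        for e in range(s, min(s + max_answer_len, len(tokens))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    s, e = best_span
    return " ".join(tokens[s:e + 1])

# Toy example: the logits peak on the span "Ha Noi"
tokens = ["The", "capital", "of", "Vietnam", "is", "Ha", "Noi", "."]
start = [0.1, 0.0, 0.0, 0.2, 0.1, 5.0, 0.3, 0.0]
end = [0.0, 0.1, 0.0, 0.1, 0.0, 0.2, 5.0, 0.1]
print(extract_answer(tokens, start, end))  # → Ha Noi
```

In the real pipeline the tokens come from the XLM-RoBERTa tokenizer and the logits from the model's QA head, but the span-selection logic is the same.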
The dataset used is UIT-ViQuAD, which comprises over 23,000 human-generated question-answer pairs drawn from 5,109 passages in 174 Vietnamese Wikipedia articles. During preprocessing, I removed more than 3,000 unanswerable questions.
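The preprocessing can be sketched as: drop every example whose answer list is empty, then hold out a test split. This is a simplified stand-in in plain Python (the SQuAD-style `answers` field layout is assumed from UIT-ViQuAD's format; the toy data is illustrative):

```python
import random

def filter_and_split(examples, test_size=0.06, seed=42):
    """Drop unanswerable questions, then hold out a random test split."""
    answerable = [ex for ex in examples if ex["answers"]["text"]]
    random.Random(seed).shuffle(answerable)
    n_test = round(len(answerable) * test_size)
    return answerable[n_test:], answerable[:n_test]

# Toy data: 50 answerable examples and 10 unanswerable ones
examples = (
    [{"id": str(i), "answers": {"text": ["x"], "answer_start": [0]}} for i in range(50)]
    + [{"id": f"na{i}", "answers": {"text": [], "answer_start": []}} for i in range(10)]
)
train, test = filter_and_split(examples)
print(len(train), len(test))  # → 47 3
```

With the `datasets` library, the same two steps would be `Dataset.filter(...)` followed by `Dataset.train_test_split(test_size=0.06)`.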
The processed dataset is split with a test size of 0.06. Evaluation results on the test set:
| EM    | F1-score |
|-------|----------|
| 52.38 | 77.67    |
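EM (exact match) checks whether the normalized prediction equals the normalized gold answer, while F1 measures token overlap between them. A simplified version of the SQuAD-style metrics (the official script also strips English articles, which is omitted here since the data is Vietnamese):

```python
import string
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation and extra whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    return " ".join(s.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Hà Nội", "hà nội"))                 # → 1.0
print(round(f1_score("thủ đô Hà Nội", "Hà Nội"), 2))   # → 0.67
```

The scores in the table are these two metrics averaged over all test-set examples, scaled to 0-100.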
Below are some test results:
Test.1
Test.2
Test.3
Relatively good 😅