
This is my research work done at Indira Gandhi Center for Atomic Research


ayushjain1144/NukeBERT


NukeBERT

Introduction


Significant advances have been made in recent years in Natural Language Processing, with machines surpassing human performance on many tasks, including but not limited to Question Answering. The majority of deep learning methods for Question Answering target domains with large datasets and a highly mature literature. The area of nuclear and atomic energy has remained largely unexplored in exploiting available unannotated data to drive industry-viable applications. To tackle this lack of quality datasets, this paper introduces two datasets: NText, an eight-million-word dataset extracted and preprocessed from nuclear research papers and theses; and NQuAD, a Nuclear Question Answering Dataset containing 700+ nuclear question-answer pairs developed and verified by expert nuclear researchers. The paper further proposes a data-efficient technique based on BERT, which significantly improves performance over the original BERT baseline on the above datasets. The datasets, code, and pretrained weights will be made publicly available, which will hopefully attract more research attention to the nuclear domain. Please read the paper for more details.

Documentation

The most important files are in the models folder:

  • bert_pretrained.ipynb: This contains the code for pretraining NukeBERT
  • bert_qa.ipynb: This file is used for question answering on NQuAD

Note: These notebooks assume you already have access to the datasets. Upload the datasets to your Google Drive and replace the corresponding data paths in the notebooks.
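If you run the notebooks on Google Colab, the Drive mount and path substitution might look like the sketch below. The directory and file names here are assumptions for illustration, not the repository's actual layout; point them at wherever you uploaded the datasets.

```python
from pathlib import Path

# On Google Colab, mount your Drive first (uncomment when running in Colab):
# from google.colab import drive
# drive.mount('/content/drive')

# Hypothetical location -- replace with your own Drive folder.
DATA_ROOT = Path("/content/drive/MyDrive/nukebert_data")

def resolve_data_paths(root: Path) -> dict:
    """Build the data paths the notebooks expect under a given root.

    The file names below are placeholders; substitute the actual
    dataset file names you received with the download link.
    """
    return {
        "ntext": root / "ntext_corpus.txt",  # pretraining corpus (bert_pretrained.ipynb)
        "nquad": root / "nquad.json",        # QA pairs (bert_qa.ipynb)
    }

paths = resolve_data_paths(DATA_ROOT)
```

You would then paste the resulting paths (e.g. `paths["nquad"]`) into the corresponding cells of the notebooks in place of the original data paths.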

Dataset

If you want access to the datasets, please fill out this form. If your request is approved, you will be able to access the datasets at this link. If I do not reply within a week, please feel free to drop me an email.

Citation

If you find our work useful, please consider citing us:

@misc{jain2020nukebert,
    title={NukeBERT: A Pre-trained language model for Low Resource Nuclear Domain},
    author={Ayush Jain and Dr. N. M. Meenachi and Dr. B. Venkatraman},
    year={2020},
    eprint={2003.13821},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
