
This is my research work done at Indira Gandhi Center for Atomic Research


ayushjain1144/NukeBERT


NukeBERT

Introduction


Significant advances have been made in recent years in Natural Language Processing, with machines surpassing human performance on many tasks, including but not limited to Question Answering. The majority of deep learning methods for Question Answering target domains with large datasets and a highly mature literature. The area of nuclear and atomic energy has remained largely unexplored in exploiting available unannotated data to drive industry-viable applications. To tackle this lack of quality datasets, this paper introduces two datasets: NText, an eight-million-word dataset extracted and preprocessed from nuclear research papers and theses; and NQuAD, a Nuclear Question Answering Dataset containing 700+ nuclear question-answer pairs developed and verified by expert nuclear researchers. The paper further proposes a data-efficient technique based on BERT, which significantly improves performance over the original BERT baseline on the above datasets. The datasets, code, and pretrained weights will be made publicly available, which will hopefully attract more research attention to the nuclear domain. Please read the paper for more details.

Documentation

The most important files are in the models folder:

  • bert_pretrained.ipynb: This contains the code for pretraining NukeBERT
  • bert_qa.ipynb: This file is used for question answering on NQuAD

Note: These notebooks assume you already have access to the datasets. Upload the datasets to your Google Drive and replace the corresponding data paths in the notebooks.
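If you run the notebooks on Google Colab, the Drive mount and path substitution might look like the sketch below. The directory and file names here are assumptions for illustration, not the repository's actual layout; point them at wherever you uploaded the datasets.

```python
from pathlib import Path

# On Google Colab, mount your Drive first (uncomment when running in Colab):
# from google.colab import drive
# drive.mount('/content/drive')

# Hypothetical location -- replace with your own Drive folder.
DATA_ROOT = Path("/content/drive/MyDrive/nukebert_data")

def resolve_data_paths(root: Path) -> dict:
    """Build the data paths the notebooks expect under a given root.

    The file names below are placeholders; substitute the actual
    dataset file names you received with the download link.
    """
    return {
        "ntext": root / "ntext_corpus.txt",  # pretraining corpus (bert_pretrained.ipynb)
        "nquad": root / "nquad.json",        # QA pairs (bert_qa.ipynb)
    }

paths = resolve_data_paths(DATA_ROOT)
```

You would then paste the resulting paths (e.g. `paths["nquad"]`) into the corresponding cells of the notebooks in place of the original data paths.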

Dataset

If you want access to the datasets, please fill out this form. If your request is approved, you will be able to access the datasets at this link. If I do not reply within a week, please feel free to drop me an email.

Citation

If you find our work useful, please consider citing us:

@misc{jain2020nukebert,
    title={NukeBERT: A Pre-trained language model for Low Resource Nuclear Domain},
    author={Ayush Jain and Dr. N. M. Meenachi and Dr. B. Venkatraman},
    year={2020},
    eprint={2003.13821},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
