Clinical Decision Support System with Deep Learning (Natural Language Processing)

This project covers the Natural Language Processing (NLP) component of a clinical decision support system for cardio-pneumological diseases. It was carried out in collaboration with Third Eye Intelligence as part of the Bachelor Year Group Project of the MEng Biomedical Engineering (Computational Bioengineering) programme at Imperial College London. The paper for this study can be found here: https://drive.google.com/file/d/1JKRXKfszJk8KMNhxOEtvyNehMl98FHNn/view?usp=sharing

In this section, we aim to fine-tune Bio+ClinicalBERT, a BERT-based model, on all radiology reports available in the MIMIC-CXR database, so that our models can understand the reports and classify the diagnoses they describe. All the trained models can be found on Hugging Face.

The code for our deep learning models is written in Python and implemented with PyTorch 1.7.1.

Masked Language Modelling

The CXR_BioClinicalBERT_MLM model was fine-tuned on a Masked Language Modelling (MLM) task. The model is trained to predict masked text by recovering the whole word, which lets us validate the model's understanding of radiological content. The model achieved a perplexity score of 1.0710 after 10 epochs of training.
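To make the perplexity figure easier to interpret, here is a minimal sketch of how perplexity is commonly derived from an MLM loss. It assumes perplexity is the exponential of the mean cross-entropy (natural log) over the masked tokens; the function name and the loss values are hypothetical and are not taken from the project's notebooks.

```python
import math


def perplexity(token_losses):
    """Return exp(mean cross-entropy) over the masked tokens.

    Assumes each loss is a natural-log cross-entropy for one masked token.
    """
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical per-token losses, for illustration only.
example_losses = [0.05, 0.10, 0.07, 0.06]
ppl = perplexity(example_losses)
print(f"perplexity = {ppl:.4f}")
```

Under this definition, a perplexity close to 1 means the mean loss on masked tokens is close to 0; a score of 1.0710 would correspond to a mean loss of roughly 0.069.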

The contextualized word embeddings output by the model were converted into sentence embeddings by a mean pooling operation. Sentence embeddings of different radiology reports were then compared semantically using cosine similarity. Examples of the results can be found in the paper and in ReportSimilarity_sections.ipynb.
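The pooling and comparison steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's code: the function names are hypothetical, the vectors are toy 2-dimensional values rather than real BERT outputs, and padding tokens are assumed to be excluded through an attention mask. In practice the same arithmetic would be done on PyTorch tensors.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy token embeddings; the third token of report A is padding.
report_a = mean_pool([[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]], [1, 1, 0])
report_b = mean_pool([[0.0, 2.0], [0.0, 4.0]], [1, 1])
similarity = cosine_similarity(report_a, report_b)
```

Masking before averaging matters because padding vectors would otherwise pull the sentence embedding toward values that carry no meaning for the report.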

Text Classification

The CXR_BioClinicalBERT_Class model was fine-tuned from the CXR_BioClinicalBERT_MLM model on a classification task, so that it can perform multi-label text classification across 13 different cardiopulmonary conditions. The classification evaluation can be found in the paper and in CXR_BioClinicalBERT_Class.ipynb. Prediction examples can be found in TC_prediction.ipynb.
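In multi-label classification, each condition gets its own independent yes/no decision, so a single report can receive several labels or none. The sketch below shows one common way to turn 13 output logits into labels, assuming a per-label sigmoid and a 0.5 threshold. The threshold, the function names and the placeholder label names are assumptions for illustration; the actual conditions and decision rule are described in the paper and notebooks.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose sigmoid probability reaches the threshold."""
    if len(logits) != len(label_names):
        raise ValueError("expected one logit per label")
    return [
        name
        for name, logit in zip(label_names, logits)
        if sigmoid(logit) >= threshold
    ]


# Placeholder names: the real 13 conditions are listed in the paper.
label_names = [f"condition_{i}" for i in range(13)]

# Hypothetical logits for one report.
logits = [-2.0] * 13
logits[0] = 3.0
logits[5] = 0.4

predicted = predict_labels(logits, label_names)
print(predicted)
```

Unlike a softmax, the per-label sigmoid does not force the probabilities to sum to one, which is what allows co-occurring conditions to be predicted together.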

Dataset and Pre-processing

The MIMIC-CXR database (v2.0.0) was used for training. It is the largest publicly available chest X-ray dataset, containing 377,110 radiographs and 225,606 associated radiology reports in free-text format.

Only simple text pre-processing, namely punctuation and number removal, was applied. Typical steps such as stemming, lemmatization and stopword removal were avoided, because transformer models like BERT are highly context-dependent and rely on the original wording.
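A minimal sketch of this kind of pre-processing is shown below. It assumes that punctuation and digits are replaced by spaces and that repeated whitespace is collapsed; the function name and the sample sentence are hypothetical, and the project's own cleaning code may differ in detail.

```python
import re
import string

# Character class matching any ASCII punctuation mark or digit.
_PUNCT_AND_DIGITS = re.compile(f"[{re.escape(string.punctuation)}0-9]")


def clean_report(text):
    """Remove punctuation and numbers, then collapse repeated whitespace.

    Word forms and stopwords are left untouched on purpose.
    """
    without_symbols = _PUNCT_AND_DIGITS.sub(" ", text)
    return " ".join(without_symbols.split())


cleaned = clean_report("No focal consolidation. Heart size is normal, 2 views.")
print(cleaned)
```

Replacing symbols with a space rather than deleting them avoids accidentally joining two words that were separated only by punctuation.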
