BioBERT Pre-trained Weights

This repository provides pre-trained weights of BioBERT, a language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. Please refer to our paper BioBERT: a pre-trained biomedical language representation model for biomedical text mining for more details.

Downloading pre-trained weights

Go to releases section of this repository or click links below to download pre-trained weights of BioBERT. We provide three combinations of pre-trained weights: BioBERT (+ PubMed), BioBERT (+ PMC), and BioBERT (+ PubMed + PMC). Pre-training was based on the original BERT code provided by Google, and training details are described in our paper. Currently available versions of pre-trained weights are as follows:

BioBERT-Base v1.1 (+ PubMed 1M) - based on BERT-base-Cased (same vocabulary)
BioBERT-Large v1.1 (+ PubMed 1M) - based on BERT-large-Cased (custom 30k vocabulary), NER/QA Results
BioBERT-Base v1.0 (+ PubMed 200K) - based on BERT-base-Cased (same vocabulary)
BioBERT-Base v1.0 (+ PMC 270K) - based on BERT-base-Cased (same vocabulary)
BioBERT-Base v1.0 (+ PubMed 200K + PMC 270K) - based on BERT-base-Cased (same vocabulary)

Make sure to specify the versions of pre-trained weights used in your works. If you have difficulty choosing which one to use, we recommend using BioBERT-Base v1.1 (+ PubMed 1M) or BioBERT-Large v1.1 (+ PubMed 1M) depending on your GPU resources. Note that for BioBERT-Base, we are using WordPiece vocabulary (vocab.txt) provided by Google as any new words in biomedical corpus can be represented with subwords (for instance, Leukemia => Leu + ##ke + ##mia). More details are in the closed issue #1.

Pre-training corpus

We do not provide pre-processed version of each corpus. However, each pre-training corpus could be found in the following links:

PubMed Abstracts1: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
PubMed Abstracts2: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
PubMed Central Full Texts: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/

Estimated size of each corpus is 4.5 billion words for PubMed Abstracts1 + PubMed Abstracts2, and 13.5 billion words for PubMed Central Full Texts.

Fine-tuning BioBERT

To fine-tunine BioBERT on biomedical text mining tasks using provided pre-trained weights, refer to the DMIS GitHub repository for BioBERT.

Citation

@article{10.1093/bioinformatics/btz682,
    author = {Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
    title = "{BioBERT: a pre-trained biomedical language representation model for biomedical text mining}",
    journal = {Bioinformatics},
    year = {2019},
    month = {09},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btz682},
    url = {https://doi.org/10.1093/bioinformatics/btz682},
}

Contact information

For help or issues using pre-trained weights of BioBERT, please submit a GitHub issue. Please contact Jinhyuk Lee (lee.jnhk@gmail.com), or Sungdong Kim (sungdong.kim@navercorp.com) for communication related to pre-trained weights of BioBERT.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BioBERT Pre-trained Weights

Downloading pre-trained weights

Pre-training corpus

Fine-tuning BioBERT

Citation

Contact information

About

Releases 4

Packages

Contributors 2

naver/biobert-pretrained

Folders and files

Latest commit

History

Repository files navigation

BioBERT Pre-trained Weights

Downloading pre-trained weights

Pre-training corpus

Fine-tuning BioBERT

Citation

Contact information

About

Resources

Stars

Watchers

Forks

Releases 4

Packages 0

Contributors 2

Packages