
Load BioBERT weights #135

Closed
JohnGiorgi opened this issue May 18, 2019 · 9 comments
JohnGiorgi commented May 18, 2019

Figure out how to load BioBERT's weights.

See these links for help.

@JohnGiorgi JohnGiorgi added enhancement New feature or request feature labels May 18, 2019
@JohnGiorgi JohnGiorgi self-assigned this May 18, 2019

JohnGiorgi commented May 21, 2019

Documenting how I finally got this to work:

1. Download the latest BioBERT pre-trained models from here. This was the only model I could convert without issue.

2. Assuming pytorch_pretrained_bert is installed (pip install pytorch_pretrained_bert if not), convert the TensorFlow checkpoint to a PyTorch weights file:

       export BERT_BASE_DIR=path/to/biobert_v1.1_pubmed
       pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch $BERT_BASE_DIR/model.ckpt-1000000 $BERT_BASE_DIR/bert_config.json $BERT_BASE_DIR/pytorch_model.bin

   where BERT_BASE_DIR should point to the downloaded and uncompressed BioBERT model.

3. Place pytorch_model.bin, bert_config.json, and vocab.txt (from BERT_BASE_DIR) in a folder (e.g. biobert) and gzip it:

       tar -cvzf biobert.gz biobert

4. The model can then be loaded with pytorch_pretrained_bert like so:

       from pytorch_pretrained_bert import BertForTokenClassification, BertTokenizer

       model = BertForTokenClassification.from_pretrained('path/to/biobert.gz', num_labels=num_labels)
       tokenizer = BertTokenizer.from_pretrained('path/to/biobert.gz', do_lower_case=False)

@jhyuklee

Hi, we've updated all the other BioBERT weights (v1.0) as the same format as v1.1, so it should work now.
Thank you.

@JohnGiorgi

That’s great, thanks for letting me know. Is there any reason to use v1.0 if I just want the best performance possible? Or should I stick with v1.1?

@jhyuklee

For most tasks, it will be better to stick with v1.1, but v1.0 (+PubMed 200K +PMC 270K) works well too, as shown in the paper (only minor differences). Note that we haven't updated our paper with the performance of v1.1 (it will take some time).
If performance on a single targeted task matters, you can compare them and choose what to use.

@JohnGiorgi

Right. Okay great, thanks for the response!

@Colelyman

Thanks for sharing what worked for you. I followed the steps provided and everything worked, except I discovered (as of writing) that when compressing the files together they can't be inside a directory; they have to sit flat at the top level of the archive.
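For reference, a flat archive can be produced from Python with tarfile by setting arcname to the bare file name (no directory prefix). A minimal sketch using empty stand-in files in place of the real pytorch_model.bin, bert_config.json, and vocab.txt:

```python
import os
import tarfile
import tempfile

# Empty stand-ins for the three files the steps above place in the archive.
workdir = tempfile.mkdtemp()
filenames = ["pytorch_model.bin", "bert_config.json", "vocab.txt"]
for name in filenames:
    open(os.path.join(workdir, name), "wb").close()

archive = os.path.join(workdir, "biobert.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    for name in filenames:
        # arcname=name keeps the archive flat: no leading directory component
        tar.add(os.path.join(workdir, name), arcname=name)

with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
print(members)  # each entry is a bare file name with no directory prefix
```

The shell equivalent is `tar -czf biobert.gz -C biobert pytorch_model.bin bert_config.json vocab.txt`, which archives the files without their parent directory.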


phaniram-sayapaneni commented Aug 5, 2020

> Hi, we've updated all the other BioBERT weights (v1.0) as the same format as v1.1, so it should work now.
> Thank you.

Hi @jhyuklee, the downloaded files [BioBERT-Base v1.1 (+ PubMed 1M)] do not contain a .ckpt file; they contain: model.ckpt-1000000.data-00000-of-00001, model.ckpt-1000000.index, model.ckpt-1000000.meta

Which one is the actual checkpoint file? When I try to load weights from model.ckpt-1000000.data-00000-of-00001 with tf.train.list_variables('model.ckpt-1000000.data-00000-of-00001'), it throws this error:

DataLossError: Unable to open table file biobert_v1.1_pubmed/model.ckpt-1000000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

I need to port BioBERT to PyTorch to be able to compare it with other SOTA/research models.
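For context (not spelled out in the thread itself): a TensorFlow v1 checkpoint is saved as three files sharing a common prefix, and APIs such as tf.train.list_variables() expect that shared prefix (here model.ckpt-1000000), not any individual file. A sketch of recovering the prefix from the three file names listed above:

```python
# The three files that together form one TensorFlow v1 checkpoint.
files = [
    "model.ckpt-1000000.data-00000-of-00001",
    "model.ckpt-1000000.index",
    "model.ckpt-1000000.meta",
]

def checkpoint_prefix(name):
    # Strip the per-file suffix to recover the common checkpoint prefix.
    for suffix in (".index", ".meta"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    # Data shards look like <prefix>.data-00000-of-00001
    if ".data-" in name:
        return name.split(".data-")[0]
    return name

prefixes = {checkpoint_prefix(f) for f in files}
print(prefixes)  # {'model.ckpt-1000000'}
```

This is why the conversion command above passes $BERT_BASE_DIR/model.ckpt-1000000: the shared prefix stands in for the whole checkpoint, and passing the .data shard directly produces exactly the "not an sstable" DataLossError quoted above.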

@phaniram-sayapaneni

Hi @JohnGiorgi, I tried the steps you mentioned, but got this error while loading the gz file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

any hints?
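One possible explanation (an educated guess, not confirmed in the thread): byte 0x8b at position 1 is the second byte of the gzip magic number (1f 8b), which suggests the compressed archive is being read as UTF-8 text, e.g. because the loading code was given the .gz path where it expected an already-uncompressed file. A quick check of the magic bytes using a throwaway gzip file:

```python
import gzip
import os
import tempfile

# Write a small gzip file and inspect its first two bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wb") as f:
    f.write(b"hello")

with open(path, "rb") as f:
    magic = f.read(2)
print(magic)  # b'\x1f\x8b' -- 0x8b at position 1, matching the error above
```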

@JohnGiorgi

Hi @phaniram-sayapaneni,

Are you simply looking to load BioBERT with HF Transformers? If so, you can follow this code: https://huggingface.co/monologg/biobert_v1.1_pubmed.

If you search for BioBERT here, you can see several variants and how to load them.
