
Files for BioBERT tokenizer #11

Closed

anjani-dhrangadhariya opened this issue Oct 24, 2019 · 4 comments

anjani-dhrangadhariya commented Oct 24, 2019

To use the tokenizer from BioBERT, the program requires BioBERT's tokenizer files.

tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/BioBERT_tokenizer_files')
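
For reference, a minimal sketch of the intended usage, assuming the Hugging Face transformers package; the directory path is illustrative:

from transformers import BertTokenizer

# Hypothetical directory holding vocab.txt and the JSON tokenizer files
tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/BioBERT_tokenizer_files')
print(tokenizer.tokenize('The patient was administered metformin.'))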

These files are generated when one saves a tokenizer using the following command.

tokenizer.save_pretrained('./my_saved_biobert_model_directory/')

This should save files with the following names (a round-trip sketch follows the list):

  1. added_tokens.json
  2. special_tokens_map.json
  3. tokenizer_config.json
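
For illustration, a hedged sketch of the save-and-inspect round trip; the base model and directory names are placeholders:

import os
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer.save_pretrained('./my_saved_biobert_model_directory/')
# Expected to list vocab.txt plus the JSON files named above
print(sorted(os.listdir('./my_saved_biobert_model_directory/')))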

However, I am not able to find these files in the pretrained BioBERT weights directory.

From this post, I understand that this is linked to issue #1. Does this mean one needs to use the tokenizer from BERT and not BioBERT? Which BERT tokenizer would be compatible with BioBERT?

I will be grateful for your response.

@hdatteln

I would be interested in this question, too; did you ever find out more about it?

@anjani-dhrangadhariya (Author)

> I would be interested in this question, too; did you ever find out more about it?

I had a deadline, so I used BERT, but I will delve into it again.

jhyuklee (Collaborator) commented Nov 19, 2019

Hi, sorry for the inconvenience. The BERT tokenizer is exactly the same as the BioBERT tokenizer. The files you mention appear to come from a newer version of BERT's vocabulary, which will be incompatible unless you modify the code. You can just use BioBERT's vocabulary, which is provided along with the pre-trained BioBERT files.
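
A minimal sketch of that suggestion, assuming the Hugging Face transformers BertTokenizer and the BioBERT v1.1 release layout; since BioBERT's vocabulary is cased, lowercasing is disabled here:

from transformers import BertTokenizer

# Illustrative path to the vocab file shipped with the pre-trained weights;
# BioBERT uses a cased vocabulary, so do_lower_case=False
tokenizer = BertTokenizer.from_pretrained(
    'biobert_v1.1_pubmed/vocab.txt',
    do_lower_case=False,
)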

@hdatteln
Thank you, @jhyuklee! Yes, that's what I did in the end, and it seems to be working OK:

the_tokenizer = BertTokenizer.from_pretrained('biobert_f/biobert_v1.1_pubmed/vocab.txt')
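
As a quick sanity check (a sketch; the sentence is arbitrary), tokenizing any biomedical string should now use BioBERT's WordPiece vocabulary:

# Prints the WordPiece split under BioBERT's vocabulary
print(the_tokenizer.tokenize('Aspirin inhibits cyclooxygenase.'))

Note that loading from a bare vocab.txt skips tokenizer_config.json, so options such as do_lower_case fall back to their defaults; for BioBERT's cased vocabulary, passing do_lower_case=False (as in the sketch above) is the safer choice.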
