
Files for BioBERT tokenizer #11

Closed

anjani-dhrangadhariya opened this issue Oct 24, 2019 · 4 comments

anjani-dhrangadhariya commented Oct 24, 2019

To use the tokenizer from BioBERT, the program requires BioBERT's tokenizer files.

tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/BioBERT_tokenizer_files')
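
For reference, a minimal sketch of the intended usage, assuming the Hugging Face transformers package; the directory path is illustrative:

from transformers import BertTokenizer

# Hypothetical directory holding vocab.txt and the JSON tokenizer files
tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/BioBERT_tokenizer_files')
print(tokenizer.tokenize('The patient was administered metformin.'))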

These files are generated when one saves a tokenizer using the following command.

tokenizer.save_pretrained('./my_saved_biobert_model_directory/')

This should save files with the following names (a round-trip sketch follows the list):

  1. added_tokens.json
  2. special_tokens_map.json
  3. tokenizer_config.json
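
For illustration, a hedged sketch of the save-and-inspect round trip; the base model and directory names are placeholders:

import os
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
tokenizer.save_pretrained('./my_saved_biobert_model_directory/')
# Expected to list vocab.txt plus the JSON files named above
print(sorted(os.listdir('./my_saved_biobert_model_directory/')))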

However, I am not able to find these files in the pretrained BioBERT weights directory.

From this post, I understand that this is linked to issue #1. Does this mean one needs to use the tokenizer from BERT and not BioBERT? Which BERT tokenizer would be compatible with BioBERT?

I will be grateful for your response.

@hdatteln

I would be interested in this question, too; did you ever find out more about it?

@anjani-dhrangadhariya (Author)

> I would be interested in this question, too; did you ever find out more about it?

I had a deadline, so I used BERT, but I will delve into it again.

jhyuklee (Collaborator) commented Nov 19, 2019

Hi, sorry for the inconvenience. The BERT tokenizer is exactly the same as the BioBERT tokenizer. The files you mention appear to come from a newer version of BERT's vocabulary, which will be incompatible unless you modify the code. You can just use BioBERT's vocabulary, which is provided along with the pre-trained BioBERT files.
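
A minimal sketch of that suggestion, assuming the Hugging Face transformers BertTokenizer and the BioBERT v1.1 release layout; since BioBERT's vocabulary is cased, lowercasing is disabled here:

from transformers import BertTokenizer

# Illustrative path to the vocab file shipped with the pre-trained weights;
# BioBERT uses a cased vocabulary, so do_lower_case=False
tokenizer = BertTokenizer.from_pretrained(
    'biobert_v1.1_pubmed/vocab.txt',
    do_lower_case=False,
)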

@hdatteln
Thank you, @jhyuklee! Yes, that's what I did in the end, and it seems to be working OK:

the_tokenizer = BertTokenizer.from_pretrained('biobert_f/biobert_v1.1_pubmed/vocab.txt')
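
As a quick sanity check (a sketch; the sentence is arbitrary), tokenizing any biomedical string should now use BioBERT's WordPiece vocabulary:

# Prints the WordPiece split under BioBERT's vocabulary
print(the_tokenizer.tokenize('Aspirin inhibits cyclooxygenase.'))

Note that loading from a bare vocab.txt skips tokenizer_config.json, so options such as do_lower_case fall back to their defaults; for BioBERT's cased vocabulary, passing do_lower_case=False (as in the sketch above) is the safer choice.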
