Is the vocab.txt correct? #1

joelkuiper · 2019-01-29T15:17:52Z

Just a general question, I guess, but after inspecting the vocab.txt it doesn't seem to be particularly biomedical (it seems like it's the old one). Is this correct?

I'm trying to use these pretrained models in an experiment for NER, and I'd like to be able to acquire a distributional vector given a sequence of tokens (ideally bolting it into an existing Keras model, but I'm not set on that idea)

jhyuklee · 2019-01-29T15:35:40Z

Yes, the WordPiece vocab is exactly the same as the original BERT's, for several reasons. First, we wanted to start from the pre-trained BERT released by Google, which requires us to use the same WordPiece vocab. Second, because the WordPiece vocab is based on subword units, any new word in a biomedical corpus can still be split into known subwords and turned into proper embeddings (which might be tuned during fine-tuning). We could try building our own vocab from biomedical corpora, but that would lose compatibility with the original pre-trained BERT.
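To illustrate the second point, here is a minimal sketch of greedy longest-match-first WordPiece splitting. The vocabulary `TOY_VOCAB` and the helper name `wordpiece_tokenize` are made up for this example; the real BERT vocab.txt will split the same word differently, and the real tokenizer also handles casing, punctuation, and whitespace first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Split one word into subword pieces by greedy longest-match-first."""
    if len(word) > max_chars:
        return [unk_token]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (NOT the real BERT vocab.txt).
TOY_VOCAB = {"[UNK]", "the", "im", "##mun", "##og", "##lo", "##bul", "##in", "##s"}

print(wordpiece_tokenize("immunoglobulin", TOY_VOCAB))
# ['im', '##mun', '##og', '##lo', '##bul', '##in']
```

A word that never appeared in the vocabulary still maps to a sequence of known pieces, each with its own embedding; only a word containing a span that no piece covers falls back to `[UNK]`.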

joelkuiper · 2019-01-29T17:09:44Z

Got it! Thanks for the quick and helpful reply 👍

I can understand why keeping compatibility with the original BERT is important. Personally, I would like to have a custom dictionary, since I think there might be an interesting opportunity for fine-tuning: a lot of medical jargon (like drug names and chemicals) has a somewhat unique internal structure that is now lost during subword tokenization. But it'd be rude to ask you to train that! Feel free to close, and thank you for this great contribution!

jhyuklee · 2019-01-30T01:29:27Z

We have a plan for using a custom dictionary, but it will require many more GPU hours to pre-train such a model compared to starting from the pre-trained BERT. We'll share it if it works. Thank you for your interest, and I'll close the issue.

phosseini · 2019-06-18T19:08:58Z

> We have a plan for using a custom dictionary, but it will require many more GPU hours to pre-train such a model compared to starting from the pre-trained BERT. We'll share it if it works. Thank you for your interest, and I'll close the issue.

I wonder if there's any update on using the custom dictionary and if it's a work in progress or on your TODO list?

jhyuklee · 2019-06-19T00:13:32Z

Hi @phosseini,
We are working on a custom vocabulary with the BERT-large version. It might take some time (maybe a couple of months) to find good pre-training steps.

Thanks.

lucky-bai · 2019-09-01T21:41:03Z

Hi @jhyuklee:

A random idea I had: would it be possible to use a custom vocabulary without redoing the BERT pretraining? One way to transfer the model onto a different vocabulary might proceed as follows:

1. Train a new WordPiece vocab.
2. Use a technique similar to model distillation to learn to replicate the same intermediate representation as the existing BERT, but using the new tokenizer. In other words, minimize the MSE between the old BERT after the first layer and the new BERT after the first layer. This paper uses a similar technique.
3. Do this for a number of iterations, training only the first layer while keeping all other layers fixed.

The benefit of this is to avoid most of the expensive BERT pre-training: only the first layer would be trained from scratch, rather than the whole model. Thoughts?
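A toy sketch of the core idea, under heavy assumptions: a frozen "teacher" representation is computed from old-vocabulary pieces, and a single trainable "student" embedding for a new-vocabulary token is fitted to it by gradient descent on an MSE loss. Everything here is hypothetical (the 4-dimensional vectors, the piece split, the names `teacher_emb` and `student_emb`), and mean pooling stands in for "the output after the first layer" only to sidestep the fact that the two tokenizers produce sequences of different lengths. It is not a BERT implementation.

```python
import random

DIM = 4
rng = random.Random(0)  # fixed seed so the run is reproducible

# Frozen "teacher": toy embeddings for old-vocabulary pieces.
old_pieces = ["im", "##mun", "##og", "##lo", "##bul", "##in"]
teacher_emb = {p: [rng.uniform(-1, 1) for _ in range(DIM)] for p in old_pieces}


def mean_pool(vectors):
    """Average a list of DIM-dimensional vectors into one vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Teacher target: pooled representation of the word under the OLD tokenizer.
target = mean_pool([teacher_emb[p] for p in old_pieces])

# Trainable "student": one embedding for the same word under a NEW vocab
# in which the word is a single token. Only this vector is updated.
student_emb = {"immunoglobulin": [rng.uniform(-1, 1) for _ in range(DIM)]}


def train(steps=200, lr=0.5):
    vec = student_emb["immunoglobulin"]
    losses = []
    for _ in range(steps):
        losses.append(mse(vec, target))
        # Gradient of mean((v - t)^2) with respect to v_i is 2 * (v_i - t_i) / DIM.
        for i in range(DIM):
            vec[i] -= lr * 2 * (vec[i] - target[i]) / DIM
    return losses


losses = train()
final_loss = mse(student_emb["immunoglobulin"], target)
print(f"initial MSE {losses[0]:.4f}, final MSE {final_loss:.2e}")
```

In a real setting the target would come from the frozen model's hidden states over a corpus, the trainable part would be the new embedding table plus the first layer, and aligning token positions between the two tokenizers would need an explicit design choice that this sketch does not address.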

sebpretzer · 2020-05-14T01:17:25Z

Hi @jhyuklee,
I saw you updated your weights with a custom vocabulary. I was wondering if you had any information on how you trained that model. Anything along the lines of:

- How long did it take to train your model?
- I assume you still used the 8xV100 machine for training?
- Did you use the original BERT dataset in your pre-training (Wikipedia + BooksCorpus)?

Thank you!

joelkuiper changed the title ~~Is the vocab correct?~~ Is the vocab.txt correct? Jan 29, 2019

jhyuklee closed this as completed Jan 30, 2019

mikerossgithub mentioned this issue Feb 1, 2019

Why does vocab not appear to be medically oriented? dmis-lab/biobert#4

Closed

jhyuklee mentioned this issue Mar 31, 2019

Domain Specific Pre-training Model #4

Closed

anjani-dhrangadhariya mentioned this issue Oct 24, 2019

Files for BioBERT tokenizer #11

Closed

EmilyAlsentzer mentioned this issue Dec 23, 2020

Vocabulary for the pre-trained model is not updated ? Any reason why EmilyAlsentzer/clinicalBERT#31

Closed
