
Vocabulary for the pre-trained model is not updated? Any reason why #31

Closed · NeverInAsh opened this issue Dec 16, 2020 · 3 comments

@NeverInAsh commented Dec 16, 2020

Thanks for making such a comprehensive BERT model.

I am puzzled by the actual words I find in the model's vocabulary, though.
The author mentions that "The Bio_ClinicalBERT model was trained on all notes from MIMIC III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC". I assumed this would mean that the vocabulary was also updated.

But when I look at the vocabulary, I don't see medical concepts.

from transformers import TFBertModel, BertConfig, BertTokenizerFast
# Load the pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizerFast.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
tokenizer.vocab.keys()

['Cafe', 'locomotive', 'sob', 'Emilio', 'Amazing', '##ired', 'Lai', 'NSA', 'counts', '##nius', 'assumes', 'talked', 'ク', 'rumor', 'Lund', 'Right', 'Pleasant', 'Aquino', 'Synod', 'scroll', '##cope', 'guitarist', 'AB', '##phere', 'resulted', 'relocation', 'ṣ', 'electors', '##tinuum', 'shuddered', 'Josephine', '"', 'nineteenth', 'hydroelectric', '##genic', '68', '1000', 'offensive', 'Activities', '##ito', 'excluded', 'dictatorship', 'protruding', '1832', 'perpetual', 'cu', '##36', 'outlet', 'elaborate', '##aft', 'yesterday', '##ope', 'rockets', 'Eduard', 'straining', '510', 'passion', 'Too', 'conferred', 'geography', '38', 'Got', 'snail', 'cellular', '##cation', 'blinked', 'transmitted', 'Pasadena', 'escort', 'bombings', 'Philips', '##cky', 'sacks', '##Ñ', 'jumps', 'Advertising', 'Officer', '##ulp', 'potatoes', 'concentration', 'existed', '##rrigan', '##ier', 'Far', 'models', 'strengthen', 'mechanics'...]

Am I missing something here?

Also, is there an uncased version of this model?

@EmilyAlsentzer (Owner) commented
You are correct that the clinicalBERT models use the exact same vocabulary as the original BERT models. This is because we first initialized the models with the BERT base parameters and then further trained the masked LM & next sentence prediction heads on MIMIC data. While training BERT from scratch on clinical data with a clinical vocabulary would certainly be better, training from scratch is very expensive (i.e. requires extensive GPU resources & time).
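As a quick sanity check of the shared vocabulary, here is a minimal sketch, assuming the public Hugging Face checkpoints and that bert-base-cased is the appropriate baseline to compare against:

from transformers import BertTokenizerFast

# Load both tokenizers and compare their vocabularies directly.
clinical = BertTokenizerFast.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
base = BertTokenizerFast.from_pretrained('bert-base-cased')

# If the vocabularies are identical, this prints True and the same size twice.
print(clinical.get_vocab() == base.get_vocab())
print(len(clinical.get_vocab()), len(base.get_vocab()))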

That being said, BERT uses word pieces for its vocabulary, rather than just whole words. Traditionally in NLP, any words not found in the vocabulary are represented as an UNKNOWN token. This makes it difficult to generalize to new domains. However, because BERT uses word pieces, this problem is not as severe. If a word does not appear in the BERT vocabulary during preprocessing, then the word is broken down to its word pieces. For example, penicillin may not be in the BERT vocabulary, but perhaps the word pieces "pen", "i", and "cillin" are present. In this example, the word piece "pen" would then likely have a very different contextual embedding in clinicalBERT compared to general domain BERT because it is frequently found in the context of a drug. In the paper, we show that the nearest neighbors of embeddings of disease & operations-related words make more sense when the words are embedded by clinicalBERT compared to bioBERT & general BERT.
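To illustrate the word-piece splitting, here is a minimal sketch; the exact pieces depend on the vocabulary, so the actual output may differ from the hypothetical "pen" / "i" / "cillin" split above:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')

# Out-of-vocabulary words are broken into sub-word pieces rather than mapped to [UNK].
print(tokenizer.tokenize('penicillin'))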

Unfortunately, we don't have an uncased version of the model at this time.

Hope this helps!

@NeverInAsh (Author) commented

Thanks for a very crisp reply. One question, though. When you say

"nearest neighbors of embeddings of disease & operations-related words make more sense when the words are embedded by clinicalBERT compared to bioBERT"

is there any non-empirical explanation for it? BioBERT seems to have a custom vocabulary that covers many concepts. I am attaching an image showing the top 20 UMLS concepts by count in its vocabulary.
[image: top 20 UMLS concepts by count in the BioBERT vocabulary]
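For reference, here is a rough sketch of the kind of nearest-neighbor comparison being discussed, using the model's static input (word-piece) embeddings and cosine similarity; this is only an illustrative approximation of such an analysis, not the paper's exact procedure, and "pain" is just an example query token assumed to be in the vocabulary:

import torch
from transformers import BertTokenizerFast, BertModel

name = 'emilyalsentzer/Bio_ClinicalBERT'
tokenizer = BertTokenizerFast.from_pretrained(name)
model = BertModel.from_pretrained(name)

# Unit-normalize the input embedding matrix so dot products are cosine similarities.
emb = model.get_input_embeddings().weight.detach()    # (vocab_size, hidden_size)
emb = torch.nn.functional.normalize(emb, dim=1)

query_id = tokenizer.convert_tokens_to_ids('pain')    # falls back to [UNK] if absent
scores = emb @ emb[query_id]                           # similarity to every vocabulary token
top = torch.topk(scores, k=11).indices.tolist()[1:]    # top 10, skipping the query itself
print(tokenizer.convert_ids_to_tokens(top))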

@EmilyAlsentzer (Owner) commented

I think BioBERT updated their model recently (or at least after clinicalBERT was published). The model we compared to in our paper had the same vocabulary as BERT. Check out the issue on their GitHub where someone had a similar question to yours.

I do agree with you that a custom vocabulary would likely be better. I don't currently have the bandwidth to train it, but if you end up doing so, let us know!
