
tokenizer.add_tokens() interferes with downstream NER task #175

Open
laurens777 opened this issue Apr 7, 2022 · 0 comments
Goal: add domain-specific clinical tokens to the tokenizer so that it does not split them into subword pieces.

I am using the following code to add a few tokens to the tokenizer:

tokenizer.add_tokens(["MV", "AV"])
model.resize_token_embeddings(len(tokenizer))
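
For context, here is a fuller, self-contained version of what I am running (the checkpoint name and num_labels are placeholders for my actual setup):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# placeholder checkpoint; my actual model is a BioBERT NER checkpoint
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=5)

# register the clinical abbreviations as whole tokens, then grow the embedding
# matrix so the new vocabulary ids get (randomly initialised) embedding rows
tokenizer.add_tokens(["MV", "AV"])
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("MV regurgitation"))  # 'MV' should now stay a single token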

After fine-tuning, the tokenizer no longer splits these tokens into single characters and keeps each of them as a single token. However, the model no longer assigns the correct NER tag to those tokens. I have double-checked my training data and there are no issues there, so this appears to be an error in the BioBERT code.
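
One way I have been sanity-checking the tokenization and label alignment around the added tokens (this assumes a fast tokenizer; the words and tag set below are illustrative):

words = ["Severe", "MV", "regurgitation"]
labels = ["O", "B-ANAT", "O"]  # illustrative tag set

enc = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# each added token should map back to exactly one word index; a None here
# would be one place where label alignment could break during training
print(enc.word_ids())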
