
tokenizer.add_tokens() interferes with downstream NER task #175

Open
laurens777 opened this issue Apr 7, 2022 · 0 comments
Goal: add domain-specific clinical tokens to the tokenizer so that it does not split them into subword pieces.

I am using the following code to add a few tokens to the tokenizer:

tokenizer.add_tokens(["MV", "AV"])
model.resize_token_embeddings(len(tokenizer))
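
For context, here is a fuller, self-contained version of what I am running (the checkpoint name and num_labels are placeholders for my actual setup):

from transformers import AutoTokenizer, AutoModelForTokenClassification

# placeholder checkpoint; my actual model is a BioBERT NER checkpoint
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=5)

# register the clinical abbreviations as whole tokens, then grow the embedding
# matrix so the new vocabulary ids get (randomly initialised) embedding rows
tokenizer.add_tokens(["MV", "AV"])
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("MV regurgitation"))  # 'MV' should now stay a single token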

After fine-tuning, the tokenizer no longer splits these tokens into single characters and keeps each of them as a single token. However, the model no longer assigns the correct NER tag to those tokens. I have double-checked my training data and there are no issues there, so this appears to be an error in the BioBERT code.
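
One way I have been sanity-checking the tokenization and label alignment around the added tokens (this assumes a fast tokenizer; the words and tag set below are illustrative):

words = ["Severe", "MV", "regurgitation"]
labels = ["O", "B-ANAT", "O"]  # illustrative tag set

enc = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# each added token should map back to exactly one word index; a None here
# would be one place where label alignment could break during training
print(enc.word_ids())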
