
Adding new terms into pre-trained model vocab | Issue in tokenizing OOV keywords #63

Open
spate141 opened this issue Feb 24, 2020 · 0 comments


@spate141

I've trained a tokenizer with a 50k vocab on over 500M sentences. I'm now encoding many keywords that contain OOV tokens, and the tokenizer does a poor job of tokenizing them. I was wondering if there's any way to introduce an option that allows users to modify the vocab after the tokenizer is trained. I've seen the issue where the suggestion was to repeat these OOV terms some number of times (1000?) in the training data, so that the tokenizer identifies them during training and adds them to the vocab (a rough sketch of that workaround is included after the example below). But the problem there is that there's no reliable way to know in advance which terms need to be included in the training data! Any thoughts on how to handle such situations?

model.encode([
    '1997',
    '1998',
    '1996',
    '1999',
    '1994'
])

This generates the following tokens; all five years collapse to the same pair of ids:

[
    [137, 1],
    [137, 1],
    [137, 1],
    [137, 1],
    [137, 1]
]
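
For reference, here is a minimal sketch of the oversampling workaround mentioned above: append each known OOV keyword to a copy of the training corpus many times before retraining, so the BPE trainer sees it often enough to merge it into a single vocab entry. Everything here is an assumption for illustration, not part of this library: the file names, the oversample_keywords helper, the repeat count of 1000, and the premise that the trainer reads a plain-text corpus with one sentence per line.

def oversample_keywords(corpus_path, keywords, out_path, repeats=1000):
    # Copy the original corpus, then append each known OOV keyword
    # `repeats` times so the BPE trainer encounters it often enough
    # to learn it as a single vocab entry.
    with open(corpus_path, "r", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
        for kw in keywords:
            dst.write((kw + "\n") * repeats)

# Hypothetical usage: write an augmented corpus, then retrain on it.
oversample_keywords("train.txt", ["1994", "1996", "1997", "1998", "1999"],
                    "train_augmented.txt")

This leaves the rest of the corpus distribution untouched; the open question in this issue, of course, is how to enumerate the keywords list when the OOV terms aren't known ahead of time.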