
Training Slovenian (and possibly other) customized lemmatize models produces incorrect predictions with <UNK> signs #75

Open
lkrsnik opened this issue Jul 26, 2023 · 0 comments


lkrsnik commented Jul 26, 2023

To Reproduce

import trankit

trainer = trankit.TPipeline(
    training_config={
        'category': 'customized',  # pipeline category for user-trained models
        'task': 'lemmatize',
        'save_dir': <PATH>,
        'train_conllu_fpath': <PATH>,
        'dev_conllu_fpath': <PATH>
    }
)
trainer.train()

Expected behavior
The trained model should produce lemmas with accuracy on par with the default Slovenian model. Instead, many predicted lemmas contain <UNK> signs.

Environment:

  • OS: Ubuntu 18.04.5 LTS
  • Python version: Python 3.9.16
  • Trankit version: 1.1.1

Temporary solution
A temporary workaround is to re-point the following module-level constants before training:

import trankit
from trankit.utils.mwt_lemma_utils.seq2seq_utils import VOCAB_PREFIX, SOS, EOS

# Re-point the special-token constants in seq2seq_vocabs at the values
# defined in seq2seq_utils, so both modules build vocabularies with the
# same special-token prefix.
trankit.utils.mwt_lemma_utils.seq2seq_vocabs.EMPTY = SOS
trankit.utils.mwt_lemma_utils.seq2seq_vocabs.ROOT = EOS
trankit.utils.mwt_lemma_utils.seq2seq_vocabs.VOCAB_PREFIX = VOCAB_PREFIX

Note: The provided temporary solution seems to address the issue, but a more permanent fix may be required in the Trankit library to avoid the need for this workaround.
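To illustrate why mismatched special-token constants can surface as <UNK> in the output, here is a minimal, self-contained sketch. It is not Trankit code, and the token names and the two-token prefix difference are assumptions for illustration only: if the vocabulary used at prediction time has a different special-token prefix than the one used at training time, every character id is shifted, and ids past the end of the table decode to the unknown symbol.

```python
PAD, UNK, SOS, EOS = "<pad>", "<unk>", "<sos>", "<eos>"

# Vocabulary as built at training time: four special tokens, then characters.
chars = list("hodit")  # hypothetical character inventory
train_vocab = [PAD, UNK, SOS, EOS] + chars
char2id = {c: i for i, c in enumerate(train_vocab)}

# Vocabulary as reconstructed at prediction time with a shorter special-token
# prefix (e.g. constants resolving differently between two modules): the
# character ids no longer line up with the training vocabulary.
pred_vocab = [PAD, UNK] + chars
id2char = {i: c for i, c in enumerate(pred_vocab)}

ids = [char2id[c] for c in "hodit"]           # ids 4..8 under the training vocab
decoded = [id2char.get(i, UNK) for i in ids]  # shifted by two at decode time
print("".join(decoded))                       # → dit<unk><unk>
```

Running this prints `dit<unk><unk>`: the first characters are silently wrong and the ids that fall off the end of the shorter table come out as the unknown token, which matches the symptom reported in the title. The workaround above avoids this by forcing both Trankit modules to agree on one special-token prefix.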
