
Training Slovenian (and possibly other) customized lemmatize models produces incorrect predictions with <UNK> signs #75

Open
lkrsnik opened this issue Jul 26, 2023 · 0 comments


lkrsnik commented Jul 26, 2023

To Reproduce

import trankit

trainer = trankit.TPipeline(
    training_config={
        'category': 'customized',  # pipeline category for user-trained models
        'task': 'lemmatize',
        'save_dir': <PATH>,
        'train_conllu_fpath': <PATH>,
        'dev_conllu_fpath': <PATH>
    }
)
trainer.train()

Expected behavior
The trained model should produce lemmas with accuracy on par with the default Slovenian model. Instead, many predicted lemmas contain <UNK> signs.

Environment:

  • OS: Ubuntu 18.04.5 LTS
  • Python version: Python 3.9.16
  • Trankit version: 1.1.1

Temporary solution
A temporary workaround is to re-point the following module-level constants before training:

import trankit
from trankit.utils.mwt_lemma_utils.seq2seq_utils import VOCAB_PREFIX, SOS, EOS

# Re-point the special-token constants in seq2seq_vocabs at the values
# defined in seq2seq_utils, so both modules build vocabularies with the
# same special-token prefix.
trankit.utils.mwt_lemma_utils.seq2seq_vocabs.EMPTY = SOS
trankit.utils.mwt_lemma_utils.seq2seq_vocabs.ROOT = EOS
trankit.utils.mwt_lemma_utils.seq2seq_vocabs.VOCAB_PREFIX = VOCAB_PREFIX

Note: The provided temporary solution seems to address the issue, but a more permanent fix may be required in the Trankit library to avoid the need for this workaround.
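To illustrate why mismatched special-token constants can surface as <UNK> in the output, here is a minimal, self-contained sketch. It is not Trankit code, and the token names and the two-token prefix difference are assumptions for illustration only: if the vocabulary used at prediction time has a different special-token prefix than the one used at training time, every character id is shifted, and ids past the end of the table decode to the unknown symbol.

```python
PAD, UNK, SOS, EOS = "<pad>", "<unk>", "<sos>", "<eos>"

# Vocabulary as built at training time: four special tokens, then characters.
chars = list("hodit")  # hypothetical character inventory
train_vocab = [PAD, UNK, SOS, EOS] + chars
char2id = {c: i for i, c in enumerate(train_vocab)}

# Vocabulary as reconstructed at prediction time with a shorter special-token
# prefix (e.g. constants resolving differently between two modules): the
# character ids no longer line up with the training vocabulary.
pred_vocab = [PAD, UNK] + chars
id2char = {i: c for i, c in enumerate(pred_vocab)}

ids = [char2id[c] for c in "hodit"]           # ids 4..8 under the training vocab
decoded = [id2char.get(i, UNK) for i in ids]  # shifted by two at decode time
print("".join(decoded))                       # → dit<unk><unk>
```

Running this prints `dit<unk><unk>`: the first characters are silently wrong and the ids that fall off the end of the shorter table come out as the unknown token, which matches the symptom reported in the title. The workaround above avoids this by forcing both Trankit modules to agree on one special-token prefix.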
