Commit

If a token is completely erased by the MWT seq2seq, restore it as a single word. #1401
AngledLuffa committed Jul 15, 2024
1 parent 657f467 commit 384d65c
Showing 1 changed file with 4 additions and 0 deletions.
4 changes: 4 additions & 0 deletions stanza/models/mwt/trainer.py
@@ -87,6 +87,10 @@ def predict(self, batch, unsort=True):
                 pred_tokens.append("".join(pred_seq))
         else:
             pred_tokens = ["".join(seq) for seq in pred_seqs] # join chars to be tokens
+        # if any tokens are predicted to expand to blank,
+        # that is likely an error.  use the original text
+        # this originally came up with the Spanish model turning 's' into a blank
+        pred_tokens = [x if x else y for x, y in zip(pred_tokens, orig_text)]
         if unsort:
             pred_tokens = utils.unsort(pred_tokens, orig_idx)
         return pred_tokens
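
The fallback added in this hunk can be tried in isolation. The following is a minimal sketch, not code from the repository: the helper name `restore_blank_predictions` and the sample tokens are hypothetical, and only the list comprehension mirrors the added line.

```python
def restore_blank_predictions(pred_tokens, orig_text):
    """Replace any empty predicted expansion with the original token text.

    Hypothetical helper for illustration; it wraps the same comprehension
    that the commit adds inside predict().
    """
    return [pred if pred else orig for pred, orig in zip(pred_tokens, orig_text)]


# Made-up sample data: the second prediction was erased by the seq2seq model.
orig_text = ["del", "s", "al"]
pred_tokens = ["de el", "", "a el"]

restored = restore_blank_predictions(pred_tokens, orig_text)
print(restored)  # ['de el', 's', 'a el']
```

Two properties of this approach follow from plain Python semantics: only the empty string is falsy, so a prediction consisting solely of whitespace would be kept as-is, and `zip` stops at the shorter input, so the two lists are assumed to have the same length.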