Changes to input format of pretokenized text

@BramVanroy BramVanroy released this 17 Jan 10:03
· 12 commits to master since this release

Since spaCy 3.2.0, spaCy is stricter about the data that can be passed to a pipeline. As a result, passing
a list of pretokenized tokens (["This", "is", "a", "pretokenized", "sentence"]) is no longer accepted, and
the is_tokenized option had to be adapted accordingly. It is still possible to pass a string in which tokens
are separated by whitespace, e.g. "This is a pretokenized sentence"; this continues to work for spaCy and
stanza. Support for pretokenized data has been dropped for UDPipe.
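
For code that still holds its input as a list of tokens, the conversion to the whitespace-separated form can be sketched roughly as below. The helper name to_pretokenized_text is hypothetical and not part of this library, and the check that rejects tokens containing whitespace is an assumption about what a caller would want, since such a token would be split apart again downstream.

```python
from typing import Iterable


def to_pretokenized_text(tokens: Iterable[str]) -> str:
    """Join tokens into one whitespace-separated string (hypothetical helper)."""
    tokens = list(tokens)
    for token in tokens:
        # Assumption: an empty token or one containing whitespace cannot
        # survive a round trip through whitespace splitting, so reject it.
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(f"Token must be non-empty and contain no whitespace: {token!r}")
    return " ".join(tokens)


text = to_pretokenized_text(["This", "is", "a", "pretokenized", "sentence"])
print(text)
```

The resulting string is what would then be passed on in place of the token list.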

Specific changes:

  • [conllparser] Breaking change: is_tokenized is not a valid argument to ConllParser any more.
  • [utils/conllparser] Breaking change: when using UDPipe, pretokenized data is not supported any more.
  • [utils] Breaking change: SpacyPretokenizedTokenizer.__call__ does not support a list of tokens any more.