Releases: BramVanroy/spacy_conll

v4.0.0

02 Jul 08:38

What's Changed

Two new changes thanks to user @rominf:

  1. Repackaged the library to bring it up to modern standards, notably relying on a pyproject.toml file and
    removing support for Python <3.8.
  2. When the dep, pos, tag, or lemma fields are empty, an underscore (_) is used instead.
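
The placeholder behaviour in item 2 can be pictured with a small sketch. The helper conll_value below is
hypothetical and is not part of spacy_conll; it only illustrates replacing an empty field with the underscore
that CoNLL-U uses for unspecified values.

```python
def conll_value(value: str) -> str:
    """Return the value itself, or "_" when the field is empty (hypothetical helper)."""
    return value if value else "_"


# A token for which the pipeline produced no lemma and no dependency label.
fields = {"lemma": "", "pos": "NOUN", "tag": "NN", "dep": ""}

# Empty fields become "_"; filled fields are left untouched.
row = {name: conll_value(value) for name, value in fields.items()}
print(row)
```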

Full Changelog: v3.4.0...v4.0.0

Update default field names and allow custom ones

07 Apr 13:21

What's Changed

Full Changelog: v3.3.0...v3.4.0

Changes to input format of pretokenized text

17 Jan 10:03

Since spaCy 3.2.0, the input that can be passed to a spaCy pipeline is checked more strictly. As a result, a list
of pretokenized tokens (["This", "is", "a", "pretokenized", "sentence"]) is no longer accepted, and the is_tokenized
option had to be adapted accordingly. It is still possible to pass a string in which tokens are separated by
whitespace, e.g. "This is a pretokenized sentence"; this continues to work for spaCy and stanza. Support for
pretokenized data has been dropped for UDPipe.

Specific changes:

  • [conllparser] Breaking change: is_tokenized is not a valid argument to ConllParser any more.
  • [utils/conllparser] Breaking change: when using UDPipe, pretokenized data is not supported any more.
  • [utils] Breaking change: SpacyPretokenizedTokenizer.__call__ does not support a list of tokens any more.
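
For code that still holds tokens in a list, one possible migration is to join them with single spaces before
passing the text to the pipeline. This sketch uses plain Python only; the variable names are illustrative, and it
assumes that no token itself contains whitespace.

```python
tokens = ["This", "is", "a", "pretokenized", "sentence"]

# Join on single spaces so that each whitespace-separated chunk is one token.
text = " ".join(tokens)
print(text)

# Splitting on whitespace recovers the tokens, provided no token contains whitespace.
assert text.split() == tokens
```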

Entry points and quality of life improvements

04 Apr 12:16
  • [conllformatter] Fixed an issue where SpaceAfter=No was not added correctly to tokens
  • [conllformatter] Added ConllFormatter as an entry point, so you no longer have to import spacy_conll
    before adding the pipe to a parser: spaCy knows where to find the CoNLL formatter when you call
    nlp.add_pipe("conll_formatter")
  • [conllformatter] The component factory is now registered on a construction function rather than directly on
    the class, as recommended by spaCy. The formatter has also been rewritten as a dataclass
  • [conllformatter/utils] Moved merge_dicts_strict to utils, outside the formatter class
  • [conllparser] Make ConllParser directly importable from the root of the library, i.e.,
    from spacy_conll import ConllParser
  • [init_parser] Allow users to exclude pipeline components when using the spaCy parser with the
    exclude_spacy_components argument
  • [init_parser] Fixed an issue where disabling sentence segmentation would not work if your model does
    not have a parser
  • [init_parser] Enable more options when using stanza in terms of pre-segmented text. Now you can also disable
    sentence segmentation for stanza (but still do tokenization) with the disable_sbd option
  • [utils] Added SpacyDisableSentenceSegmentation as an entry-point custom component so that you can use it in your
    own code, by calling nlp.add_pipe("disable_sbd", before="parser")
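
As a rough illustration of the entry-point usage described above, the sketch below adds the formatter to an
existing spaCy v3 pipeline by name. The helper add_conll_formatter is hypothetical; pipe_names and add_pipe are
standard members of spaCy's Language class, and the component name "conll_formatter" is taken from the notes above.

```python
def add_conll_formatter(nlp):
    """Add the CoNLL formatter to a spaCy v3 pipeline by name (hypothetical helper).

    No `import spacy_conll` is needed here: spaCy is expected to resolve the
    "conll_formatter" factory through the package's entry point.
    """
    # Avoid adding the component twice if it is already in the pipeline.
    if "conll_formatter" not in nlp.pipe_names:
        nlp.add_pipe("conll_formatter")
    return nlp
```

With spaCy, spacy_conll, and a model installed, this could be called as
add_conll_formatter(spacy.load("en_core_web_sm")).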

Fix no_split_on_newline

14 Jul 15:16
  • [conllparser] Fix: fixed an issue with no_split_on_newline in combination with nlp.pipe

Bugfix for ConllParser: do not require stanza and udpipe

14 Jul 05:50
  • [conllparser] Fix: make sure the parser also runs if stanza and UDPipe are not installed

Release for spaCy v3

12 Jul 10:17

This release makes spacy_conll compatible with spaCy's new v3 release. On top of that, some improvements were made
to make the project easier to maintain.

  • [general] Breaking change: spaCy v3 required (closes #8)
  • [init_parser] Breaking change: in all cases, is_tokenized now disables sentence segmentation
  • [init_parser] Breaking change: no more default values for parser or model anywhere. Important to note here that
    spaCy does not work with short-hand codes such as en any more. You have to provide the full model name, e.g.
    en_core_web_sm
  • [init_parser] Improvement: models are automatically downloaded for Stanza and UDPipe
  • [cli] Reworked the position of the CLI script in the directory structure as well as the arguments. Run
    parse-as-conll -h for more information.
  • [conllparser] Made the ConllParser class available as a utility to easily create a wrapper for a spaCy-like
    parser which can return the parsed CoNLL output of a given file or text
  • [conllparser,cli] Improvements to the usability of n_process. The parser will try to figure out whether
    multiprocessing is available for your platform and, if it is not, tell you so. Such a priori error messages can
    be disabled with ignore_pipe_errors, both on the command line and in ConllParser's parse methods

Preparing for v3 release

23 Jun 13:00
  • Last version to support spaCy v2. New versions will require spaCy v3
  • Last version to support spacy-stanfordnlp. spacy-stanza is still supported

Stanza and UDPipe support, easy-to-use utility function, Token-attributes, and more

11 May 17:36

Fully reworked version!

  • Tested support for both spacy-stanza and spacy-udpipe! (Not included as a dependency, install manually)
  • Added a useful utility function init_parser that can easily initialise a parser together with the custom
    pipeline component. (See the README or examples)
  • Added the disable_pandas flag to the formatter class in case you want to disable setting the pandas
    attribute even when pandas is installed.
  • Added custom properties for Tokens as well. A Doc, its sentence Spans, and its Tokens now all have custom attributes
  • Reworked datatypes of output. In version 2.0.0 the data types are as follows:
    • ._.conll: raw CoNLL format
      • in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as
        values.
      • in sentence Span: a list of its tokens' ._.conll dictionaries (list of dictionaries).
      • in a Doc: a list of its sentences' ._.conll lists (list of list of dictionaries).
    • ._.conll_str: string representation of the CoNLL format
      • in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline.
      • in sentence Span: the expected CoNLL format where each row represents a token. When
        ConllFormatter(include_headers=True) is used, two header lines are included as well, as per the
        CoNLL format.
      • in Doc: all its sentences' ._.conll_str combined and separated by new lines.
    • ._.conll_pd: pandas representation of the CoNLL format
      • in Token: a Series representation of this token's CoNLL properties.
      • in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column
        headers.
      • in Doc: a concatenation of its sentences' DataFrames, resulting in a new DataFrame whose
        index is reset.
  • field_names has been removed, assuming that you do not need to change the column names of the CoNLL properties
  • Removed the Spacy2ConllParser class
  • Many doc changes, added tests, and a few examples
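
The nesting of the ._.conll and ._.conll_str data types listed above can be mimicked with plain Python
structures. The sketch below does not call spacy_conll; the column names follow CoNLL-U and are an assumption
about the exact dictionary keys, and the annotation values are invented for illustration.

```python
# Column names follow CoNLL-U; the exact keys used by the library are an assumption.
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def token_row(values):
    """Token level: one dictionary mapping each CoNLL field to its value."""
    return dict(zip(FIELDS, values))


def token_str(token):
    """Token level string: tab-separated fields ending with a newline."""
    return "\t".join(str(token[name]) for name in FIELDS) + "\n"


# Sentence level: a list of token dictionaries (values invented for illustration).
sentence = [
    token_row([1, "Hello", "hello", "INTJ", "UH", "_", 0, "ROOT", "_", "_"]),
    token_row([2, "world", "world", "NOUN", "NN", "_", 1, "npadvmod", "_", "SpaceAfter=No"]),
]

# Document level: a list of sentences, i.e. a list of lists of dictionaries.
doc = [sentence]

# Sentence string: one row per token. Document string: sentences separated by new lines.
sentence_str = "".join(token_str(token) for token in sentence)
doc_str = "\n".join("".join(token_str(token) for token in sent) for sent in doc)
print(doc_str)
```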

Add SpaceAfter=No property

28 Apr 08:29
  • IMPORTANT: This will be the last release that supports the deprecated Spacy2ConllParser class!
  • Community addition: add SpaceAfter=No to the Misc field when applicable (#6). Thanks @KoichiYasuoka!
  • Fixed failing tests
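
A minimal sketch of the idea behind the SpaceAfter=No addition, assuming the decision is based on whether a token
is followed by whitespace (in spaCy, this information is available as Token.whitespace_). The helper misc_field is
hypothetical and is not the library's implementation.

```python
def misc_field(whitespace: str) -> str:
    """Return the Misc value for a token, given the whitespace that follows it."""
    # No trailing whitespace means the next token is attached directly.
    return "_" if whitespace else "SpaceAfter=No"


# (token text, trailing whitespace) pairs for the text "Hi, there!"
tokens = [("Hi", ""), (",", " "), ("there", ""), ("!", "")]
for text, whitespace in tokens:
    print(text, misc_field(whitespace), sep="\t")
```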