
Spacy version mismatch causes failure #3

Open
dresen opened this issue Aug 24, 2022 · 1 comment

Comments

dresen commented Aug 24, 2022

  • TextAnonymization version:
  • Python version: 3.8.13
  • Operating System: Windows, Ubuntu Linux

Description

I installed DaAnonymization with pip a week ago and tried to run your example from the readme, but it fails because of a version mismatch between the DaCy large model and the current spaCy version.

What I Did

The script anon_test.py:

from textprivacy import TextAnonymizer

# list of texts (example with cross-lingual transfer to english)
corpus = [
    "Hej, jeg hedder Martin Jespersen og er fra Danmark og arbejder i "
    "Deloitte, mit cpr er 010203-2010, telefon: +4545454545 "
    "og email: martin.martin@gmail.com",
    "Hi, my name is Martin Jespersen and work in Deloitte. "
    "I used to be a PhD. at DTU in Machine Learning and B-cell immunoinformatics "
    "at Anker Engelunds Vej 1 Bygning 101A, 2800 Kgs. Lyngby.",
]

Anonymizer = TextAnonymizer(corpus)

# Anonymize person, location, organization, emails, CPR and telephone numbers
anonymized_corpus = Anonymizer.mask_corpus()

for text in anonymized_corpus:
    print(text)
(anon): ~$ /home/akirkedal/software/anaconda/envs/anon/bin/python /home/akirkedal/software/anon/anon_test.py
/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py:762: UserWarning: [W095] Model 'da_dacy_large_tft' (0.0.0) was trained with spaCy v3.0 and may not be 100% compatible with the current version (3.1.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "/home/akirkedal/software/anon/anon_test.py", line 1, in <module>
    from textprivacy import TextAnonymizer
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/textprivacy/__init__.py", line 7, in <module>
    from textprivacy.textanonymization import TextAnonymizer
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/textprivacy/textanonymization.py", line 34, in <module>
    ner_model = dacy.load("da_dacy_large_tft-0.0.0")
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/dacy/load.py", line 39, in load
    return spacy.load(path)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 351, in load_model
    return load_model_from_path(Path(name), **kwargs)  # type: ignore[arg-type]
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 418, in load_model_from_path
    return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/language.py", line 2021, in from_disk
    util.from_disk(path, deserializers, exclude)  # type: ignore[arg-type]
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 1229, in from_disk
    reader(path / key)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/language.py", line 2015, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(  # type: ignore[misc]
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy_transformers/pipeline_component.py", line 402, in from_disk
    util.from_disk(path, deserialize, exclude)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy/util.py", line 1229, in from_disk
    reader(path / key)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy_transformers/pipeline_component.py", line 391, in load_model
    tokenizer, transformer = huggingface_from_pretrained(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/spacy_transformers/util.py", line 31, in huggingface_from_pretrained
    tokenizer = AutoTokenizer.from_pretrained(str_path, **config)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 568, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1732, in from_pretrained
    return cls._from_pretrained(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1850, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 134, in __init__
    super().__init__(
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 829, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 375, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 52, in <module>
    _descriptor.EnumValueDescriptor(name="UNIGRAM", index=0, number=1, options=None, type=None),
  File "/home/akirkedal/software/anaconda/envs/anon/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 755, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
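For reference, workaround 2 from the error message can be sketched as follows (an assumption about usage, not a confirmed fix for this package: the environment variable must be set before transformers, or anything else that loads protobuf-generated `_pb2` modules, is imported):

```python
import os

# Workaround 2: force the pure-Python protobuf parser. This avoids the
# "Descriptors cannot not be created directly" TypeError at the cost of
# much slower protobuf parsing. It only takes effect for modules imported
# after this line, so it must run before any transformers/spacy import.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Only now import the anonymizer, so the setting is in effect:
# from textprivacy import TextAnonymizer
```

Workaround 1 is done in the shell instead, e.g. `pip install "protobuf<=3.20.1"`, which keeps the fast C++ parser but pins protobuf below the 4.x line.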
@martincjespersen (Owner) commented

I believe it was simply DaCy that changed a bit in its installation. I tried it on my Mac and published a 0.1.1 version, @dresen. Please let me know if it solves the issue for you!
