Privacy violation by only offering online embeddings! #1057

Open
Bardo-Konrad opened this issue Apr 23, 2024 · 3 comments

Comments

@Bardo-Konrad

Bardo-Konrad commented Apr 23, 2024

Document Embeddings does not allow local models and therefore creates a privacy hazard.

Since I assume this was not a malicious design choice by the Bioinformatics Lab at the University of Ljubljana, Slovenia, you need to fix this and enable local open-source models.

@markotoplak
Member

Thanks, we would also prefer to have a local option. Do you know of any small models that are easily pip-installable? Preferably without something like a 1 GB dependency.

@Bardo-Konrad
Author

Bardo-Konrad commented Apr 23, 2024

You could try small language models such as Gemini Nano, Orca-2-7b, etc., and in general use spaCy, as in:

```shell
# Install spaCy
pip install -U spacy

# Download the small English model
python -m spacy download en_core_web_sm
```

```python
import spacy

# Load the installed model
nlp = spacy.load("en_core_web_sm")

# Use the model
doc = nlp("This is a sentence.")
```
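
The snippet above stops at creating a `Doc`; it does not yet produce an embedding. A minimal sketch of how per-document vectors could be computed locally follows. The helper names `load_local_pipeline` and `embed_documents` are hypothetical and not part of spaCy or Orange. Note also that, as far as I know, `en_core_web_sm` ships without static word vectors, so `doc.vector` is derived from the pipeline's context tensor; the larger `md`/`lg` models would likely give better vectors at the cost of size.

```python
from typing import Callable, Iterable, List


def load_local_pipeline(model_name: str = "en_core_web_sm"):
    """Load an already-installed spaCy pipeline from disk (hypothetical helper)."""
    import spacy  # imported lazily so embed_documents has no hard spaCy dependency

    return spacy.load(model_name)


def embed_documents(texts: Iterable[str], nlp: Callable) -> List[List[float]]:
    """Return one vector per document, computed entirely on the local machine.

    `nlp` is any callable that maps a string to an object with a `.vector`
    attribute, for example a loaded spaCy pipeline.
    """
    return [[float(x) for x in nlp(text).vector] for text in texts]
```

With spaCy and the model installed, this would be called as `embed_documents(["This is a sentence."], load_local_pipeline())`; no text leaves the machine.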

@janezd transferred this issue from biolab/orange3 on May 10, 2024
@ajdapretnar
Collaborator

spaCy would be super beneficial for adding a named entity recognition option! It might also give us a way to add Chinese tokenisation.
Note that spaCy would not cover 17 of the languages that fastText does (Catalan, Croatian, Lithuanian, Macedonian, Ukrainian, Arabic, Azerbaijani, Bengali, Hindi, Tajik, Turkish, Norwegian Nynorsk, Nepali, Kazakh, Indonesian, Hungarian, Hebrew), nor another 25 languages that multilingual SBERT covers. As an additional option, though, it would be great to have!
spaCy's smallest English model is 12 MB, plus an added 11 MB for the spaCy dependency itself.
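
To check size figures like these on a given machine, something along the following lines could be used. It is only a sketch: `installed_size_mb` is a hypothetical helper, it counts just the files recorded for one installed distribution (not that distribution's own dependencies), and the distribution names passed to it are assumptions.

```python
import os
from importlib import metadata
from typing import Optional


def installed_size_mb(dist_name: str) -> Optional[float]:
    """Return the on-disk size of an installed distribution in MB.

    Returns None if the distribution is not installed or if the installer
    recorded no file list for it.
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None

    files = dist.files  # None when no RECORD / file list is available
    if files is None:
        return None

    total_bytes = 0
    for package_path in files:
        try:
            total_bytes += os.path.getsize(package_path.locate())
        except OSError:
            # Skip entries that are listed but missing on disk.
            continue
    return total_bytes / 1_000_000
```

For example, `installed_size_mb("spacy")` and `installed_size_mb("en_core_web_sm")` could be compared against the numbers quoted above, assuming both are installed under those names.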
