
Embeddings for large KG #154

Open · deeplifyde opened this issue Oct 28, 2022 · 3 comments
Labels: question (Further information is requested)

@deeplifyde

❓ Question

Hello,
I'm currently trying to generate embeddings for a graph with ~13 million entities and 15 million walks. I'm using a machine with 128 GB of RAM, but the random walks, which are kept in memory, overflow it. Is there a way to store them on disk and batch-load them, as in image processing pipelines?

@GillesVandewiele (Collaborator) commented Oct 28, 2022

Yes, most definitely! Within the fit() function, extract_walks() is called; it returns a list of walks that are later fed to Word2Vec.

https://github.com/IBCNServices/pyRDF2Vec/blob/main/pyrdf2vec/rdf2vec.py#L107

You could call this function on different chunks of entities to extract walks iteratively (see the sketch after this list). We also have some mechanisms to speed up extraction and/or reduce memory usage:

  • Hashing of URLs (fewer bytes needed)
  • Using HDT to speed up KG loading time, or, even better, directly querying a locally hosted SPARQL endpoint (no loading time)
  • Extracting a limited number of walks per entity, though this reduces accuracy later on
  • Excluding certain types of walks (forbidden predicates) that you know carry little information
  • Etc.
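
A minimal sketch of the chunked extraction, assuming a KG behind a locally hosted SPARQL endpoint and a walker exposing extract(kg, entities) as in the linked code; the endpoint URL, chunk size, and the one-walk-per-line file layout are assumptions for illustration, not part of the pyRDF2Vec API:

```python
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

# Assumption: the KG lives behind a locally hosted SPARQL endpoint
# (no loading time, as mentioned above).
kg = KG("http://localhost:3030/dataset/sparql")

# Positional args: max depth and max walks per entity; capping walks
# per entity keeps memory in check (exact kwargs vary per version).
walker = RandomWalker(4, 10)

CHUNK_SIZE = 50_000  # tune to your RAM budget

def chunks(seq, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(seq), size):
        yield seq[i : i + size]

entities = [...]  # your full list of ~13M entity URIs

with open("walks.txt", "w") as f:
    for chunk in chunks(entities, CHUNK_SIZE):
        # In recent versions extract() returns the walks grouped per
        # entity; flatten and write one walk per line, space-separated.
        for entity_walks in walker.extract(kg, chunk):
            for walk in entity_walks:
                f.write(" ".join(walk) + "\n")
```

This way, only one chunk's walks are in memory at any time; everything else lives on disk.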

However, in the end you will need to feed all these walks at once to Word2Vec or another word-embedding network, which could be the main bottleneck. For this, you could write a custom data loader that is responsible for preparing one or more batches for the network. This seems to be supported in gensim (and it most definitely is for Keras/Torch/...): https://stackoverflow.com/questions/63459657/how-to-load-large-dataset-to-gensim-word2vec-model

So you could read a file from disk in that data loader and serve it to Word2Vec, along the lines of the sketch below.
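
A minimal sketch of such a loader for gensim, assuming the walks.txt layout from the sketch above (one walk per line, hops separated by spaces). The corpus must be a restartable iterable, i.e. a class with __iter__ rather than a one-shot generator, because gensim passes over it once to build the vocabulary and then once per training epoch:

```python
from gensim.models import Word2Vec

class WalkCorpus:
    """Streams walks from disk so the full corpus never sits in memory."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opened on every call, so gensim can iterate once per epoch.
        with open(self.path) as f:
            for line in f:
                yield line.split()  # one walk = one "sentence" of hops

model = Word2Vec(
    sentences=WalkCorpus("walks.txt"),
    vector_size=100,  # embedding dimensionality
    window=5,
    min_count=1,  # keep every entity, even rarely visited ones
    workers=4,
)
```

gensim also ships LineSentence and a corpus_file argument for exactly this one-sentence-per-line format, which would avoid writing the iterator yourself.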

@deeplifyde (Author)

Thank you for the fast answer!

Is it not possible to use is_update=True on chunks of the data? Or would the underlying model store the walks for the old entities too?

@GillesVandewiele (Collaborator)

It's not ideal, as Word2Vec doesn't support iterative updating that well (it is possible, but suboptimal). I think the custom data loader will give better results!
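
For completeness, the incremental route in raw gensim would look roughly like the sketch below (build_vocab with update=True per chunk, then train); the chunk file names are assumptions. Note that each later chunk keeps shifting the vectors of entities seen earlier, which is part of why this tends to be suboptimal compared to streaming the whole corpus:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# First chunk builds the initial vocabulary and weights.
model = Word2Vec(LineSentence("walks_chunk0.txt"), vector_size=100, min_count=1)

# Each later chunk: extend the vocabulary, then run extra training passes.
for path in ["walks_chunk1.txt", "walks_chunk2.txt"]:  # hypothetical files
    chunk = LineSentence(path)
    model.build_vocab(chunk, update=True)
    model.train(chunk, total_examples=model.corpus_count, epochs=model.epochs)
```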
