
Fast generation of RDF2Vec embeddings with a SPARQL endpoint


Fast generation of embeddings using a SPARQL endpoint

On this page, we cover:

  1. setting up a DBPedia endpoint using a Stardog RDF store
  2. creating a custom random walker that uses Python's multiprocessing library
  3. tips and tricks
  4. benchmark results

DBPedia SPARQL endpoint

We will load DBPedia into Stardog because we already have a lot of expertise working with it. Of course, you can load the data into any other RDF store that provides a SPARQL endpoint.

Install Stardog

Stardog can be downloaded (e.g. with curl) via https://www.stardog.com/get-started/, after which you can unzip the downloaded file. (Keep in mind that Stardog requires Java 8 to work properly.)

Because a lot of triples will be loaded, we must make sure Stardog can use all the available resources of our server. Therefore, it is necessary to set the STARDOG_SERVER_JAVA_ARGS environment variable correctly according to the following table:

| # of triples | JVM heap memory | Direct memory | Total system memory |
| --- | --- | --- | --- |
| 100 million | 3GB | 4GB | 8GB |
| 1 billion | 8GB | 20GB | 32GB |
| 10 billion | 30GB | 80GB | 128GB |
| 25 billion | 60GB | 160GB | 256GB |
| 50 billion | 80GB | 380GB | 512GB |

Our server setup had 32GB of RAM available, so we executed export STARDOG_SERVER_JAVA_ARGS="-Xms8g -Xmx8g -XX:MaxDirectMemorySize=20g" on our server, which enabled us to load 1 billion triples.

As we will load multiple files with a large number of triples, we first set the Stardog server into bulk loading mode. Bulk mode can easily be enabled by creating a stardog.properties file in your STARDOG_HOME folder (if you did not define this folder, your STARDOG_HOME folder is the folder in which you executed the curl command above).

In this stardog.properties file, you paste the following two lines:

memory.mode = bulk
strict.parsing = false

Setting strict.parsing to false disables the strict parsing checks (some DBPedia triples violate the predefined ontological rules). Other settings can be listed in this properties file as well.

Now, you can start the Stardog server by executing the following command:

./stardog-7.4.5/bin/stardog-admin server start --disable-security --no-cors

The first time you start Stardog, it will ask you to install a license. Just answer the questions asked in the terminal. Academic users can use Stardog for free for one year, other users for 60 days.

Keep in mind that your Stardog version number can differ. By default, we disable both the server's security and CORS, as we are not planning to make our database public. We refer to the Stardog documentation (https://www.stardog.com/docs) for more information.

Load DBPedia

The following script can be used to download all the relevant triples for the October 2015 English version of DBPedia (the 2015 version is the one used in the original RDF2Vec paper).

mkdir -p data
cd data

mkdir core
cd core
wget -np -nd -r -A ttl.bz2 -A nt.bz2 "http://downloads.dbpedia.org/2015-10/core/"
cd ..

mkdir core-i18n
cd core-i18n
wget -nd -np -r -A ttl.bz2 "http://downloads.dbpedia.org/2015-10/core-i18n/en/"
cd ..

wget -nd -np -r -A .owl "http://downloads.dbpedia.org/2015-10/dbpedia_2015-10.owl"
cd ..

Again, you can edit this script to download other versions or other languages if needed.

Multiple ttl.bz2 and nt.bz2 files will be downloaded into the newly created data folder. Stardog can load these bz2 files directly, so you don't have to decompress them.

To load them into Stardog, you can use the following commands:

./stardog-7.4.5/bin/stardog-admin db create -n dbpedia $(find . -type f -name '*.bz2')
./stardog-7.4.5/bin/stardog data add dbpedia data/dbpedia_2015-10.owl

We recommend running these commands in a separate screen or tmux session, so that you can follow the loading progress with tail -f stardog.log (Ctrl-C to quit).

Grab a coffee; loading took ± 2 hours on our setup (32GB of RAM).

SPARQL endpoint

After all triples are loaded, it is better to tear down the Stardog server using:

./stardog-7.4.5/bin/stardog-admin server stop

and change the memory.mode setting in stardog.properties to:

memory.mode = default

This rebalances the available 32GB of RAM so that SELECT queries can be performed optimally. Now you can start the Stardog server again using ./stardog-7.4.5/bin/stardog-admin server start --disable-security --no-cors and you are ready to go.

Simple test

To use this Stardog service as a remote KG in our pyrdf2vec library, you can use code like the snippet described below:

from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec

kg = KG(location="http://YOUR_STARDOG_IP_OR_LOCALHOST:5820/dbpedia/query", is_remote=True)

walkers = [RandomWalker(1, 200, UniformSampler())]
embedder = Word2Vec(size=200)
transformer = RDF2VecTransformer(walkers=walkers, embedder=embedder)

embeddings = transformer.fit_transform(kg, ['http://dbpedia.org/resource/Brussels'])
print(embeddings)

Make sure the IP address of your Stardog service is correctly filled in. Stardog runs on port 5820 by default. The /dbpedia/query path defines the SPARQL endpoint of the dbpedia database, which we created when loading the DBPedia triples. Stardog can also host multiple databases on a single server.
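
If you want to quickly verify that the endpoint is reachable before generating any walks, you can query it directly. The snippet below is a minimal sketch, assuming Stardog runs locally on the default port 5820 with a database named dbpedia; it uses the requests library, which is not part of pyrdf2vec.

import requests

# Minimal sanity check: ask the SPARQL endpoint for a single triple.
# Adjust the host name if Stardog does not run on this machine.
endpoint = "http://localhost:5820/dbpedia/query"
response = requests.get(
    endpoint,
    params={"query": "SELECT * WHERE { ?s ?p ?o } LIMIT 1"},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
print(response.json()["results"]["bindings"])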

Multi-processed RandomWalker

As you will notice, executing SPARQL requests introduces delays, which increases the time needed to generate the embeddings. Therefore, the RandomWalker code can be extended with a tqdm progress bar that shows the current state, and it can be parallelised with the standard Python multiprocessing library to hide these delays by executing the requests in multiple processes.

The code to create such a RandomWalker is defined below:

from multiprocessing import Pool
from hashlib import md5
from typing import Any, List, Set, Tuple

import rdflib
from tqdm import tqdm

from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

class MultiProcessingRandomWalker(RandomWalker):
    def _proc(self, t):
        # Extract and canonicalise the walks for a single instance.
        kg, instance = t
        walks = self.extract_random_walks(kg, instance)
        canonical_walks = set()
        for walk in walks:
            canonical_walk = []
            for i, hop in enumerate(walk):  # type: ignore
                if i == 0 or i % 2 == 1:
                    canonical_walk.append(str(hop))
                else:
                    # Hash the entities at even positions to keep the walks compact.
                    digest = md5(str(hop).encode()).digest()[:8]
                    canonical_walk.append(str(digest))
            canonical_walks.add(tuple(canonical_walk))

        return {instance: tuple(canonical_walks)}

    # override the _extract method of RandomWalker
    def _extract(self, kg: KG, instances: List[rdflib.URIRef]) -> Set[Tuple[Any, ...]]:
        canonical_walks = set()
        seq = [(kg, instance) for instance in instances]
        # Extract the walks for all instances in parallel, with a progress bar.
        with Pool(4) as pool:
            res = list(tqdm(pool.imap_unordered(self._proc, seq),
                            total=len(seq)))
        res = {k: v for element in res for k, v in element.items()}
        for instance in instances:
            canonical_walks.update(res[instance])

        return canonical_walks

Here we use 4 processes in the multiprocessing pool. The work executed by each process is defined in the _proc function and is identical to the inner loop of the original RandomWalker in the random.py file.

You can use this MultiProcessingRandomWalker by simply providing it in the walkers argument list:

walkers = [MultiProcessingRandomWalker(1, 200, UniformSampler())]

Tips and Tricks

  • You will get the fastest results when running the MultiProcessingRandomWalker on the same machine as the Stardog service. This avoids the latency introduced by sending all the SPARQL responses over the network.
  • You can play with the number of processes, but the general rule is to use one less than the number of cores on your machine. Python's multiprocessing library can tell you this number: import multiprocessing; print(multiprocessing.cpu_count() - 1)
  • Depending on your platform and Python version, multiprocessing requires that the code spawning the processes is only executed in the main module. So if needed, encapsulate your code in if __name__ == '__main__': (see the sketch after this list).
  • Using the MultiProcessingRandomWalker against the public DBPedia endpoint is a bad idea: you will make more requests per second than this public endpoint allows (the number of parallel requests is limited) and, more concretely, you will be blocked. So use your own endpoint!
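
Putting the last two tips together, a complete script could look like the sketch below. It simply combines the earlier snippets: the MultiProcessingRandomWalker class defined above is assumed to be available in the same file, and the endpoint URL is still a placeholder you have to fill in.

from multiprocessing import cpu_count

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler

if __name__ == '__main__':
    # Guard the multiprocessing code so it also works on platforms that
    # spawn new processes instead of forking them.
    print(f"Suggested pool size: {cpu_count() - 1}")  # adjust Pool(4) in _extract accordingly

    kg = KG(location="http://YOUR_STARDOG_IP_OR_LOCALHOST:5820/dbpedia/query", is_remote=True)
    walkers = [MultiProcessingRandomWalker(1, 200, UniformSampler())]
    embedder = Word2Vec(size=200)
    transformer = RDF2VecTransformer(walkers=walkers, embedder=embedder)

    embeddings = transformer.fit_transform(kg, ['http://dbpedia.org/resource/Brussels'])
    print(embeddings)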

Benchmark results

Below, we provide some timing results obtained with the setup described above. We compare running our code on the Stardog server itself with running it on a laptop. The difference between these two is that the code run on the laptop suffers additional delays because the SPARQL requests are sent over the network.

We also compare the influence of using the multiprocessing module.

Our benchmark dataset consists of a set of DBPedia cities. We created a 200-dimensional embedding with 200 random walks for each of the 212 cities in this benchmark dataset and report the average number of instances that we can process per second.

| Depth | PC (single) | PC (4 cores) | On server (single) | On server (4 cores) |
| --- | --- | --- | --- | --- |
| 1 | 0.78 it/s | 2.58 it/s | 2.09 it/s | 8.14 it/s |
| 2 | 0.04 it/s | 0.22 it/s | 0.78 it/s | 2.85 it/s |
| 3 | 0.03 it/s | 0.14 it/s | 0.57 it/s | 2.05 it/s |
| 4 | 0.02 it/s | 0.12 it/s | 0.52 it/s | 1.91 it/s |
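
The it/s values are presumably the rates reported by the tqdm progress bar during walk extraction. If you want a rough end-to-end measurement yourself, a simple sketch could look like the one below. It assumes kg, transformer and a list of entity URIs called entities are already defined (entities is a placeholder; the benchmark list of cities is not included here), and it times the complete fit_transform call, including the Word2Vec training, so the reported rate will be somewhat lower than the walk-extraction rate in the table.

import time

# Rough end-to-end timing: how many entities per second does the full
# pipeline (walk extraction + embedding training) process?
start = time.time()
embeddings = transformer.fit_transform(kg, entities)
elapsed = time.time() - start
print(f"{len(entities) / elapsed:.2f} it/s for {len(entities)} entities")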