wikipedia-dump-converter

A very specific tool that converts the SQL dumps of Wikipedia into RDF files suitable for import into Dgraph.

For now, it only converts the page and the pagelinks tables. This is because I'm only interested in the links between pages (and especially between encyclopedia pages). I might add support for the category and categorylinks tables.

Installation

It's written in Rust and built with Cargo. Just type:

$ git clone https://github.com/Picani/wikipedia-dump-converter.git
$ cd wikipedia-dump-converter
$ cargo build --release

The executable is now at target/release/wikipedia-dump-converter. Move it somewhere on your PATH.

Usage

Print the help with:

$ wikipedia-dump-converter -h

First, convert the page table dump:

$ wikipedia-dump-converter -i pages -e page_table_dump.sql.gz converted_pages.rdf.gz

Remove the -i argument to stop at the first text-encoding error in the dump, instead of printing the error and continuing.

Remove the -e argument to also convert non-encyclopedia pages (user pages, help pages, etc.).

The resulting file looks like the following:

$ zcat converted_pages.rdf.gz | head
<3> <namespace> "0" .
<3> <title> "Antoine Meillet" .
<7> <namespace> "0" .
<7> <title> "Algèbre linéaire" .
<9> <namespace> "0" .
<9> <title> "Algèbre générale" .
<10> <namespace> "0" .
<10> <title> "Algorithmique" .
<11> <namespace> "0" .
<11> <title> "Politique en Argentine" .

In RDF triple terminology, the subject is the page's unique ID, the predicate is either title or namespace, and the object is either the page title (when the predicate is title) or the namespace's unique ID (when the predicate is namespace). The list of namespaces is available here.
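Once decompressed, the file is plain text with one triple per line, so it's easy to post-process. As a rough illustration (not part of the tool, and assuming the exact line format shown above), here is a short Python sketch that collects the titles of encyclopedia pages (namespace 0) into a dict keyed by page ID:

```python
import re

# Matches lines like: <3> <title> "Antoine Meillet" .
# or:                 <3> <namespace> "0" .
TRIPLE_RE = re.compile(r'<(\d+)> <(\w+)> (?:"([^"]*)"|<(\d+)>) \.')

def parse_pages(lines):
    """Return {page_id: title} for namespace-0 (encyclopedia) pages."""
    titles, namespaces = {}, {}
    for line in lines:
        m = TRIPLE_RE.match(line.strip())
        if not m:
            continue
        subject, predicate, literal, _obj_id = m.groups()
        if predicate == "title":
            titles[int(subject)] = literal
        elif predicate == "namespace":
            namespaces[int(subject)] = int(literal)
    return {pid: t for pid, t in titles.items() if namespaces.get(pid) == 0}

sample = [
    '<3> <namespace> "0" .',
    '<3> <title> "Antoine Meillet" .',
]
print(parse_pages(sample))  # {3: 'Antoine Meillet'}
```

In practice you would feed it the decompressed file, e.g. `parse_pages(gzip.open("converted_pages.rdf.gz", "rt"))`.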

Then, convert the pagelinks table dump:

$ wikipedia-dump-converter -i links pagelinks_table_dump.sql.gz converted_pages.rdf.gz converted_links.rdf.gz

Again, remove the -i argument to stop at the first text-encoding error in the dump, instead of printing the error and continuing.

The resulting file looks like the following:

$ zcat converted_links.rdf.gz | head
<177374> <linksto> <222657> .
<315352> <linksto> <222657> .
<1175072> <linksto> <222657> .
<3578724> <linksto> <222657> .
<7917621> <linksto> <222657> .
<222376> <linksto> <4433171> .
<4452220> <linksto> <4433171> .
<7563679> <linksto> <4433171> .
<7591490> <linksto> <4433171> .
<90880> <linksto> <351979> .

In RDF triple terminology, the subject is the unique ID of the page the link starts from, the predicate is linksto, and the object is the unique ID of the page the link points to.

Note: Only the links for which both pages are present in converted_pages.rdf.gz are converted.
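Because both endpoints are guaranteed to exist in the pages file, simple link statistics can be computed directly from these triples. As a hedged illustration (again, not part of the tool, and assuming the line format shown above), this Python sketch counts incoming links per page ID:

```python
import re
from collections import defaultdict

# Matches lines like: <177374> <linksto> <222657> .
LINK_RE = re.compile(r'<(\d+)> <linksto> <(\d+)> \.')

def inlink_counts(lines):
    """Return {target_page_id: number of pages linking to it}."""
    counts = defaultdict(int)
    for line in lines:
        m = LINK_RE.match(line.strip())
        if m:
            counts[int(m.group(2))] += 1
    return dict(counts)

sample = [
    '<177374> <linksto> <222657> .',
    '<315352> <linksto> <222657> .',
    '<90880> <linksto> <351979> .',
]
print(inlink_counts(sample))  # {222657: 2, 351979: 1}
```

Joining these counts with the {page_id: title} mapping from the pages file gives a quick "most linked-to articles" ranking without loading anything into Dgraph.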

License

Copyright © 2020 Sylvain PULICANI picani@laposte.net

This work is free. You can redistribute it and/or modify it under the terms of the Do What The Fuck You Want To Public License, Version 2, as published by Sam Hocevar. See the LICENSE file for more details.
