Simple and memory-efficient word extractor for Wiktionary

roadkell/wiktion

  // ___  ___ __   __   \\
 //   \\   \\ /    /     \\
//     \\   \\    /       \\
\\      \\  /\\  /        //
 \\      \\/  \\/iktion  //
  \\                    //

What

This is a small tool for extracting a list of all words from Wiktionary dumps, with optional regexp filtering.

It is not a full-featured parser/extractor for Wiktionary data. It doesn't extract definitions, translations, synonyms, etc. If you need that, check out other projects.

Currently, only ru-wiktionary dumps are supported. More languages will (hopefully) follow.
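For readers curious how a memory-efficient pass over a dump can work, here is a minimal sketch using only the standard library. It is not the code from wiktion.py (which, judging by the license section, relies on lxml); the function name `iter_titles` and the decision to match the regex against the raw page text are assumptions made for illustration.

```python
import bz2
import re
import xml.etree.ElementTree as ET


def iter_titles(dump_path, pattern=None):
    """Yield page titles from a bz2-compressed MediaWiki XML dump.

    Hypothetical helper: pages are streamed one at a time, so the whole
    dump is never held in memory. If `pattern` is given, only pages whose
    text matches it are yielded.
    """
    regex = re.compile(pattern) if pattern else None
    with bz2.open(dump_path, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        title, text = None, ""
        for event, elem in context:
            if event != "end":
                continue
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                if title and (regex is None or regex.search(text)):
                    yield title
                title, text = None, ""
                root.clear()  # discard finished pages to keep memory flat
```

Calling `root.clear()` after each page is what keeps memory usage roughly constant regardless of dump size; without it, the parsed tree would keep growing as the file is read.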

How

python3 wiktion.py [-h] [-l LANG] [-p POS] [-r REGEX] infile [outfile]

positional arguments:
	infile                      Wiktionary XML dump file (bz2-compressed), e.g.,
	                            'ruwiktionary-latest-pages-articles.xml.bz2'
	outfile                     list of extracted words (plain text)

options:
	-h, --help                  show this help message and exit
	-l LANG, --lang LANG        filter words by language, e.g., 'ru', 'en'
	-p POS, --pos POS           filter by part of speech, e.g.,
	                            'сущ', 'гл', 'adv' (sic), 'прил'
	-r REGEX, --regex REGEX     optional regex string to filter page text by
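
To illustrate how the options above combine, the following sketch declares an equivalent command-line interface with `argparse` and parses a sample argument list. It mirrors the usage text only; it is not taken from wiktion.py, and the `build_parser` name and the output file name `nouns.txt` are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser that mirrors the usage text above."""
    parser = argparse.ArgumentParser(prog="wiktion.py")
    parser.add_argument("infile", help="Wiktionary XML dump file (bz2-compressed)")
    parser.add_argument("outfile", nargs="?", help="list of extracted words (plain text)")
    parser.add_argument("-l", "--lang", help="filter words by language, e.g., 'ru', 'en'")
    parser.add_argument("-p", "--pos", help="filter by part of speech, e.g., 'сущ'")
    parser.add_argument("-r", "--regex", help="optional regex string to filter page text by")
    return parser


# Equivalent to:
#   python3 wiktion.py -l ru -p сущ ruwiktionary-latest-pages-articles.xml.bz2 nouns.txt
args = build_parser().parse_args(
    ["-l", "ru", "-p", "сущ", "ruwiktionary-latest-pages-articles.xml.bz2", "nouns.txt"]
)
print(args.lang, args.pos, args.infile, args.outfile)
```

All three filters are optional and can be used together; `outfile` may be omitted, in which case this sketch leaves it as `None`.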

Dumps can be downloaded at https://dumps.wikimedia.org/.

The required dumps are named [lang]wiktionary-[date|latest]-pages-articles[-multistream].xml.bz2, e.g., ruwiktionary-latest-pages-articles.xml.bz2 or ruwiktionary-20220720-pages-articles-multistream.xml.bz2.
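
The naming scheme above can be checked programmatically. The sketch below is one possible way to do so; the `DUMP_NAME` regex and the `parse_dump_name` helper are hypothetical, and the assumptions that dates are eight digits and that language codes consist of lowercase letters and underscores are inferred from the examples rather than stated by the tool.

```python
import re

# Assumed pattern: [lang]wiktionary-[date|latest]-pages-articles[-multistream].xml.bz2
DUMP_NAME = re.compile(
    r"^(?P<lang>[a-z_]+)wiktionary-(?P<date>\d{8}|latest)"
    r"-pages-articles(?P<multistream>-multistream)?\.xml\.bz2$"
)


def parse_dump_name(filename):
    """Return the parts of a dump file name, or None if it does not fit the scheme."""
    match = DUMP_NAME.match(filename)
    if match is None:
        return None
    return {
        "lang": match["lang"],
        "date": match["date"],
        "multistream": match["multistream"] is not None,
    }


print(parse_dump_name("ruwiktionary-20220720-pages-articles-multistream.xml.bz2"))
```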

Other projects

Even more:

License

GNU General Public License v3.0

lxml: BSD

tqdm: MIT
