Wikidump

Framework for the extraction of features from Wikipedia XML dumps.

Installation

This project has been tested with Python 3.5.0 and Python 3.8.5.

First, install the dependencies:

pip install -r requirements.txt

Usage

First of all, download the Wikipedia dumps:

./download.sh

Then run the extractor:

python -m wikidump [PROGRAM_OPTIONS] FILE [FILE ...]  OUTPUT_DIR [PROGRAM_OPTIONS] FUNCTION [FUNCTION_OPTIONS]

You can also run the program through the Makefile with GNU Make (edit the Makefile to set the desired parameters). For example, you can run the program on the English dumps by typing:

make run-en

Examples of use

Retrieve the languages known by each user according to the last revision

If you are interested in extracting the languages known by the users of the Catalan Wikipedia, you can type:

python -m wikidump --output-compression gzip dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output extract-known-languages --only-pages-with-languages --only-revisions-with-languages --only-last-revision
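
As a rough illustration of what this extraction looks for, here is a minimal sketch that assumes languages are declared through Babel-style templates such as {{Babel|ca|en-3}} and uses the mwparserfromhell wikitext parser purely for demonstration; the actual extractor may recognise more variants and rely on different tooling.

import mwparserfromhell  # third-party wikitext parser, used here only for illustration

def babel_languages(user_page_wikitext):
    """Return the language codes declared in Babel-style templates."""
    wikicode = mwparserfromhell.parse(user_page_wikitext)
    languages = []
    for template in wikicode.filter_templates():
        if str(template.name).strip().lower() == "babel":
            for param in template.params:
                languages.append(str(param.value).strip())  # e.g. "ca", "en-3"
    return languages

print(babel_languages("{{Babel|ca|en-3|es-2}}"))  # ['ca', 'en-3', 'es-2']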

Retrieve the wikibreak history

To retrieve the wikibreak and similar templates associated with users on their user pages and user talk pages, you can type:

python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_wikibreaks --output-compression gzip extract-wikibreaks --only-pages-with-wikibreaks

The example above extracts the wikibreak templates from the Catalan Wikipedia.

Retrieve options and occurrences of the transcluded user warning templates

To retrieve transcluded user warning templates and their associated parameters within user talk pages, you can run the following Python command:

python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_user_warnings_transcluded --output-compression gzip extract-user-warnings --only-pages-with-user-warnings

The example shown above extracts the transcluded templates from the Catalan Wikipedia.
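
For intuition, the following minimal sketch shows how transcluded templates and their parameters can be read from wikitext with the mwparserfromhell parser; it only illustrates the concept, not this project's implementation, and the set of recognised template names (known_user_warnings) is a hypothetical input.

import mwparserfromhell  # third-party wikitext parser, used here only for illustration

def transcluded_user_warnings(talk_page_wikitext, known_user_warnings):
    """Return (template name, {parameter: value}) pairs for known user warnings."""
    wikicode = mwparserfromhell.parse(talk_page_wikitext)
    found = []
    for template in wikicode.filter_templates():
        name = str(template.name).strip()
        if name in known_user_warnings:
            params = {str(p.name).strip(): str(p.value).strip() for p in template.params}
            found.append((name, params))
    return found

print(transcluded_user_warnings("{{uw-vandalism1|Some article}}", {"uw-vandalism1"}))
# [('uw-vandalism1', {'1': 'Some article'})]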

Retrieve regular expressions from the user warning templates

This command aims to produce regular expressions that detect a user warning template substituted (via the subst function) within user talk pages.

Unfortunately, for the sake of simplicity, the subst-chain is not handled by this Python code.

To run the script, use the following command:

python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_user_warnings_regex --output-compression gzip extract-user-warnings-templates --esclude-template-repetition --set-interval '1 week'

The example above produces the regular expressions for the Catalan Wikipedia.

The previous command ignores revisions in which the template has not changed. The script groups the changes by week: within a single week, it returns the latest of all the revisions made in those seven days.
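
A minimal sketch of this weekly grouping, assuming each revision carries a datetime timestamp (illustrative only, not the project's exact code):

def latest_revision_per_week(revisions):
    """revisions: iterable of (timestamp: datetime.datetime, text: str) pairs."""
    weekly = {}
    for timestamp, text in sorted(revisions, key=lambda r: r[0]):
        year, week, _ = timestamp.isocalendar()
        weekly[(year, week)] = (timestamp, text)  # later revisions overwrite earlier ones
    return list(weekly.values())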

Please note: the regular expressions have not been tested, since doing so would have been demanding and time-consuming, so I cannot guarantee that the output is entirely correct.
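
To give an idea of what such a pattern can look like, here is a simplified sketch that turns a template's wikitext into a loose regular expression by escaping the literal text, turning triple-brace parameter placeholders into wildcards, and relaxing whitespace; the patterns actually produced by the tool may be built differently.

import re

def template_to_regex(template_wikitext):
    """Build a loose regex that matches a substituted copy of a template (sketch)."""
    escaped = re.escape(template_wikitext)
    # Triple-brace parameter placeholders such as {{{1|Hello}}} can expand to anything.
    escaped = re.sub(r"\\\{\\\{\\\{.*?\\\}\\\}\\\}", ".*?", escaped)
    # Any run of whitespace (escaped or not, depending on the Python version) is flexible.
    escaped = re.sub(r"(\\\s|\s)+", r"\\s+", escaped)
    return re.compile(escaped, re.DOTALL)

pattern = template_to_regex("{{{1|Hello}}}, please stop.")
print(bool(pattern.search("John, please stop.")))  # True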

Retrieve the salient words of a user warnings template

In order to find the most salient words, i.e. those which best characterize the user warning templates, you can run the following command:

python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_user_warnings_tokens --output-compression gzip extract-user-warnings-templates-tokens --esclude-template-repetition --set-interval '1 week' --language catalan

The example above shows the most salient words extraction considering the Catalan Wikipedia.

The previous command ignores revisions in which the template has not changed. The script groups the changes by week: within a single week, it returns the latest of all the revisions made in those seven days.

How the algorithm chooses the most salient words

First, punctuation and symbols are removed from each template. Second, the stopwords of the chosen language are removed. Then, if the appropriate flag is set, every remaining word is stemmed. Finally, the tf-idf value of each word across all the revisions is calculated. The corpus consists of the template texts of the revisions selected for that template. Define N as the number of words that make up a revision of the template, and X as the number of documents in the corpus. The corpus is then padded to 2*X documents per template: X extra documents are randomly taken from other templates of the same language, to prevent the idf value from being too small (in the worst case 0) for templates that change infrequently. The K words with the highest tf-idf value are then selected for each revision, where K varies from revision to revision and is equal to N/2.
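
The following is a minimal sketch of the scoring step described above, assuming the revision texts have already been cleaned, tokenized, and stripped of stopwords (and optionally stemmed); it illustrates the idea rather than reproducing the project's exact code.

import math
from collections import Counter

def salient_words(revision_docs, padding_docs):
    """revision_docs: cleaned word lists of one template's revisions (X documents).
    padding_docs: X word lists sampled from other templates of the same language."""
    corpus = revision_docs + padding_docs          # 2*X documents in total
    document_frequency = Counter()
    for doc in corpus:
        document_frequency.update(set(doc))

    selected = []
    for doc in revision_docs:
        n = len(doc)                               # N: words in this revision
        term_frequency = Counter(doc)
        scores = {
            word: (term_frequency[word] / n) * math.log(len(corpus) / document_frequency[word])
            for word in term_frequency
        }
        k = max(1, n // 2)                         # K = N/2, different for each revision
        selected.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return selected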

Probabilistic way to retrieve the occurrences of user warnings

To find substituted user warnings in a probabilistic way, with the possibility of false positives, you can run the following command:

python -m wikidump dumps/cawiki/20210201/cawiki-20210201-pages-meta-history.xml.7z output_user_warnings_probabilistic --output-compression gzip extract-user-warnings-templates-probabilistic --only-pages-with-user-warnings --language catalan output_tokens/cawiki-20210201-pages-meta-history.xml.7z.features.json.gz --only-last-revision

The example above uses the words extracted by the extract-user-warnings-templates-tokens command, whose output file is passed as a parameter. The objective is to find all the salient words of a template within the user talk page; if this succeeds, the template is marked as found and the salient words that were found are printed.
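
A minimal sketch of this lookup, under the assumption that a template counts as found when all of its salient words occur in the page text (illustrative only):

def find_templates(page_text, salient_words_by_template):
    """salient_words_by_template: {template name: list of salient words}."""
    page_words = set(page_text.lower().split())
    matches = {}
    for template, words in salient_words_by_template.items():
        found = [word for word in words if word.lower() in page_words]
        if found and len(found) == len(words):   # every salient word is present
            matches[template] = found            # these are the words that get printed
    return matches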

Retrieve the name of a Wikipedia template in different languages

First, find the Wikidata item code of the template; for example, the code for the wikibreak template is Q5652064 (retrieved from the corresponding Wikidata page).

Second, install the development dependencies:

pip install -r requirements.dev.txt

Finally, run the following Python command, giving it the template code:

python utils/get_template_names.py WIKIDATA-TEMPLATE-CODE
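
For example, using the wikibreak code mentioned above:

python utils/get_template_names.py Q5652064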

Data

The documentation regarding the produced data and the refactored data can be found in the data documentation.

How to merge and refactor the raw data

To merge all the fragments into which the dump is divided, and to make the produced file more manageable, you can run the Python scripts in the utils/dataset_handler folder in sequence. As in the previous case, it is possible and recommended to use the Makefile; after editing it, you can simply type:

make run
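
As an illustration of the kind of merge these scripts perform, here is a minimal sketch that assumes the fragments are gzip-compressed JSON Lines files; the file layout assumed here and the paths in the usage line are only examples, not the project's actual layout.

import glob
import gzip

def merge_fragments(fragment_pattern, output_path):
    """Concatenate gzip-compressed JSON Lines fragments into a single file."""
    with gzip.open(output_path, "wt", encoding="utf-8") as merged:
        for fragment in sorted(glob.glob(fragment_pattern)):
            with gzip.open(fragment, "rt", encoding="utf-8") as handle:
                for line in handle:      # one JSON object per line
                    merged.write(line)

merge_fragments("output/*.features.json.gz", "output/merged.features.json.gz")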

Metrics

utils/dataset_handler also contains some scripts to upload metrics to a Postgres database. To produce them, run the following command:

python utils/metrics_loader/..metrics.py DATASET_LOCATION DATABASE_NAME POSTGRES_USER POSTGRES_USER_PASSWORD POSTGRES_PORT
  • DATASET_LOCATION refers to the path where the compressed JSON file is stored. Make sure you pass the dataset that matches the metrics you want to compute.
  • DATABASE_NAME refers to the name of the Postgres database you want to use
  • POSTGRES_USER refers to the name of the Postgres user you want to use
  • POSTGRES_USER_PASSWORD refers to the password of the previously defined user
  • POSTGRES_PORT refers to the port of the Postgres process

The produced metrics will be the following:

  • Number of wikibreak templates registered in a given month, and the cumulative amount up to that point
  • Number of user warnings templates registered in a given month, and the cumulative amount up to that point

They will have the following schema:

  • id: SERIAL (primary key)
  • name: TEXT
  • year: INT
  • month: INT
  • category: TEXT
  • uw_category: TEXT
  • wikibreak_category1: TEXT
  • wikibreak_category2: TEXT
  • wikibreak_subcategory: TEXT
  • amount: INT
  • cumulative_amount: INT

Run

To run all the scripts on all the Wikipedia dumps, you can use the following script:

./run.sh

First of all, make sure you have modified the readonly variables to fit your needs; feel free to change whatever you want.

The dependencies of the previously defined script are

Docker

To run the entire program in a Docker container, a Dockerfile is provided.

First, you need to change the content of the run.sh file to fit your requirements, such as the files' locations and which operations the script should carry out.

Additionally, make sure you have given the correct reference if you want to download the dump directly within the Docker image by using wikidump-download-tools.

Then, you can build the Docker image by typing:

docker build -t wikidump .

Lastly, run the Docker image:

docker run wikidump

Authors

This library was created by Alessio Bogon and then expanded by Cristian Consonni.

The project presented here builds on that pre-existing structure.
