
Detailed Guide


How to use LexNET?

Creating the Corpus

The LexNET corpus is used to extract connecting dependency paths between target words. You can get it in one of the following ways:

  • Download the corpus used in the paper: we used the English Wikipedia dump from May 2015 (available here). You will need to convert the XML to text, which you can do with WikiExtractor:
git clone https://github.com/attardi/wikiextractor.git;
wget https://dumps.wikimedia.org/enwiki/[some_date]/enwiki-[some_date]-pages-articles-multistream.xml.bz2;
mkdir text;
python wikiextractor/WikiExtractor.py --processes 20 -o text/ enwiki-[some_date]-pages-articles-multistream.xml.bz2;
cat text/*/* > wiki_text;
  • We computed the paths between the most frequent unigrams, bigrams and trigrams in Wikipedia (based on GloVe vocabulary and the most frequent 100k bigrams and trigrams). The files for the Wiki corpus are available here.

  • Creating a custom parsed corpus:

    • Run the script parse_wikipedia: parse_wikipedia.py <wiki_file> <vocabulary_file> <out_file>, where:

      • wiki_file is the Wikipedia dump file obtained from here; for better runtime, split the dump into as many chunks as possible and run in parallel.
      • vocabulary_file is a file that contains a list of words that should be included in the resource (as target words), one word per line. In the paper we used the most common 400k words in Wikipedia + the most common 100k bigrams and trigrams in Wikipedia. The file is available here.
      • out_file is where the parsed corpus should be saved. This script creates a triplet file of the paths, formatted as: x\ty\tpath.
    • Run the script create_resource_from_corpus: create_resource_from_corpus.py <triplets_file> <resource_prefix>, where:

      • triplets_file is the output of the previous script.
      • resource_prefix is the file names' prefix for the resource files (e.g. /home/vered/lexnet/corpus/wiki, where wiki is a prefix rather than a directory). The corpus directory should contain the entities file (resource_prefix + 'Entities.txt', identical to vocabulary_file from the previous script) and the paths file (resource_prefix + 'Paths.txt'). The paths file tells the model which paths to consider; in the paper, we only considered frequent paths (e.g. paths that occurred at least 5 times), and the file is available here. In both files, each entity/path is on a separate line.

    This script creates the .db resource files under the resource_prefix directory, named with the given prefix.
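
    For concreteness, here is a sketch of the two steps with illustrative file names, assuming that the entities and paths files (corpus/wikiEntities.txt and corpus/wikiPaths.txt in this example) already exist as described above:

# illustrative paths: wiki_text is the plain-text corpus produced in the extraction step above
python parse_wikipedia.py wiki_text corpus/wikiEntities.txt corpus/wiki_triplets.txt;
# builds the corpus/wiki_*.db resource files from the triplet file
python create_resource_from_corpus.py corpus/wiki_triplets.txt corpus/wiki;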

Getting a Dataset

Any dataset of semantic relations between words can be used with LexNET. The datasets used in the paper are available in the datasets directory of the repository.

Each dataset is split into train, test, and validation sets. Alternatively, you can provide your own dataset. The directory needs to contain 3 files, whose names end with '_train.tsv', '_val.tsv', and '_test.tsv' for the train, validation, and test sets, respectively. Each line is a separate entry, formatted as x\ty\trelation.

In addition, the directory should contain the relations.txt file, which lists the relations in the dataset, one per line. It can be created with the following command: cut -f3 -d$'\t' train.tsv | sort -u > relations.txt.
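
For concreteness, a minimal sketch of preparing such a directory for a hypothetical dataset called custom; the word pairs and relation names are purely illustrative, and the file names follow the train.tsv/val.tsv/test.tsv convention used by the training script below (if your setup expects dataset-specific prefixes such as custom_train.tsv, adjust the names accordingly):

mkdir -p datasets/custom;
# each line is x<TAB>y<TAB>relation; these pairs and relations are only examples
printf 'cat\tanimal\thypernym\ncat\tdog\trandom\n' > datasets/custom/train.tsv;
printf 'banana\tfruit\thypernym\n' > datasets/custom/val.tsv;
printf 'car\tvehicle\thypernym\n' > datasets/custom/test.tsv;
# relations.txt lists the distinct relations, one per line
cut -f3 -d$'\t' datasets/custom/train.tsv | sort -u > datasets/custom/relations.txt;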

Training and Evaluating the Model

To train a LexNET model (the integrated model), run the following command:

train_integrated.py [corpus_prefix] [dataset_prefix] [model_prefix_file] [embeddings_file] [num_hidden_layers]

Where:

  • corpus_prefix is the file path and prefix of the corpus files, e.g. corpus/wiki, such that the directory corpus contains the wiki_*.db files created by create_resource_from_corpus.py.
  • dataset_prefix is the file path of the dataset files, such that this directory contains 4 files: train.tsv, test.tsv and val.tsv for the train, test and validation sets, respectively, and relations.txt, a file containing the relations in the dataset.
  • model_prefix_file is the output directory and prefix for the model files. The model is saved in 3 files: .model, .params and .dict. In addition, the test set predictions are saved in .predictions, and the prominent paths are saved to .paths.
  • embeddings_file is the pre-trained word embeddings file, in txt format (i.e., every line consists of the word followed by a space and its vector; see GloVe for an example).
  • num_hidden_layers is the number of network hidden layers (0 and 1 are supported).

The script trains several models, tuning the word dropout rate and the learning rate using the validation set. The best performing model on the validation set is saved and evaluated on the test set.
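
For example, a full training run might look like the following; all paths are illustrative (glove.6B.50d.txt stands for whichever pre-trained embeddings file you use), and the corpus and dataset files are assumed to have been prepared as in the previous sections:

mkdir -p model;
# corpus prefix, dataset prefix, model prefix and embeddings path are illustrative
python train_integrated.py corpus/wiki datasets/custom model/custom_integrated glove.6B.50d.txt 0;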

Alternatively, you can train a model that uses only path-based information, without the distributional information, by running the script train_path_based.py.
