- Place the input `.ttl` files in some directory `your_data_dir`.
- Start the required services:

  ```
  docker-compose up -d db rabbitmq
  ```

- Add the input files to the processing queue for label extraction:

  ```
  DATA_DIR=your_data_dir make queue-all-files
  ```

- Start the label extraction process (can run in parallel with the previous step):

  ```
  make extract
  ```

  All extracted labels are added to the database.
- Generate the corpus file:

  ```
  ./lodcat-generate-corpus your_data_dir corpus/corpus.xml
  ```

  The file `corpus/corpus.xml` will be generated.
- Generate the object file:

  ```
  ./lodcat-generate-object corpus/corpus.xml object/object.gz
  ```

  The file `object/object.gz` will be generated.
- Generate the model file:

  ```
  make generate-model
  ```

  The file `model/model.gz` will be generated.
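The steps above can be collected into a single shell function. This is only a sketch: it assumes the repository root as the working directory, runs the steps sequentially (queueing and extraction may also run in parallel), and `your_data_dir` remains a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: the pipeline steps above wrapped in one function.
# Paths and the data directory argument are placeholders.
run_lodcat_pipeline() {
  local data_dir="$1"

  # Start the database and message queue, then enqueue files and extract labels.
  docker-compose up -d db rabbitmq
  DATA_DIR="$data_dir" make queue-all-files
  make extract

  # Corpus -> object -> model.
  ./lodcat-generate-corpus "$data_dir" corpus/corpus.xml
  ./lodcat-generate-object corpus/corpus.xml object/object.gz
  make generate-model
}
```

Invoked as `run_lodcat_pipeline your_data_dir` once the repository's services and Makefile targets are available.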
```
./list-wikidump-redirects <(bzcat *-pages-articles-multistream.xml.bz2) >redirects.txt
```
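The `<(...)` above is bash process substitution: the output of `bzcat` is presented to `./list-wikidump-redirects` as a readable file path, so the compressed dump never needs to be unpacked to disk. A minimal, self-contained demonstration of the same mechanism:

```shell
# <(cmd) expands to a file path (e.g. /dev/fd/63) whose contents are
# the output of cmd; here wc -l reads three lines produced by printf.
wc -l < <(printf 'a\nb\nc\n')
# prints 3
```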
```
./lodcat-part-corpus <input directory> <output directory> <documents per file> <number of parallel jobs>
```

```
./lodcat-preproc-wiki-parts <input directory> <output directory> <number of parallel jobs>
```

```
./lodcat-generate-wiki-object --input <input directory> --output <output file> [--output-type {object.gz|xml}] [--include-names <file>] [--exclude-names <file>]
```
Topic quality is measured with Palmetto.

```
./lodcat-measure-quality C_P <model directory>
```

Output: `quality.csv` with a quality value for each topic.
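The exact column layout of `quality.csv` is not specified here; assuming one `topic_id,quality` row per topic (a guess for illustration only), topics can be ranked by quality with standard tools:

```shell
# Hypothetical quality.csv contents; the real column layout may differ.
printf '0,0.41\n1,0.17\n2,0.33\n' > /tmp/quality-example.csv
# Sort topics from highest to lowest quality (numeric sort on column 2).
sort -t, -k2 -rn /tmp/quality-example.csv
# prints:
# 0,0.41
# 2,0.33
# 1,0.17
```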
Labels are generated with NETL.

```
./lodcat-generate-labels <model directory>
```

Output: `labels-supervised.csv` and `labels-unsupervised.csv` with label candidates for each topic.
```
./lodcat-quality-number <object file> <output directory> <number of repeats for the same parameters> <number of jobs to run in parallel>
```

```
./lodcat-quality-number-report <output directory>
```

Output: `micro-quality.csv` and `micro-quality.png`; the CSV and a corresponding plot will be generated in the specified output directory.
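`lodcat-quality-number` runs each parameter setting several times, so the report aggregates repeated measurements. As an illustration only (not the tool's actual computation), a micro-average pools all per-topic values across repeats into one mean:

```shell
# Hypothetical quality values from two repeats (one value per line);
# the micro-average is the mean over the pooled values.
printf '0.40\n0.20\n0.30\n0.44\n0.16\n0.36\n' \
  | awk '{ sum += $1; n += 1 } END { printf "%.2f\n", sum / n }'
# prints 0.31
```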
First, build word count data from the corpus files:

```
./lodcat-count-words <model directory> <directory with XML corpus files> <output directory> <number of jobs to run in parallel>
```

Then, run the classifier:

```
./lodcat-classify <model directory> <directory with word count files from the previous step> <output directory> <number of jobs to run in parallel>
```
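The two steps can be wrapped in one function. This is a sketch only: the `word-counts` and `classification` subdirectories are an assumed layout, not part of the tools' interfaces.

```shell
# Sketch: run both classification steps in sequence.
# The intermediate word-counts directory is an assumed layout.
classify_corpus() {
  local model_dir="$1" corpus_dir="$2" out_dir="$3" jobs="$4"
  ./lodcat-count-words "$model_dir" "$corpus_dir" "$out_dir/word-counts" "$jobs"
  ./lodcat-classify "$model_dir" "$out_dir/word-counts" "$out_dir/classification" "$jobs"
}
```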
```
./lodcat-model-report <model directory>
```

```
./lodcat-classification-report <model directory> <classification file> <output directory>
```
```
./lodcat-uri-counts corpus/corpus.xml occurrence_dir
```

```
./lodcat-uri-counts-report occurrence_dir/documents-per-namespace
```

CSVs and plots will be generated in the specified `occurrence_dir`.
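The `documents-per-namespace` output implies that URIs are grouped by namespace. A common heuristic (an assumption here, not necessarily what `lodcat-uri-counts` does) is to split each URI at its last `#` or `/`:

```shell
# Heuristic namespace split (an assumption, not the tool's documented
# behavior): everything up to and including the last '#' or '/'.
uri_namespace() {
  case "$1" in
    *#*) printf '%s#\n' "${1%#*}" ;;   # strip from the last '#'
    */*) printf '%s/\n' "${1%/*}" ;;   # otherwise from the last '/'
    *)   printf '%s\n' "$1" ;;         # no separator: return as-is
  esac
}
uri_namespace "http://www.w3.org/2002/07/owl#Class"
# prints http://www.w3.org/2002/07/owl#
uri_namespace "http://xmlns.com/foaf/0.1/name"
# prints http://xmlns.com/foaf/0.1/
```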
```
./lodcat-document-rdf-size <directory with HDT files> <output directory> <number of jobs>
```