The Gazetteer Creator is a tool designed to facilitate the creation of gazetteers for Named Entity Recognition (NER) tasks. It leverages the Wikidata search tool to compile comprehensive lists of named entities relevant to your NER projects.
To begin using the Gazetteer Creator, follow these steps:
Run: bash initialize_env.sh
Then open a new terminal.
python datasets/process_multiconer.py --file <path to multiconer training conll file>
python datasets/process_vimq.py --file <path to vimq training json file>
python datasets/process_rdrs.py --dir <path to dir containing 5 folds of RDRS (in form "~RDRS-main/data/interim")>
bash make_gazetteer.sh <threshold> <limit> <lang> <dataset>
Where:
threshold: similarity threshold used when comparing a Wikidata topic with the synonyms of the labels
limit: number of pages returned by the Wikidata search tool
lang: language of the pages returned by the Wikidata search tool
dataset: one of vimq, multiconer, or rdrs
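
The steps above can be sketched as a small dry-run helper that prints the commands for one dataset rather than running them. It is a sketch under assumptions: the function names preprocess_cmd and gazetteer_cmd are hypothetical, the values 0.8, 10, vi and the input path are placeholders, and it assumes you only need to preprocess the dataset you pass to make_gazetteer.sh. Check the scripts themselves for the accepted value ranges and language codes.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands for one dataset instead of executing them.
# Function names and all argument values below are illustrative assumptions.

# Print the preprocessing command for a dataset
# (script names and flags are the ones listed in the steps above).
preprocess_cmd() {
  local dataset="$1" input="$2"
  case "$dataset" in
    multiconer) echo "python datasets/process_multiconer.py --file $input" ;;
    vimq)       echo "python datasets/process_vimq.py --file $input" ;;
    rdrs)       echo "python datasets/process_rdrs.py --dir $input" ;;
    *)          echo "unknown dataset: $dataset" >&2; return 1 ;;
  esac
}

# Print the gazetteer command in the documented argument order:
# <threshold> <limit> <lang> <dataset>
gazetteer_cmd() {
  local threshold="$1" limit="$2" lang="$3" dataset="$4"
  echo "bash make_gazetteer.sh $threshold $limit $lang $dataset"
}

# Example with placeholder values (assumptions, not recommended settings).
preprocess_cmd vimq path/to/vimq_train.json
gazetteer_cmd 0.8 10 vi vimq
```

Once the printed commands look right, run them directly in the new terminal opened after initialize_env.sh.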