This repository has been archived by the owner on Jun 2, 2023. It is now read-only.

Dataset

This folder contains raw datasets and processing scripts for each language/PII feature. Datasets are either:

  • scraped from an open web interface
  • acquired through an API request / database copy
  • sourced from various open source repositories with compatible licenses

The final datasets are processed and aggregated by generate_dataset.py. The result is stored in aggregate/{language-code}/ds_full.json and exported to the language models of the npm package.

Features

Features are implemented as Python modules, which allows one feature to depend on another (e.g. when filtering brand names out of medicine names). This is useful when one or more datasets need preprocessing but only the original dataset should be stored. Each module has a NAME variable, which is used as the feature name, and a get_wordlists() function, which returns the word lists for the feature as a dictionary of dictionaries. The main word list contains text that, if matched, could indicate the presence of the feature. Other word lists can be added to provide more certainty, or contextual hints that might indicate the presence or absence of the feature.
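As an illustration of the module contract described above, a feature module might look roughly like the sketch below. Only NAME and get_wordlists() come from this README. The "context" word list, the helper _normalize(), and the in-memory stand-in data are hypothetical. The exact nesting of the returned dictionaries is not spelled out here, so the sketch assumes a mapping from word-list name to a list of strings, matching the ds_full.json layout.

```python
# Hypothetical feature module; in the layout described in this README it would
# live at {language-code}/feature_sets/medicine_names/__init__.py.

NAME = "medicine_names"

# Stand-ins for data that a real module would load from its raw/ folder.
_RAW_MEDICINES = ["Aafact", "abacavir", "Abacavir Accord", "abacavir"]
_CONTEXT_HINTS = ["tablet", "dosering"]  # assumed contextual hints


def _normalize(words):
    """Lowercase, strip, de-duplicate, and sort a list of words."""
    return sorted({w.strip().lower() for w in words if w.strip()})


def get_wordlists():
    """Return the word lists for this feature, keyed by word-list name.

    Assumption: each value is a plain list of strings, as in ds_full.json.
    """
    return {
        "main": _normalize(_RAW_MEDICINES),
        "context": _normalize(_CONTEXT_HINTS),  # hypothetical extra list
    }
```

Keeping the normalization inside the module means the raw/ data can stay untouched while the exported lists are cleaned on every run.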

Structure

Folder

  • generate_dataset.py - aggregates all features / datasets and exports the files to the correct locations
  • {language-code} - sub folder for a language
    • feature_sets - folder containing all "feature modules"
      • __init__.py - topmost module, contains all "feature modules" in a list
      • feature_name/ - folder containing feature module and raw datasets
        • __init__.py - "feature module", containing feature dataset loading and preprocessing
        • raw/ - raw datasets (such as a database / scripts for aggregation or hand-made lists)
    • feature_templates - contains benchmark and mapping data
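The aggregation step performed by generate_dataset.py could be sketched as follows. This is not the actual script: the FEATURES list name, the aggregate() and export() helpers, and the stand-in modules are assumptions made for illustration, and the "pii_dataset_{language-code}" naming is inferred from the "pii_dataset_nl" example in the format section.

```python
import json
from pathlib import Path
from types import SimpleNamespace

# Stand-ins for feature modules; real ones would be imported from feature_sets.
medicine_names = SimpleNamespace(
    NAME="medicine_names",
    get_wordlists=lambda: {"main": ["aafact", "abacavir"]},
)
first_names = SimpleNamespace(
    NAME="first_names",
    get_wordlists=lambda: {"main": ["anna", "jan"]},
)

# Mirrors the list kept in feature_sets/__init__.py (variable name assumed).
FEATURES = [medicine_names, first_names]


def aggregate(language_code, features, version=0):
    """Collect the word lists of every feature into one dataset dictionary."""
    return {
        "name": f"pii_dataset_{language_code}",  # naming pattern is inferred
        "version": version,
        "wordlists": {f.NAME: f.get_wordlists() for f in features},
    }


def export(dataset, language_code, root="aggregate"):
    """Write the dataset to {root}/{language_code}/ds_full.json."""
    out_path = Path(root) / language_code / "ds_full.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(dataset, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return out_path
```

A call such as export(aggregate("nl", FEATURES), "nl") would then produce aggregate/nl/ds_full.json relative to the working directory.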

aggregate/{language-code}/ds_full.json format

{
    "name": "pii_dataset_nl",
    "version": 0, 
    "wordlists": {
        "medicine_names": {
            "main": [
                "aafact",
                "abacavir",
                "abacavir accord",
                "abacavir hexal",
                ...
            ],
            ...
        },
        ...
    }
}
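To show how a consumer might read this format, the sketch below parses a small inline sample (a shortened copy of the example above, without the elided entries) and looks up which features list a given term in their main word list. The function names load_main_wordlists() and features_matching() are hypothetical, and exact lowercase matching is an assumption rather than the matching strategy used by the package.

```python
import json

# Shortened sample in the ds_full.json format; real files contain more entries.
SAMPLE = """
{
    "name": "pii_dataset_nl",
    "version": 0,
    "wordlists": {
        "medicine_names": {
            "main": ["aafact", "abacavir", "abacavir accord", "abacavir hexal"]
        }
    }
}
"""


def load_main_wordlists(text):
    """Parse the JSON text and return {feature name: set of main words}."""
    data = json.loads(text)
    return {
        feature: set(lists.get("main", []))
        for feature, lists in data["wordlists"].items()
    }


def features_matching(term, main_lists):
    """Return the sorted feature names whose main list contains the term."""
    needle = term.strip().lower()  # assumed: lists are stored lowercase
    return sorted(f for f, words in main_lists.items() if needle in words)


main_lists = load_main_wordlists(SAMPLE)
print(features_matching("Abacavir", main_lists))
```

Converting each main list to a set keeps lookups cheap when the word lists grow large.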

Included datasets, repositories, and their Licenses:

The names, URLs, and licenses of the various open source repositories contained in this folder are listed below. The selected contents of these repositories have been included as a copy, instead of as submodules. This removes the direct dependency on the remote, which might change or be removed. Some datasets contained in these repositories have been mined from open databases or by scraping a web interface. If you are an owner of one of the datasets listed below and have an objection to its use in this software, feel free to open an issue.

Lists of names are generally not copyrightable. However, since work might have gone into either compiling or scraping this information, the licenses for these repositories are also stated and linked for further reference.

| Repository / Dataset | License |
| --- | --- |
| NL-dictionary-file | MIT License |
| voornamen | MIT License |
| DutchFirstNames | MIT License |
| family-names-in-the-netherlands | MIT License |
| DutchNameGenerator | MIT License |
| name-dataset | Apache License 2.0 |
| drugstandards | MIT License |
| RXNORM | - |
| GMIB | - |
| GIPdatabank | - |