Dataset

This folder contains raw datasets and processing scripts for each language/PII feature. Datasets are either:

- scraped from an open web interface
- acquired through an API request / database copy
- sourced from various open-source repositories with compatible licenses

The final datasets are processed and aggregated by generate_dataset.py. The result is stored in aggregate/{language-code}/ds_full.json and exported to the npm package language models:

Dutch aggregate and npm language data

Features

Features are implemented as Python modules. This allows a feature to depend on other features (e.g. when filtering brand names out of medicine names). It also helps when one or more datasets need preprocessing but only the original dataset should be stored. Each module has a NAME variable, which is used as the feature name, and a get_wordlists() function, which returns the word lists for the feature as a dictionary of dictionaries. The main word list contains text that, if matched, could indicate the presence of the feature. Other word lists can be added to provide more certainty, or contextual hints that might indicate the presence or absence of the feature.
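As an illustration, a minimal feature module might look like the sketch below. Only NAME and get_wordlists() come from the description above. The "context" list name, the assumed inner layout (each word list maps a term to an empty metadata dictionary), the helper `_normalise`, and the inline sample data are assumptions made for this example; a real module would load its data from raw/.

```python
# Hypothetical feature module, e.g. {language-code}/feature_sets/medicine_names/__init__.py
NAME = "medicine_names"

# Inline sample data stands in for files that would normally live in raw/.
_RAW_MEDICINES = ["Aafact", "Abacavir", "abacavir accord", " Abacavir "]
_RAW_CONTEXT = ["tablet", "mg"]


def _normalise(terms):
    """Strip, lower-case, and de-duplicate terms, returning them sorted."""
    return sorted({t.strip().lower() for t in terms if t.strip()})


def get_wordlists():
    """Return the word lists for this feature as a dictionary of dictionaries.

    Assumed layout (not confirmed by the README): {word_list_name: {term: metadata}}.
    """
    main = {term: {} for term in _normalise(_RAW_MEDICINES)}
    # Secondary list with contextual hints; the name "context" is an assumption.
    context = {term: {} for term in _normalise(_RAW_CONTEXT)}
    return {"main": main, "context": context}


wordlists = get_wordlists()
print(NAME, list(wordlists["main"]))
```

Keeping normalisation inside the module means the files in raw/ can stay untouched while the exported word lists are cleaned on every run.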

Structure

Folder

- generate_dataset.py - aggregates all features / datasets and exports the files to the correct locations
- {language-code} - subfolder for a language
  - feature_sets - folder containing all "feature modules"
    - __init__.py - topmost module, contains all "feature modules" in a list
    - feature_name/ - folder containing feature module and raw datasets
      - __init__.py - "feature module", containing feature dataset loading and preprocessing
      - raw/ - raw datasets (such as a database / scripts for aggregation or hand-made lists)
  - feature_templates - contains benchmark and mapping data
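The aggregation step can be pictured roughly as follows. This is a sketch under assumptions, not the actual generate_dataset.py: the function name `aggregate_features`, the `pii_dataset_{language-code}` naming pattern (generalised from the Dutch example), the `SimpleNamespace` stand-in for a feature module, and the conversion of each word list to a sorted list are illustrative choices. Only NAME, get_wordlists(), and the output keys come from this README.

```python
import json
from types import SimpleNamespace


def aggregate_features(language_code, feature_modules, version=0):
    """Combine the word lists of all feature modules into one dataset dict."""
    wordlists = {}
    for module in feature_modules:
        # Each module exposes NAME and get_wordlists(), as described above.
        wordlists[module.NAME] = {
            list_name: sorted(terms)  # sorted() over a dict yields its keys
            for list_name, terms in module.get_wordlists().items()
        }
    return {
        "name": f"pii_dataset_{language_code}",  # assumed naming pattern
        "version": version,
        "wordlists": wordlists,
    }


# Stand-in for a real feature module (hypothetical data).
medicine = SimpleNamespace(
    NAME="medicine_names",
    get_wordlists=lambda: {"main": {"abacavir": {}, "aafact": {}}},
)

dataset = aggregate_features("nl", [medicine])
print(json.dumps(dataset, indent=4))
```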

`aggregate/{language-code}/ds_full.json` format

{
    "name": "pii_dataset_nl",
    "version": 0,
    "wordlists": {
        "medicine_names": {
            "main": [
                "aafact",
                "abacavir",
                "abacavir accord",
                "abacavir hexal",
                ...
            ],
            ...
        },
        ...
    }
}
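Reading the aggregate file back needs only the standard library. The sketch below assumes the format shown above. The helper names `load_wordlist` and `find_matches`, the naive substring matching, and the temporary sample file are illustrative only; how the npm package actually matches text is not described here.

```python
import json
import tempfile
from pathlib import Path


def load_wordlist(path, feature, list_name="main"):
    """Return one word list of one feature from a ds_full.json file as a set."""
    with open(path, encoding="utf-8") as fh:
        dataset = json.load(fh)
    return set(dataset["wordlists"][feature][list_name])


def find_matches(text, terms):
    """Return the terms that occur in text (naive, case-insensitive substring match)."""
    lowered = text.lower()
    return sorted(term for term in terms if term in lowered)


# Write a tiny sample file so the example runs without the real dataset.
sample = {
    "name": "pii_dataset_nl",
    "version": 0,
    "wordlists": {"medicine_names": {"main": ["aafact", "abacavir"]}},
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ds_full.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    medicines = load_wordlist(path, "medicine_names")

matches = find_matches("Patient gebruikt Abacavir 300 mg", medicines)
print(matches)
```

Substring matching is used here only to keep the example short; it would also match terms embedded in longer words, so a real matcher would likely need token boundaries.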

Included datasets, repositories, and their Licenses:

The names, URLs, and licenses of the open-source repositories contained in this folder are listed below. Selected contents of these repositories are included as copies rather than as submodules, which removes the direct dependency on a remote that might change or be removed. Some datasets in these repositories were mined from open databases or scraped from a web interface. If you own one of the datasets listed below and object to its use in this software, feel free to open an issue.

Lists of names are generally not copyrightable. However, since work may have gone into compiling or scraping this information, the licenses for these repositories are also stated and linked for further reference.

Repository / Dataset	License
NL-dictionary-file	MIT License
voornamen	MIT License
DutchFirstNames	MIT License
family-names-in-the-netherlands	MIT License
DutchNameGenerator	MIT License
name-dataset	Apache License 2.0
drugstandards	MIT License
RXNORM	-
GMIB	-
GIPdatabank	-
