The Big Data ML algorithms used:
- Classification
  - Logistic Regression
  - Linear SVM
  - Random Forest
.
├── data
├── images
│   └── logos
├── logs
├── outputs
├── reports
└── utils
7 directories
The data was obtained from here and was first introduced by Schmidtmann et al. [1].
Dataset info:
- features
  - id_1: internal identifier of the first record
  - id_2: internal identifier of the second record
  - cmp_fname_c1: agreement of first name, first component
  - cmp_fname_c2: agreement of first name, second component
  - cmp_lname_c1: agreement of family name, first component
  - cmp_lname_c2: agreement of family name, second component
  - cmp_sex: agreement of sex
  - cmp_bd: agreement of date of birth, day component
  - cmp_bm: agreement of date of birth, month component
  - cmp_by: agreement of date of birth, year component
  - cmp_plz: agreement of postal code
  - is_match: matching status (TRUE for matches, FALSE for non-matches)
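
The column list above can be turned into a small loader. The following is a minimal sketch, not the code used in this repository: it assumes the data is a CSV file with a header row, that agreement values are numeric, and that missing values are written as `?`. The names `parse_row`, `load_records`, and the two sample rows are hypothetical and exist only for illustration.

```python
import csv
import io

# Column names taken from the feature list above.
COLUMNS = [
    "id_1", "id_2",
    "cmp_fname_c1", "cmp_fname_c2",
    "cmp_lname_c1", "cmp_lname_c2",
    "cmp_sex", "cmp_bd", "cmp_bm", "cmp_by", "cmp_plz",
    "is_match",
]

MISSING = "?"  # assumption: marker used for missing agreement values


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed Python values."""
    parsed = {}
    for name in COLUMNS:
        raw = row[name].strip()
        if name in ("id_1", "id_2"):
            parsed[name] = int(raw)
        elif name == "is_match":
            parsed[name] = raw.upper() == "TRUE"
        else:
            # Agreement columns: numeric value, or None when missing.
            parsed[name] = None if raw == MISSING else float(raw)
    return parsed


def load_records(text):
    """Parse CSV text with a header line into a list of typed records."""
    reader = csv.DictReader(io.StringIO(text))
    return [parse_row(row) for row in reader]


# Hypothetical two-row sample, made up for illustration only.
SAMPLE = (
    ",".join(COLUMNS) + "\n"
    "101,202,1,?,1,?,1,1,1,1,0,TRUE\n"
    "303,404,0.25,?,0,?,1,0,0,1,0,FALSE\n"
)

records = load_records(SAMPLE)
```

Keeping missing values as `None` rather than replacing them with 0 leaves the imputation choice to the later modeling step.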
Should you have any questions, feel free to contact TekBoArt @tekboart.
[1] Irene Schmidtmann, Gael Hammer, Murat Sariyar, Aslihan Gerhold-Ay: Evaluation des Krebsregisters NRW Schwerpunkt Record Linkage. Technical Report, IMBEI 2009.
- Refer to the file LICENSE for more information regarding the license of this repository.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.