GitHub - janmotl/cv: Stratified cross-validation for multi-label classification

Stratified cross-validation for multi-label classification

One way to evaluate the accuracy of a machine learning model is cross-validation. For classification, we may want to use stratified cross-validation, which preserves in each fold the class distribution of the whole data set. However, common implementations of stratified cross-validation work only with a single label. This code performs a stratified assignment of multi-label samples into folds, where all labels are nominal.

Assignment objectives

Preserve the distribution of individual class-values across folds (1-way interaction)
Preserve the distribution of 2-way interactions between individual class-values across folds
Preserve the distribution of n-way interactions between individual class-values across folds, where n is the count of labels
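To make the three objectives concrete, here is a minimal sketch that counts 1-way, 2-way, and n-way label-value combinations in each fold and checks whether the folds agree. The function name `interaction_counts`, the toy data, and the hand-made fold assignment are illustrative assumptions, not code from this repository.

```python
from collections import Counter
from itertools import combinations


def interaction_counts(rows, order):
    """Count value combinations over every subset of `order` label columns."""
    counts = Counter()
    n_labels = len(rows[0])
    for cols in combinations(range(n_labels), order):
        for row in rows:
            # Key: which label columns, and which values they take together.
            counts[(cols, tuple(row[c] for c in cols))] += 1
    return counts


# Toy data: 3 nominal labels per sample (hypothetical values).
samples = [
    ("a", "x", "p"),
    ("a", "x", "p"),
    ("a", "y", "q"),
    ("a", "y", "q"),
    ("b", "x", "q"),
    ("b", "x", "q"),
]
folds = [0, 1, 0, 1, 0, 1]  # a hand-made assignment into 2 folds

# order 1 = individual class-values, 2 = pairs, 3 = all labels (n-way here).
for order in (1, 2, 3):
    per_fold = [
        interaction_counts([s for s, f in zip(samples, folds) if f == k], order)
        for k in range(2)
    ]
    print(order, per_fold[0] == per_fold[1])
```

In this toy case every label combination occurs exactly twice, so splitting each pair across the two folds matches the counts at all three orders. On real data an exact match is usually impossible, and the goal becomes keeping the per-fold counts as close as possible.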

Literature review

A quick way to extend stratified cross-validation to the multi-label setting is to concatenate the class labels into a single label and then run standard stratified cross-validation. This approach preserves the n-way interactions listed above, but nothing else.
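The concatenation approach can be sketched as follows. The helper names `concat_labels` and `stratified_folds` are hypothetical, and the round-robin splitter is a simplified stand-in for a library's single-label stratified splitter, not the implementation used by any particular package.

```python
from collections import defaultdict


def concat_labels(rows, sep="|"):
    """Join each sample's labels into one combined class label."""
    return [sep.join(str(v) for v in row) for row in rows]


def stratified_folds(classes, n_folds):
    """Simplified single-label stratification: deal each class round-robin."""
    by_class = defaultdict(list)
    for idx, cls in enumerate(classes):
        by_class[cls].append(idx)
    folds = [None] * len(classes)
    next_fold = 0  # carried across classes so that fold sizes stay balanced
    for cls in sorted(by_class):
        for idx in by_class[cls]:
            folds[idx] = next_fold
            next_fold = (next_fold + 1) % n_folds
    return folds


# Toy data: 2 nominal labels per sample (hypothetical values).
samples = [
    ("a", "x"), ("a", "x"),
    ("a", "y"), ("a", "y"),
    ("b", "x"), ("b", "x"),
    ("b", "y"), ("b", "y"),
]
combined = concat_labels(samples)
folds = stratified_folds(combined, n_folds=2)
print(list(zip(combined, folds)))
```

The weakness shows up when most label combinations are rare. If each combined class contains a single sample, the splitter has nothing to balance, and the distribution of the individual class-values across folds is left to chance.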

Another approach is to maintain 1-way interactions. This was done by (Sechidis, 2011) and later extended by (Szymański, 2017) to optimize both 1-way and 2-way interactions. We optimize all three criteria at once.

Why bother?

Stratified cross-validation generally improves on plain cross-validation in the following respects:

It makes sure that each class-value is present in the testing set. This is important for the evaluation of many performance measures.
It maintains the same class prior distribution across all the folds. This increases the measured testing accuracy and reduces the variance of the testing accuracy.

Solution

We use Integer Linear Programming (ILP) to reach the optimal solution, so the solution is exact rather than an approximation. The disadvantage, compared with the greedy solutions from (Sechidis, 2011) and (Szymański, 2017), is that the calculation is slow. We therefore provide pre-calculated assignments for 10-fold cross-validation for some common multi-label classification data sets at the Multi-Label Classification Dataset Repository.
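To illustrate what "exact" means here, the sketch below finds an optimal assignment for a toy data set by exhaustive search. It is a stand-in for an ILP solver, not the repository's formulation. The objective (sum of absolute deviations of per-fold counts from the ideal share), the equal-fold-size constraint, and all function names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations, product


def all_interaction_counts(rows):
    """Count 1-way, 2-way, and n-way label-value combinations."""
    counts = Counter()
    if not rows:
        return counts
    n_labels = len(rows[0])
    for order in sorted({1, 2, n_labels}):
        # combinations() yields nothing when order exceeds n_labels.
        for cols in combinations(range(n_labels), order):
            for row in rows:
                counts[(cols, tuple(row[c] for c in cols))] += 1
    return counts


def imbalance(samples, folds, n_folds):
    """Sum of absolute deviations of per-fold counts from the ideal share."""
    total = all_interaction_counts(samples)
    cost = 0.0
    for k in range(n_folds):
        in_fold = all_interaction_counts(
            [s for s, f in zip(samples, folds) if f == k]
        )
        for key, n in total.items():
            cost += abs(in_fold[key] - n / n_folds)
    return cost


def exact_assignment(samples, n_folds):
    """Try every assignment with equal fold sizes and keep the best one."""
    size = len(samples) // n_folds  # assumes len(samples) divides evenly
    best, best_cost = None, float("inf")
    for folds in product(range(n_folds), repeat=len(samples)):
        if any(folds.count(k) != size for k in range(n_folds)):
            continue
        cost = imbalance(samples, folds, n_folds)
        if cost < best_cost:
            best, best_cost = folds, cost
    return best, best_cost


# Toy data: 3 nominal labels per sample (hypothetical values).
samples = [
    ("a", "x", "p"), ("a", "x", "p"),
    ("a", "y", "q"), ("a", "y", "q"),
    ("b", "x", "q"), ("b", "x", "q"),
]
best, best_cost = exact_assignment(samples, n_folds=2)
print(best, best_cost)
```

The search space grows as the number of folds raised to the number of samples, so this brute-force version is only usable on toy inputs. The same growth is why an exact solution is slow on real data sets and why pre-calculated assignments are useful.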

Acknowledgements

The data are from the Multi-Label Classification Dataset Repository by Moyano et al.
The first published article about stratified multi-label cross-validation is (Sechidis, 2011).
The evaluation metrics were implemented in scikit-multilearn by Szymański et al.

Without their work, this page would not exist.
