Stratified cross-validation for multi-label classification

janmotl/cv

One way to evaluate the accuracy of machine learning models is cross-validation. For classification, we may want stratified cross-validation, which keeps the class distribution in each fold close to the class distribution of the whole data set. However, common implementations of stratified cross-validation work with only a single label. This code performs a stratified assignment of multi-label samples into folds, where all labels are nominal.

Assignment objectives

  1. Preserve the distribution of individual class-values across folds (1-way interaction)
  2. Preserve the distribution of 2-way interactions between individual class-values across folds
  3. Preserve the distribution of n-way interactions between individual class-values across folds, where n is the count of labels
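
To make the three objectives concrete, the following minimal sketch counts how often each combination of class-values occurs in each fold. It is an illustration only and not code from this repository; the function name `interaction_counts` and the toy data are assumptions. Setting `order` to 1, 2, or the number of labels gives the 1-way, 2-way, and n-way counts respectively.

```python
from collections import Counter
from itertools import combinations


def interaction_counts(samples, folds, order):
    """Count each `order`-way combination of class-values per fold.

    samples: list of tuples, one nominal value per label
    folds:   list of fold indices, parallel to `samples`
    """
    counts = {}
    for labels, fold in zip(samples, folds):
        fold_counter = counts.setdefault(fold, Counter())
        # A key records which labels are involved and the values they take.
        for label_idx in combinations(range(len(labels)), order):
            key = tuple((i, labels[i]) for i in label_idx)
            fold_counter[key] += 1
    return counts


# Toy data (hypothetical): two labels, two folds.
samples = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"),
           ("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
folds = [0, 0, 0, 0, 1, 1, 1, 1]

one_way = interaction_counts(samples, folds, order=1)
two_way = interaction_counts(samples, folds, order=2)
```

An assignment meets an objective perfectly when the counters of all folds are identical for the corresponding `order`, as they are in this toy example.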

Literature review

A quick way to extend stratified cross-validation to the multi-label setting is to concatenate the class labels into a single label and run standard stratified cross-validation on it. This approach preserves the n-way interactions listed above, but nothing else.
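
The concatenation approach can be sketched as follows. This is a simplified stand-in, not the implementation of any particular library: the function name `concatenated_label_folds` is hypothetical, the `|` separator is assumed not to occur inside label values, and the round-robin dealing replaces whatever a standard stratified splitter would do internally.

```python
from collections import defaultdict


def concatenated_label_folds(samples, n_folds):
    """Assign folds by stratifying on the concatenation of all labels."""
    groups = defaultdict(list)
    for idx, labels in enumerate(samples):
        # Concatenate the label values into one composite class.
        groups["|".join(map(str, labels))].append(idx)

    folds = [None] * len(samples)
    next_fold = 0
    for composite in sorted(groups):
        # Deal the members of each composite class round-robin.
        # The offset carries over between classes to keep fold sizes balanced.
        for idx in groups[composite]:
            folds[idx] = next_fold
            next_fold = (next_fold + 1) % n_folds
    return folds


# Toy data (hypothetical): two labels, three distinct combinations.
samples = [("a", "x")] * 4 + [("a", "y")] * 2 + [("b", "y")] * 2
folds = concatenated_label_folds(samples, n_folds=2)
```

Each composite class is spread evenly over the folds, which is why only the n-way interactions are controlled directly.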

Another approach is to preserve 1-way interactions, as done by (Sechidis, 2011). This was later extended by (Szymański, 2017) to optimize both 1-way and 2-way interactions. We optimize all three criteria at once.

Why bother?

Stratified cross-validation generally improves on plain cross-validation in the following respects:

  1. It makes sure that each class-value is present in the testing set. This is important for the evaluation of many performance measures.
  2. It maintains the same class prior distribution across all the folds. This tends to increase the measured testing accuracy and reduce its variance.
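
The first point can be checked for any fold assignment with a short helper. The sketch below is illustrative; the function name `missing_class_values` and the toy data are assumptions and do not come from this repository.

```python
def missing_class_values(samples, folds):
    """Return, per fold, the (label index, value) pairs absent from that fold."""
    all_values = {(i, v) for labels in samples for i, v in enumerate(labels)}
    present = {}
    for labels, fold in zip(samples, folds):
        # enumerate(labels) yields (label index, value) pairs.
        present.setdefault(fold, set()).update(enumerate(labels))
    return {fold: all_values - seen for fold, seen in present.items()}


# Toy data (hypothetical): one label, with the rare value "b" landing in fold 1 only.
samples = [("a",), ("a",), ("a",), ("b",)]
folds = [0, 0, 1, 1]
missing = missing_class_values(samples, folds)
```

Here fold 0 lacks the value `b`, so any measure that needs at least one test sample of `b` cannot be computed on that fold.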

Solution

We use Integer Linear Programming (ILP) to reach the optimal solution. Hence, the solution is exact rather than an approximation. The disadvantage, compared to the greedy solutions from (Sechidis, 2011) and (Szymański, 2017), is that the calculation is slow. Hence, we provide pre-calculated assignments for 10-fold cross-validation for some common multi-label classification data sets from the Multi-Label Classification Dataset Repository.
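
This page does not spell out the ILP formulation, so the following is only one plausible way to write such a problem, given as a sketch under stated assumptions. All symbols are introduced here for illustration: $x_{i,f}$ is 1 if sample $i$ is assigned to fold $f$, $a_{i,c}$ is 1 if sample $i$ exhibits the class-value combination $c$ (1-way, 2-way, or n-way), $K$ is the number of folds, and $d_{c,f}$ is an auxiliary variable that bounds the deviation from the ideal count.

```latex
\begin{align}
\min_{x,\,d} \quad & \sum_{c} \sum_{f=1}^{K} d_{c,f} \\
\text{s.t.} \quad
  & \sum_{f=1}^{K} x_{i,f} = 1 && \forall i \\
  & d_{c,f} \ge \sum_{i} a_{i,c}\, x_{i,f} - \frac{1}{K} \sum_{i} a_{i,c} && \forall c, f \\
  & d_{c,f} \ge \frac{1}{K} \sum_{i} a_{i,c} - \sum_{i} a_{i,c}\, x_{i,f} && \forall c, f \\
  & x_{i,f} \in \{0, 1\}, \quad d_{c,f} \ge 0
\end{align}
```

The two inequalities on $d_{c,f}$ linearize the absolute deviation between the count of combination $c$ in fold $f$ and its ideal share of $1/K$ of the total. The objective actually used by the code may weight or combine the three criteria differently.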

Acknowledgements

  1. The data are from the Multi-Label Classification Dataset Repository by Moyano et al.
  2. The first published article about stratified multi-label cross-validation is (Sechidis, 2011).
  3. The evaluation metrics were implemented in scikit-multilearn by Szymański et al.

Without their work, this page would not exist.
