
Data set splitting - nested_training #529

Closed
carlpe opened this issue Nov 29, 2022 · 7 comments · Fixed by #549

Comments


carlpe commented Nov 29, 2022

When configuring data set splitting under "nested_training", the split is not done the way I expect.

Example: I have 93 subjects. I want to use 10 subjects for validation and 3 for testing. With the settings below, the split turns out as follows: training: 55, validation: 8, testing: 30.

nested_training:
  {
    testing: -3, # this controls the testing data splits for final model evaluation; use '1' if this is to be disabled
    validation: -10 # this controls the validation data splits for model training
  }

There is also no "data_testing.csv" saved. Only data_training.csv and data_validation.csv.

It works fine if validation is set to -10 and testing is set to 1. Then there are 83 subjects for training, 10 for validation, and none for testing.

Have I missed something here? Also, the documentation on the web and in the sample config is sparse. I suggest adding more information, for example about what happens when we choose a positive or a negative number.

GaNDLF Version
0.16

Desktop:
OS: Linux Ubuntu 20.04

@Geeks-Sid (Collaborator)

Hey @carlpe, thank you for using GaNDLF.

I think we should improve the documentation to explain this better.

The numbers you see here are not the number of subjects to be used for each split, but rather the number of folds into which the data is split; a negative sign indicates that only a single fold is run (rather than iterating over all folds).

So setting validation to -10 gives a 10-fold validation split (run once), rather than holding out 10 subjects, which is the use case you want.

I would suggest setting

nested_training:
  {
    testing: -30, # this controls the testing data split for final model evaluation; use '1' if this is to be disabled
    validation: -9 # this controls the validation data split for model training
  }

This would set the testing split to 93/30 ≈ 3 cases and the validation split to (93 − 3)/9 = 10 cases.
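To make the fold-to-subject-count arithmetic concrete, here is a minimal sketch (my own illustration, not GaNDLF's actual splitting code) of how fold counts translate into approximate subject counts, assuming the validation fold is taken from the subjects remaining after the testing fold is removed:

```python
# Illustrative sketch (not GaNDLF's implementation): map fold counts
# from the nested_training config to approximate subject counts.

def split_sizes(n_subjects: int, testing_folds: int, validation_folds: int):
    """Return approximate (training, validation, testing) subject counts.

    A negative fold count in the config means "run only one fold";
    the fold size is determined by its absolute value.
    """
    k_test = abs(testing_folds)
    k_val = abs(validation_folds)

    # testing: 1 disables the testing split entirely
    n_test = 0 if k_test == 1 else n_subjects // k_test
    remaining = n_subjects - n_test

    # the validation fold is carved out of what remains
    n_val = remaining // k_val
    n_train = remaining - n_val
    return n_train, n_val, n_test

print(split_sizes(93, -30, -9))  # → (80, 10, 3)
```

Running the same arithmetic on the originally reported config (testing: -3, validation: -10) gives roughly (56, 6, 31), close to the observed 55/8/30; the exact numbers depend on how GaNDLF distributes remainder subjects across folds.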

Let us know if this worked for you!


carlpe commented Nov 30, 2022

Thank you for clearing that up @Geeks-Sid !

Your suggestion made the splitting work the way I wanted. It was, however, hard to work this out from the current documentation 🙂

@carlpe carlpe closed this as completed Nov 30, 2022
@sarthakpati (Collaborator)

I would suggest keeping this issue open and adding a PR that links to it from the documentation. Thoughts?

@carlpe carlpe reopened this Dec 1, 2022
@carlpe carlpe changed the title Bug in data set splitting - nested_training Data set splitting - nested_training Dec 1, 2022

carlpe commented Dec 1, 2022

I wonder if it would be possible to rewrite the code so that the user can directly choose how many subjects to use for validation and testing, without having to calculate the number from the number of folds?

@github-actions (Contributor)

Stale issue message

@sarthakpati (Collaborator)

I wonder if it would be possible to rewrite the code so that the user can directly choose how many subjects to use for validation and testing, without having to calculate the number from the number of folds?

Hey @carlpe, this probably isn't a solution we would advocate, since it would make it too easy for a user to specify a number that is too large or too small to generate meaningful results. Using folds is the more appropriate mechanism.
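For readers unfamiliar with the fold mechanism the maintainers prefer, here is a self-contained sketch (illustrative only, not GaNDLF's internals) of how k-fold cross-validation partitions subjects; every subject lands in exactly one validation fold, which is what makes the fold count a safer knob than a raw subject count:

```python
# Illustrative k-fold partitioning (not GaNDLF's implementation).

def kfold_indices(n: int, k: int):
    """Yield (train, val) index lists for k-fold cross-validation."""
    idx = list(range(n))
    # distribute the remainder: the first n % k folds get one extra subject
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, val
        start += size

folds = list(kfold_indices(93, 9))
print(len(folds), len(folds[0][1]))  # → 9 11
```

A positive fold count in the config would iterate over all such folds; a negative one corresponds to stopping after the first.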

sarthakpati added a commit that referenced this issue Feb 2, 2023

carlpe commented Feb 3, 2023

OK, thank you for elaborating on this.
