
Data set splitting - nested_training #529

Closed
carlpe opened this issue Nov 29, 2022 · 7 comments · Fixed by #549

Comments


carlpe commented Nov 29, 2022

When configuring data set splitting under "nested_training", the split is not done the way I expect.

Example: I have 93 subjects. I want to use 10 subjects for validation and 3 for testing. With the settings below, the split turns out as follows: training: 55, validation: 8, testing: 30.

nested_training:
  {
    testing: -3, # this controls the testing data splits for final model evaluation; use '1' if this is to be disabled
    validation: -10 # this controls the validation data splits for model training
  }

There is also no "data_testing.csv" saved. Only data_training.csv and data_validation.csv.

It works fine if validation is set to -10 and testing is set to 1. Then there are 83 subjects for training, 10 for validation, and none for testing.

Have I missed something here? Also, the documentation on the web and in the sample config is sparse. I suggest adding more information, for example about what happens when we choose a positive or a negative number.

GaNDLF Version
0.16

Desktop:
OS: Linux Ubuntu 20.04

@Geeks-Sid (Collaborator)

Hey @carlpe, thank you for using GaNDLF.

I think we should improve the documentation to explain this better.

The numbers you see here are not the number of subjects to be used for each split, but rather the number of folds into which the data is split; a negative sign indicates that only a single fold is run (rather than iterating over all folds).

So setting validation to -10 gives a 10-fold validation split (run once), rather than holding out 10 subjects, which is the use case you want.

I would suggest setting

nested_training:
  {
    testing: -30, # this controls the testing data split for final model evaluation; use '1' if this is to be disabled
    validation: -9 # this controls the validation data split for model training
  }

This would set the testing split to 93/30 ≈ 3 cases and the validation split to (93 − 3)/9 = 10 cases.
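To make the fold-to-subject-count arithmetic concrete, here is a minimal sketch (my own illustration, not GaNDLF's actual splitting code) of how fold counts translate into approximate subject counts, assuming the validation fold is taken from the subjects remaining after the testing fold is removed:

```python
# Illustrative sketch (not GaNDLF's implementation): map fold counts
# from the nested_training config to approximate subject counts.

def split_sizes(n_subjects: int, testing_folds: int, validation_folds: int):
    """Return approximate (training, validation, testing) subject counts.

    A negative fold count in the config means "run only one fold";
    the fold size is determined by its absolute value.
    """
    k_test = abs(testing_folds)
    k_val = abs(validation_folds)

    # testing: 1 disables the testing split entirely
    n_test = 0 if k_test == 1 else n_subjects // k_test
    remaining = n_subjects - n_test

    # the validation fold is carved out of what remains
    n_val = remaining // k_val
    n_train = remaining - n_val
    return n_train, n_val, n_test

print(split_sizes(93, -30, -9))  # → (80, 10, 3)
```

Running the same arithmetic on the originally reported config (testing: -3, validation: -10) gives roughly (56, 6, 31), close to the observed 55/8/30; the exact numbers depend on how GaNDLF distributes remainder subjects across folds.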

Let us know if this worked for you!


carlpe commented Nov 30, 2022

Thank you for clearing that up @Geeks-Sid !

Your suggestion made the splitting work the way I wanted. It was, however, hard to work this out from the current documentation 🙂

@carlpe carlpe closed this as completed Nov 30, 2022
@sarthakpati (Collaborator)

I would suggest keeping this issue open and adding a PR that links to it from the documentation. Thoughts?

@carlpe carlpe reopened this Dec 1, 2022
@carlpe carlpe changed the title Bug in data set splitting - nested_training Data set splitting - nested_training Dec 1, 2022

carlpe commented Dec 1, 2022

I wonder if it would be possible to rewrite the code so that the user can directly choose how many subjects to use for validation and testing, without having to calculate the number from the number of folds?

@github-actions (Contributor)

Stale issue message

@sarthakpati (Collaborator)

I wonder if it would be possible to rewrite the code so that the user can directly choose how many subjects to use for validation and testing, without having to calculate the number from the number of folds?

Hey @carlpe, this probably isn't a solution we would advocate, since it would make it too easy for a user to specify a number that is too large or too small to generate meaningful results. Using folds is the more appropriate mechanism.
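For readers unfamiliar with the fold mechanism the maintainers prefer, here is a self-contained sketch (illustrative only, not GaNDLF's internals) of how k-fold cross-validation partitions subjects; every subject lands in exactly one validation fold, which is what makes the fold count a safer knob than a raw subject count:

```python
# Illustrative k-fold partitioning (not GaNDLF's implementation).

def kfold_indices(n: int, k: int):
    """Yield (train, val) index lists for k-fold cross-validation."""
    idx = list(range(n))
    # distribute the remainder: the first n % k folds get one extra subject
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, val
        start += size

folds = list(kfold_indices(93, 9))
print(len(folds), len(folds[0][1]))  # → 9 11
```

A positive fold count in the config would iterate over all such folds; a negative one corresponds to stopping after the first.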

sarthakpati added a commit that referenced this issue Feb 2, 2023

carlpe commented Feb 3, 2023

OK, thank you for elaborating on this.
