-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data set splitting - nested_training #529
Comments
Hey @carlpe, Thank you for using gandlf. I think we should improve the documentation to explain this better. Here the numbers you see are not the number of subjects to be used in the training process but rather the number of folds in which the data should be split and '-' indicates if this is to be run only once. So setting validation to -10 means that the 10th fold of the validation subset is to be used instead of 10 subjects, the use case you want to have. I would suggest setting
This would set the testing to 93/30 ~= 3 cases and validation to 93/9 ~=10 cases. Let us know if this worked for you! |
Thank you for clearing that up @Geeks-Sid ! Your suggestion made the splitting the way I wanted it. It was however hard to understand this out of the current documentation 🙂 |
I would suggest keeping this issue open and adding a PR to the documentation linking this. Thoughts? |
I wonder if it would be possible to rewrite the code in order for the user to choose how many he/she wants to use for validation and testing directly without having to calculate the number based on the number of folds? |
Stale issue message |
Hey @carlpe, this probably won't be a solution we would advocate for, since it would make it too easy for a user to specify a number that was too large/small to generate meaningful results. Using folds would be the more appropriate mechanism. |
Ok, thank you for elaborating on this |
When choosing data set splitting under "nested_training" in configuration, the splitting is not done correctly in my case.
Example; I have 93 subjects. I want to use 10 subjects for validation and 3 for testing. With the below settings, the split turns out as follows: training: 55 validation: 8 testing: 30
There is also no "data_testing.csv" saved. Only data_training.csv and data_validation.csv.
It works fine if validation is set to -10 and testing is set to 1.. Then it will be 83 subjects for training and 10 for validation, and none for testing.
Have I missed something here? Also the documentation on web and sample config is sparse. Suggestion to add more info for example what happens when we choose a positive or negative number.
GaNDLF Version
0.16
Desktop (please complete the following information):
OS: Linux Ubuntu 20.04
The text was updated successfully, but these errors were encountered: