Preventing data leakage when splitting data in setup method #16868
Unanswered
bencwallace asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
The example code in What is a DataModule? randomly splits the dataset into training and validation sets as part of the `setup` method. When executing on multiple devices, this will generally produce a different split on each device. Since (I assume) gradients are synchronized across devices, this would mean there is overlap between the data used for training and the data used for validation, i.e. data leakage.

Am I missing something here? What is the recommended way to avoid this issue? My guess would have been to perform the split as part of `__init__` or `prepare_data`, but apparently that doesn't work.
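For concreteness, here is a minimal sketch of the pattern I mean, along with one possible workaround: passing a seeded `torch.Generator` to `random_split` so every DDP process computes the same partition. The class name, dataset, and split sizes are made up for illustration.

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import TensorDataset, random_split


class MyDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        # Hypothetical toy dataset, just for illustration.
        dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
        # setup() runs once per process under DDP. Without a fixed seed,
        # each process would draw its own random split, so a sample could
        # land in train on one rank and in val on another.
        generator = torch.Generator().manual_seed(42)
        self.train_set, self.val_set = random_split(
            dataset, [800, 200], generator=generator
        )
```

Seeding at least makes the split deterministic and identical across ranks, but I'd like to know whether this is the recommended approach or whether the split is supposed to happen somewhere else entirely.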