Preventing data leakage when splitting data in setup method #16868
Unanswered
bencwallace asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
The example code in What is a DataModule? randomly splits the dataset into training and validation sets as part of the `setup` method. When executing on multiple devices, this will generally produce a different split on each device. Since (I assume) gradients are synchronized across devices, this would mean there is overlap between the data used for training and the data used for validation, i.e. data leakage.

Am I missing something here? What is the recommended way to avoid this issue? My guess would have been to perform the split as part of `__init__` or `prepare_data`, but apparently that doesn't work.
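For concreteness, here is a minimal sketch of the pattern I mean, along with one possible workaround: passing a seeded `torch.Generator` to `random_split` so every DDP process computes the same partition. The class name, dataset, and split sizes are made up for illustration.

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import TensorDataset, random_split


class MyDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        # Hypothetical toy dataset, just for illustration.
        dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
        # setup() runs once per process under DDP. Without a fixed seed,
        # each process would draw its own random split, so a sample could
        # land in train on one rank and in val on another.
        generator = torch.Generator().manual_seed(42)
        self.train_set, self.val_set = random_split(
            dataset, [800, 200], generator=generator
        )
```

Seeding at least makes the split deterministic and identical across ranks, but I'd like to know whether this is the recommended approach or whether the split is supposed to happen somewhere else entirely.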