
refactor len(datasets) call. #953

Closed
williamFalcon opened this issue Feb 26, 2020 · 4 comments · Fixed by #955
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@williamFalcon (Contributor) commented Feb 26, 2020

🚀 Feature

Let's minimize len(dataset) calls and defer them as late in training as we can (i.e., ideally right before any training loop). This opens up a path to supporting iterable datasets more cleanly.
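
A minimal sketch of the idea (not Lightning's actual implementation; the helper name and loop below are hypothetical): resolve the batch count lazily, right before iteration starts, and fall back gracefully when the loader has no length:

```python
from torch.utils.data import DataLoader

def infer_num_batches(loader: DataLoader):
    """Hypothetical helper: ask the loader for its length as late as
    possible, immediately before the training loop starts."""
    try:
        # len(loader) already accounts for batch_size and drop_last,
        # so we never need to touch loader.dataset here.
        return len(loader)
    except TypeError:
        # An IterableDataset without __len__ makes len(loader) raise
        # TypeError; treat the stream as unsized.
        return None

def training_loop(model, loader):
    num_batches = infer_num_batches(loader)  # deferred until now
    for batch_idx, batch in enumerate(loader):
        ...  # training step; num_batches (or None) drives progress/limits
```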

Motivation

Getting the length prematurely touches the datasets at the wrong time, often causing double loads.

This is a blocker for #948

@ethanwharris (Member)

@williamFalcon I'm happy to take a look at this if needed, just let me know :)

@williamFalcon (Contributor, Author)

Perfect!

@versatran01 commented Feb 26, 2020

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L149

In this function, auto_add_sampler() is always called.

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L92

And inside, even though the comment says

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L93

what it actually does is create a new PyTorch DataLoader. I think this logic is flawed:

  1. The code doesn't agree with the comment, which is confusing.
  2. A data loader should be a very abstract thing that just returns the next batch; it might also know the size of the dataset. The current implementation makes assumptions about what a data loader is, which I think is unnecessary. For example, any call to loader.batch_size or loader.dataset should be avoided in the default setting, when all we need is to keep iterating the dataloader. I agree these may be necessary in more advanced settings.

What I suggest is that in the default setting, we only call len(loader), and only to optionally determine the size (see the sketch below).
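
As a hedged illustration of this suggestion (the helper and the duck-typed loader below are hypothetical, not code from the repo), the trainer would treat the loader as an opaque iterable of batches and only ever ask it for its own length:

```python
import torch

def get_num_batches(loader):
    """Hypothetical sketch: rely only on the loader's own notion of
    length, never on loader.dataset or loader.batch_size."""
    try:
        return len(loader)
    except TypeError:
        return float('inf')  # streaming loader: iterate until exhausted

class RandomBatchStream:
    """Duck-typed loader with no .dataset or .batch_size attributes.
    Under the suggestion above, this would still work with the trainer."""
    def __iter__(self):
        for _ in range(100):
            yield torch.randn(32, 10)  # one batch per step
```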

@ethanwharris mentioned this issue Feb 26, 2020
@ethanwharris (Member)

OK, have a look at #955; it should fix a few things and make it easy to add support for iterable datasets everywhere.
