
refactor len(datasets) call. #953

Closed
williamFalcon opened this issue Feb 26, 2020 · 4 comments · Fixed by #955
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@williamFalcon (Contributor) commented Feb 26, 2020

🚀 Feature

Let's minimize len(dataset) calls and defer them as late in training as we can (i.e., ideally right before any training loop). This opens up a path to supporting iterable datasets more cleanly.
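
A minimal sketch of the idea (not Lightning's actual implementation; the helper name and loop below are hypothetical): resolve the batch count lazily, right before iteration starts, and fall back gracefully when the loader has no length:

```python
from torch.utils.data import DataLoader

def infer_num_batches(loader: DataLoader):
    """Hypothetical helper: ask the loader for its length as late as
    possible, immediately before the training loop starts."""
    try:
        # len(loader) already accounts for batch_size and drop_last,
        # so we never need to touch loader.dataset here.
        return len(loader)
    except TypeError:
        # An IterableDataset without __len__ makes len(loader) raise
        # TypeError; treat the stream as unsized.
        return None

def training_loop(model, loader):
    num_batches = infer_num_batches(loader)  # deferred until now
    for batch_idx, batch in enumerate(loader):
        ...  # training step; num_batches (or None) drives progress/limits
```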

Motivation

Getting the length prematurely touches the datasets at the wrong time, often causing double loads.

This is a blocker for #948

@ethanwharris (Member)

@williamFalcon I'm happy to take a look at this if needed, just let me know :)

@williamFalcon (Contributor, Author)

Perfect!

@versatran01 commented Feb 26, 2020

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L149

In this function, auto_add_sampler() is always called.

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L92

And inside, even though the comment says

https://github.com/PyTorchLightning/pytorch-lightning/blob/be244560b24b68b0236a4694707fb9bb63c2e6d0/pytorch_lightning/trainer/data_loading.py#L93

what it actually does is create a new PyTorch DataLoader. I think this logic is flawed:

  1. The code doesn't agree with the comment, which is confusing.
  2. A data loader should be a very abstract thing that just returns the next batch; it might also know the size of the dataset. The current implementation makes assumptions about what a data loader is, which I think is unnecessary. For example, any call to loader.batch_size or loader.dataset should be avoided in the default setting, when all we need is to keep iterating the dataloader. I agree these may be necessary in more advanced settings.

What I suggest is that in the default setting, we only call len(loader), and only to optionally determine the size (see the sketch below).
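
As a hedged illustration of this suggestion (the helper and the duck-typed loader below are hypothetical, not code from the repo), the trainer would treat the loader as an opaque iterable of batches and only ever ask it for its own length:

```python
import torch

def get_num_batches(loader):
    """Hypothetical sketch: rely only on the loader's own notion of
    length, never on loader.dataset or loader.batch_size."""
    try:
        return len(loader)
    except TypeError:
        return float('inf')  # streaming loader: iterate until exhausted

class RandomBatchStream:
    """Duck-typed loader with no .dataset or .batch_size attributes.
    Under the suggestion above, this would still work with the trainer."""
    def __iter__(self):
        for _ in range(100):
            yield torch.randn(32, 10)  # one batch per step
```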

@ethanwharris mentioned this issue Feb 26, 2020
@ethanwharris (Member)

OK, have a look at #955; it should fix a few things and make it easy to add support for iterable datasets everywhere.
