
[DataModule] prepare_data() and setup() not called #2742

Closed
remisphere opened this issue Jul 28, 2020 · 3 comments · Fixed by #2755
Labels: bug (Something isn't working) · data handling (Generic data-related topic) · help wanted (Open to be worked on)

remisphere commented Jul 28, 2020

🐛 Bug

It seems that when using a DataModule to separate training logic from data loading,
of the five methods that should be called, namely
prepare_data(), setup(), train_dataloader(), val_dataloader() and test_dataloader(),
only the last three are actually invoked, which is problematic since the datasets used by the dataloaders are supposed to be assigned in setup().

To Reproduce

Steps to reproduce the behavior:
Run this:

Code sample

from pytorch_lightning import LightningDataModule
from pytorch_lightning.core.lightning import LightningModule
from pytorch_lightning.trainer import Trainer
from torch.nn import L1Loss, Linear
from torch.optim import SGD
from torch.utils.data import DataLoader


class MyDataModule(LightningDataModule):

    def __init__(self):
        super().__init__()

    def prepare_data(self):
        print('in prepare_data, '
              'this should be called before train_dataloader() but is not.')

    def setup(self, stage=None):  # default matches the base-class signature, so a manual dm.setup() also works
        print('in setup, '
              'this should be called before train_dataloader() but is not.')
        self.train_dataset = 'whatever'

    def train_dataloader(self):
        print('in train_dataloader')
        return DataLoader(self.train_dataset)


class MyLightningModule(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = Linear(1, 1)
        self.loss_function = L1Loss()

    def forward(self, x):
        return self.layer(x)

    def configure_optimizers(self):
        return SGD(self.parameters(), lr=0.01)

    def training_step(self, batch, batch_idx):
        print("you won't even get here")
        raise NotImplementedError


data_module = MyDataModule()
model = MyLightningModule()
trainer = Trainer(gpus=1)
trainer.fit(model, data_module)

Running this raises AttributeError: 'MyDataModule' object has no attribute 'train_dataset'.

Expected behavior

When entering train_dataloader(), prepare_data() and setup() should already have been executed, and thus the train_dataset attribute should exist.
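
To make the expected ordering concrete, here is a minimal sketch of a driver in the shape being described (the hook names are the real LightningDataModule API; the fit_with_datamodule function itself is hypothetical, only to illustrate the order):

def fit_with_datamodule(model, dm):
    # 1. one-time work such as downloads; runs before anything else
    dm.prepare_data()
    # 2. per-process state assignment, e.g. dm.train_dataset
    dm.setup('fit')
    # 3. only now is the loader requested, so setup()'s attributes exist
    loader = dm.train_dataloader()
    # ... the training loop then consumes `loader` ...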

Additional context

IMHO, it comes from here

Environment

  • CUDA:
    • GPU:
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.1
    • pyTorch_debug: False
    • pyTorch_version: 1.5.1+cu101
    • pytorch-lightning: 0.9.0rc2
    • tensorboard: 2.3.0
    • tqdm: 4.48.0
  • System:
@remisphere remisphere added bug Something isn't working help wanted Open to be worked on labels Jul 28, 2020
nateraw (Contributor) commented Jul 28, 2020

  1. You're not specifying the datamodule kwarg in trainer.fit(); your last line should look like this: trainer.fit(model, datamodule=data_module)

  2. In this first iteration of LightningDataModule, you have to call setup() and prepare_data() manually on the datamodule instance. We set it up this way so that if you don't want to use Lightning, you can still use your datamodule's loaders with pure PyTorch. I thought of having them called implicitly in the PR, but landed on this for now; I'm not sure users would always want these to run implicitly.

TL;DR: you can update your code to look like this:

# Init a datamodule
dm = MyDataModule()

# Manually call prepare_data and setup. You could put these at the end of __init__ if you want
dm.prepare_data()
dm.setup()

model = MyLightningModule()
trainer = Trainer(gpus=1)
trainer.fit(model, datamodule=dm)
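
For reference, the decoupling described in point 2 means the same datamodule can also drive a plain PyTorch loop with no Trainer at all; a minimal sketch, assuming the MyDataModule from the report:

dm = MyDataModule()
dm.prepare_data()   # explicit, since no Trainer is orchestrating the hooks
dm.setup('fit')

# ordinary PyTorch iteration over the loader, no Lightning involved
for batch in dm.train_dataloader():
    ...  # your own training step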

That being said, we're open to any ideas on making this more intuitive, so feel free to throw out some alternatives. 😄

remisphere (Author) commented Jul 28, 2020

  1. That is not true in 0.9.0rc2: a datamodule passed as the second positional argument is taken care of here (see the sketch after this list).

  2. I don't have a global enough view to know what other users might want, so if it's an intended feature, I'm fine with it.
    I just saw that the manual call is in the docs; my bad for not looking far enough.
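
A rough sketch of the dispatch being referred to in point 1 (the real implementation lives in the 0.9.0rc2 Trainer; the argument handling shown here is illustrative only):

def fit(self, model, train_dataloader=None, val_dataloaders=None, datamodule=None):
    # if the second positional argument is actually a LightningDataModule,
    # treat it as the datamodule rather than as a DataLoader
    if isinstance(train_dataloader, LightningDataModule):
        datamodule = train_dataloader
        train_dataloader = None
    ...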

Anyway, thank you for the clear answer ^^

nateraw (Contributor) commented Jul 29, 2020

@remisphere I totally didn't notice! You were completely right on the dm arg. Things move fast, haha.

Reopening actually, as I think your intended use is more user-friendly.

@nateraw nateraw reopened this Jul 29, 2020
@nateraw nateraw linked a pull request Jul 29, 2020 that will close this issue
@Borda Borda added the data handling Generic data-related topic label Jul 31, 2020