[Grid] You must call wandb.init() before wandb.log() #7028

Closed
turian opened this issue Apr 15, 2021 · 8 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@turian
Contributor

turian commented Apr 15, 2021

🐛 Bug

I'm reopening #1356 because I'm getting this error running my code on grid.ai.

I am getting error:

wandb.errors.error.Error: You must call wandb.init() before wandb.log()

Please reproduce using the BoringModel

Not possible, since Colab has only one GPU, unlike grid.ai.

To Reproduce

On grid.ai or a multi-GPU machine, create a trainer with a WandbLogger and do not specify an accelerator. Run with gpus=-1 and you hit this error.

Despite #2029, the default is ddp_spawn, which triggers this error on grid.ai:

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.

Workaround:

  1. In main, run

import wandb
wandb.init(project...)

(This seems redundant and potentially dangerous/foot-gunny, since you are already passing a WandbLogger to the trainer.)

  2. Make sure the trainer has accelerator=ddp defined (a minimal sketch follows this list).
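A minimal sketch of workaround 2, assuming the same WandbLogger setup as in the repro scripts below ("myproject" is a placeholder):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

logger = WandbLogger(project="myproject")
trainer = Trainer(
    gpus=-1,
    accelerator="ddp",  # explicit ddp instead of the default ddp_spawn
    logger=logger,
)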

Expected behavior

The wandb logger works when the trainer is given a WandbLogger and gpus=-1 with no accelerator defined, and without a duplicate wandb.init() call being needed.

Environment

grid.ai

  • CUDA:
    - GPU:
    - Tesla M60
    - Tesla M60
    - available: True
    - version: 10.2
  • Packages:
    - numpy: 1.20.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.1+cu102
    - pytorch-lightning: 1.2.7
    - tqdm: 4.60.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.10
    - version: #1 SMP Tue Mar 16 04:56:19 UTC 2021
@turian turian added the bug (Something isn't working) and help wanted (Open to be worked on) labels Apr 15, 2021
@SeanNaren SeanNaren changed the title grid.ai Error: You must call wandb.init() before wandb.log() [Grid] You must call wandb.init() before wandb.log() Apr 15, 2021
@awaelchli
Member

I ran this in an interactive session on Grid with Lightning 1.2.7:

import os
import torch
from torch.utils.data import Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self, my_param: int = 2):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return {"y": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)

    logger = WandbLogger(project="myproject")

    model = BoringModel()
    trainer = Trainer(
        gpus=-1,
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        logger=logger,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    run()
gridai@ixsession → python repro.py 
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
wandb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.26
wandb: Syncing run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9
wandb: Run `wandb offline` to turn off syncing.

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: MASTER_ADDR environment variable is not defined. Set as localhost
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You are using `accelerator=ddp_spawn` with num_workers=0. For much faster performance, switch to `accelerator=ddp` and set `num_workers>0`
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|                                                                                                                                                                                                                                                                         | 0/2 [00:00<?, ?it/s][W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 131.16it/s, loss=-0.0434, v_num=ayd9w
andb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)                                                                                                                                                                                                                       
wandb: Tracking run with wandb version 0.10.26
wandb: Resuming run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9/files/wandb/run-20210417_133848-9tigayd9
wandb: Run `wandb offline` to turn off syncing.

Epoch 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.47it/s, loss=-0.0434, v_num=ayd9]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: cleaning up ddp environment...
  warnings.warn(*args, **kwargs)

I don't get the error message you are mentioning. Any hints as to what I need to modify?

@turian
Contributor Author

turian commented Apr 17, 2021

Here is an example that tries to log images or audio to wandb and breaks.

The following works (one GPU). Make sure to pip3 install soundfile first:

import os
import torch
from torch.utils.data import Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger
import wandb

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self, my_param: int = 2):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        wandb.log({"examples": [wandb.Audio(torch.rand(32).cpu().numpy(), caption="Nice", sample_rate=32)]})
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return {"y": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)

    logger = WandbLogger(project="myproject")

    model = BoringModel()
    trainer = Trainer(
#        gpus=-1,
        gpus=1,
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        logger=logger,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    wandb.init()
    run()

If you switch to multiple GPUs, it breaks with:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-4-b4487cb8ccc5> in <module>
     73 if __name__ == '__main__':
     74     wandb.init()
---> 75     run()

<ipython-input-4-b4487cb8ccc5> in run()
     67         logger=logger,
     68     )
---> 69     trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
     70     trainer.test(model, test_dataloaders=test_data)
     71 

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    497 
    498         # dispath `start_training` or `start_testing` or `start_predicting`
--> 499         self.dispatch()
    500 
    501         # plugin will finalized fitting (e.g. ddp_spawn will load trained model)

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in dispatch(self)
    544 
    545         else:
--> 546             self.accelerator.start_training(self)
    547 
    548     def train_or_test_or_predict(self):

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py in start_training(self, trainer)
     71 
     72     def start_training(self, trainer):
---> 73         self.training_type_plugin.start_training(trainer)
     74 
     75     def start_testing(self, trainer):

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py in start_training(self, trainer)
    106 
    107     def start_training(self, trainer):
--> 108         mp.spawn(self.new_process, **self.mp_spawn_kwargs)
    109         # reset optimizers, since main process is never used for training and thus does not have a valid optim state
    110         trainer.optimizers = []

/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
    228                ' torch.multiprocessing.start_process(...)' % start_method)
    229         warnings.warn(msg)
--> 230     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')

/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    186 
    187     # Loop on join until it returns True or raises an exception.
--> 188     while not context.join():
    189         pass
    190 

/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    142                     error_index=error_index,
    143                     error_pid=failed_process.pid,
--> 144                     exit_code=exitcode
    145                 )
    146 

ProcessExitedException: process 0 terminated with exit code 1

If you switch to self.log, you get:

TypeError: log() missing 1 required positional argument: 'value'

Basically, I want to log images + audio + matplotlib figures to wandb from within DDP.

@awaelchli
Member

awaelchli commented Apr 17, 2021

Thanks. I tried this and can see where the problem is.
Do the following:

  1. Remove the manual wandb.init call at the bottom
  2. Replace wandb.log({"examples": ... }) with self.logger.experiment.log(...)

This should work :) I can see the audio samples in the wandb run online. The audio doesn't play, but I think that's because this dummy sample is too short.

Furthermore, we currently don't support images, audio, etc. in self.log(), since the API depends on the specific logger. There are efforts to standardize this in #6720.
So for these custom objects, you have to call self.logger.experiment.log (which is basically the same as wandb.log).

EDIT: I tried your code with DDP as well. The fix above applies.
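A minimal sketch of the adjusted training_step, keeping the dummy audio from the repro above (everything else in BoringModel stays the same):

    def training_step(self, batch, batch_idx):
        # Log custom media through the logger's underlying wandb run
        # instead of the global wandb module; the manual wandb.init()
        # at the bottom of the script is removed.
        self.logger.experiment.log(
            {"examples": [wandb.Audio(torch.rand(32).cpu().numpy(), caption="Nice", sample_rate=32)]}
        )
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}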

@turian
Contributor Author

turian commented Apr 17, 2021

@awaelchli thanks, I will try it. Is this documented somewhere?

@awaelchli
Member

We have a small section here:
https://pytorch-lightning.readthedocs.io/en/latest/extensions/logging.html#manual-logging
Open to suggestions if it needs improvement.

@turian
Contributor Author

turian commented Apr 18, 2021

I see. Thanks.

I'm not exactly sure how to make it clearer, but the headline "Manual Logging" is maybe a bit off-base for me. "Manual Logging to a Supported or Custom Logger"?

@yuelei0428

I encountered the same issue and found that it can be fixed simply by moving wandb.init() to the first line of your main function.
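A rough sketch of that ordering, assuming a main() entry point and a placeholder project name (model and dataloaders as in the repro scripts above):

import wandb
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

def main():
    # Initialize wandb first, before the WandbLogger/Trainer are constructed
    # or anything tries to log.
    wandb.init(project="myproject")
    logger = WandbLogger(project="myproject")
    trainer = Trainer(gpus=-1, logger=logger, max_epochs=1)
    # trainer.fit(model, ...) as in the repro scripts above

if __name__ == '__main__':
    main()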

@TaosLezz

You can try:
import wandb
wandb.init(mode='disabled')
