WandbLogger cannot be used with 'ddp' #981

Closed
rmrao opened this issue Feb 28, 2020 · 20 comments · Fixed by #1360 or #2029
Labels: bug (Something isn't working), help wanted (Open to be worked on), logger (Related to the Loggers)

Comments

@rmrao (Contributor) commented Feb 28, 2020

🐛 Bug

wandb modifies init such that a child process calling init returns None if the master process has already called init. This seems to cause a bug with ddp and results in rank zero having experiment = None, which crashes the program.

To Reproduce

This can be reproduced with the basic MNIST GPU template: simply add a WandbLogger and pass 'ddp' as the distributed backend.
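
For reference, a minimal sketch of the setup that triggers it (the model class name is hypothetical; any LightningModule from the GPU templates works, and the import path follows the 0.6 layout visible in the traceback below):

from pytorch_lightning import Trainer
from pytorch_lightning.logging import WandbLogger  # pytorch_lightning.loggers in later releases

model = MNISTModel()  # hypothetical: any LightningModule from the GPU templates
logger = WandbLogger(project='mnist-ddp')

trainer = Trainer(
    gpus=2,
    distributed_backend='ddp',
    logger=logger,
)
trainer.fit(model)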

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 331, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 757, in run_pretrain_routine
    self.logger.log_hyperparams(ref_model.hparams)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/logging/base.py", line 14, in wrapped_fn
    fn(self, *args, **kwargs)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/logging/wandb.py", line 79, in log_hyperparams
    self.experiment.config.update(params)
AttributeError: 'NoneType' object has no attribute 'config'

This occurs with the latest wandb version and with pytorch-lightning 0.6.

@rmrao added the bug and help wanted labels on Feb 28, 2020
@github-actions (Contributor) commented

Hi! Thanks for your contribution! Great first issue!

@rmrao (Contributor, Author) commented Feb 29, 2020

Some hacky solutions: passing reinit=True to wandb, or this terrible hack:

def init_ddp_connection(self, *args, **kwargs):
    super().init_ddp_connection(*args, **kwargs)

    # Reset the module-level wandb run on rank zero before the logger
    # creates its experiment.
    if torch.distributed.get_rank() == 0:
        import wandb
        wandb.run = None

These both seem to only kind of work and result in multiple independent calls to wandb.init. I think the ideal solution is that the experiment is only ever initialized on rank zero. However, doing this means that wandb cannot be initialized in the master thread at all.

Anything better than this probably requires some changes to the wandb API.

@rmrao (Contributor, Author) commented Feb 29, 2020

Following up slightly: my hacky solution doesn't really work. It's easy enough, though, to get the rank-zero-only solution to work. If this seems like a reasonable approach, let me know and I'll submit a PR.

@Borda (Member) commented Feb 29, 2020

Well, we have observed some issues with wandb earlier in #906. Could you check it?

@rmrao (Contributor, Author) commented Mar 4, 2020

Hmm, I think this is a slightly different issue (I'm running on Ubuntu, so I don't think that's the problem). Pickling also works correctly.

@rmrao (Contributor, Author) commented Mar 4, 2020

I think this particular problem stems from this part of the wandb.init(...) code:

def init(...):
    ...
    # If a thread calls wandb.init() it will get the same Run object as
    # the parent. If a child process with distinct memory space calls
    # wandb.init(), it won't get an error, but it will get a result of
    # None.
    # This check ensures that a child process can safely call wandb.init()
    # after a parent has (only the parent will create the Run object).
    # This doesn't protect against the case where the parent doesn't call
    # wandb.init but two children do.
    if run or os.getenv(env.INITED):
        return run

Child processes end up getting None for the wandb run object, which causes logging to fail. There are probably two reasonable and complementary solutions:

  1. The main thread should avoid creating a wandb experiment unless absolutely necessary.

Right now, this is the only part of the logging code that the parent thread calls (I assume it's called when pickling):

def __getstate__(self):
    state = self.__dict__.copy()
    # cannot be pickled
    state['_experiment'] = None
    # args needed to reload correct experiment
    state['_id'] = self.experiment.id
    return state

If this is changed to:

def __getstate__(self):
    state = self.__dict__.copy()
    # args needed to reload correct experiment
    if self._experiment is not None:
        state['_id'] = self._experiment.id
    else:
        state['_id'] = None

    # cannot be pickled
    state['_experiment'] = None
    return state

This ensures that, unless the user explicitly logs something or creates the wandb experiment first, the main thread will not try to create an experiment. Since subsequent logging/saving code is wrapped with the @rank_zero_only decorator, this will generally solve the issue in the base case.

It's also possible that these properties are called by the master process. Ideally they would be wrapped so they do not create the experiment unless it has already been created (i.e. the experiment should only be created by a function wrapped with the @rank_zero_only decorator); a sketch combining this with the re-init idea is shown after the second item below.

  2. If the main thread has created an experiment, rank zero should be passed the re-init argument.

wandb does allow you to reinitialize the experiment. I tried to play around with this a little bit and got some errors, but in theory adding this:

wandb.init(..., reinit=dist.is_available() and dist.is_initialized() and dist.get_rank() == 0)

should force a re-initialization when wandb is already initialized for rank zero.
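
For illustration, a minimal sketch of how both ideas could fit together inside the logger (a sketch only, not the actual WandbLogger implementation; the _experiment and _name attributes follow the snippets above):

import torch.distributed as dist
import wandb

@property
def experiment(self):
    # Lazily create the run so that code paths reached by the master process
    # (pickling, name/version lookups) never trigger wandb.init() on their own.
    if self._experiment is None:
        # If the master process already called wandb.init(), force rank zero to
        # start its own run instead of receiving None back.
        reinit = dist.is_available() and dist.is_initialized() and dist.get_rank() == 0
        self._experiment = wandb.init(name=self._name, reinit=reinit)
    return self._experiment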

@vanpelt (Contributor) commented Mar 14, 2020

@rmrao I just made a PR. It should be safe to always set reinit=True. I wasn't able to reproduce the specific environment you mentioned. Can you share some code or a Colab with me to be sure this works ok?

@rmrao (Contributor, Author) commented Mar 14, 2020

I think I had tried setting reinit without fixing other calls to avoid creating the experiment unless necessary. I don’t think this will produce errors.

I haven’t experimented with reinit significantly and I’m not sure what it does in terms of creating a new run in wandb, etc. but I don’t think I have a setup that explicitly causes an error with reinit.

@Borda (Member) commented Mar 15, 2020

It would also be good to add a test for the logger in ddp mode.
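
Something along these lines might be a starting point (a sketch under assumptions: it only checks that a Trainer carrying a WandbLogger survives pickling, which is what ddp spawning requires, rather than running a full multi-process job; the import path and the offline argument may differ between releases):

import pickle

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger  # pytorch_lightning.logging on 0.6

def test_wandb_logger_pickles_for_ddp():
    # ddp spawns workers by pickling the Trainer, so the logger must pickle
    # without forcing wandb.init() in the parent process.
    logger = WandbLogger(name='ddp-test', offline=True)
    trainer = Trainer(max_epochs=1, logger=logger)
    pickle.loads(pickle.dumps(trainer))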

@VitorGuizilini (Contributor) commented

Has there been any progress on this? What is the fix to enable WandbLogger on 'ddp'?

@julianmack commented Apr 28, 2020

Is there any update on this? I'm also hoping to use wandb with ddp. I've upgraded to the most recent release, 0.7.5, but the problem persists.

@williamFalcon (Contributor) commented

@vanpelt @shawnlewis

@vanpelt (Contributor) commented Apr 29, 2020

Can someone provide a small script to reproduce this? We can work on a fix and include it in the next release.

@stas6626 (Contributor) commented May 1, 2020

@julianmack try 0.7.3

@parasj commented May 7, 2020

I found a similar bug with the WandB logger, where the checkpointing callback expects the logger.name property not to be None:

https://github.com/PyTorchLightning/pytorch-lightning/blob/3a642601e84c3abf1f1b438f9acc932a1f150f7f/pytorch_lightning/trainer/callback_config.py#L54-L59

However, WandB will set the logger.name and logger.version properties to None if experiments are not being used.

https://github.com/PyTorchLightning/pytorch-lightning/blob/3a642601e84c3abf1f1b438f9acc932a1f150f7f/pytorch_lightning/loggers/wandb.py#L124-L133

@parasj commented May 7, 2020

I worked around this issue by adding the following line to the top of my training job:

setattr(WandbLogger, 'name', property(lambda self: self._name))

@sooheon commented May 16, 2020

Just to clarify @parasj's solution: presumably you also have to pass some name as a keyword argument when constructing the logger.

Unfortunately, this means you're now responsible for generating unique names for subsequent runs.
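
Putting the two together might look something like this (a sketch; the project name and the timestamp-based naming scheme are just placeholders):

import time

from pytorch_lightning.loggers import WandbLogger

# Override the property so the checkpoint callback sees the constructor name
# instead of None when no experiment has been created yet.
setattr(WandbLogger, 'name', property(lambda self: self._name))

# You now have to supply a unique name yourself for each run.
logger = WandbLogger(name='run-{}'.format(int(time.time())), project='my-project')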

@kaoutar55 commented

I am getting a similar issue when using wandb with DDP and the Hugging Face implementation of BERT. I am running on one node with multiple GPUs. When I switched to torch.distributed, I started getting the error below (the same error is raised in each worker process, so the raw output is interleaved). I am not sure if this is related to what I see in this thread.

File "/home/kelmaghr/benchmarks/transformers/src/transformers/trainer_callback.py", line 387, in call_event
File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
wandb.log({})
File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
**kwargs, raise wandb.Error("You must call wandb.init() before {}()".format(name))**kwargs,

wandb.log({})
File "/home/kelmaghr/benchmarks/transformers/src/transformers/integrations.py", line 427, in on_train_end
File "/home/kelmaghr/benchmarks/transformers/src/transformers/integrations.py", line 427, in on_train_end
raise wandb.Error("You must call wandb.init() before {}()".format(name))wandb.errors.error File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper

. wandb.log({})Error**kwargs,
wandb.errors.error:
You must call wandb.init() before wandb.log() File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
.raise wandb.Error("You must call wandb.init() before {}()".format(name)) File "/home/kelmaghr/benchmarks/transformers/src/transformers/integrations.py", line 427, in on_train_end

Error
: You must call wandb.init() before wandb.log()
wandb.errors.error. Errorwandb.log({}):
You must call wandb.init() before wandb.log()wandb.log({}) File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
raise wandb.Error("You must call wandb.init() before {}()".format(name))

File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
wandb.errors.error.Error: You must call wandb.init() before wandb.log()
raise wandb.Error("You must call wandb.init() before {}()".format(name))
raise wandb.Error("You must call wandb.init() before {}()".format(name))wandb.log({})

@turian (Contributor) commented Apr 15, 2021

I have the same error, trying to get WandbLogger working on grid.ai.

@roman-bushuiev commented

It is solved here #13166.
