WandbLogger cannot be used with 'ddp' #981

Closed
rmrao opened this issue Feb 28, 2020 · 20 comments · Fixed by #1360 or #2029
Labels: bug (Something isn't working), help wanted (Open to be worked on), logger (Related to the Loggers)

Comments

@rmrao (Contributor) commented Feb 28, 2020

🐛 Bug

wandb modifies init such that a child process calling init returns None if the master process has already called init. This seems to cause a bug with ddp and results in rank zero having experiment = None, which crashes the program.

To Reproduce

This can be reproduced with the basic MNIST GPU template: simply add a WandbLogger and pass 'ddp' as the distributed backend.
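
For reference, a minimal sketch of the setup that triggers it (the model class name is hypothetical; any LightningModule from the GPU templates works, and the import path follows the 0.6 layout visible in the traceback below):

from pytorch_lightning import Trainer
from pytorch_lightning.logging import WandbLogger  # pytorch_lightning.loggers in later releases

model = MNISTModel()  # hypothetical: any LightningModule from the GPU templates
logger = WandbLogger(project='mnist-ddp')

trainer = Trainer(
    gpus=2,
    distributed_backend='ddp',
    logger=logger,
)
trainer.fit(model)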

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 331, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 757, in run_pretrain_routine
    self.logger.log_hyperparams(ref_model.hparams)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/logging/base.py", line 14, in wrapped_fn
    fn(self, *args, **kwargs)
  File "/home/rmrao/anaconda3/lib/python3.6/site-packages/pytorch_lightning/logging/wandb.py", line 79, in log_hyperparams
    self.experiment.config.update(params)
AttributeError: 'NoneType' object has no attribute 'config'

This occurs with the latest wandb version and with pytorch-lightning 0.6.

@rmrao added the bug and help wanted labels on Feb 28, 2020
@github-actions (Contributor) commented

Hi! Thanks for your contribution! Great first issue!

@rmrao (Contributor, Author) commented Feb 29, 2020

Some hacky solutions: passing reinit=True to wandb, or this terrible hack:

def init_ddp_connection(self, *args, **kwargs):
    super().init_ddp_connection(*args, **kwargs)

    # Reset the module-level wandb run on rank zero before the logger
    # creates its experiment.
    if torch.distributed.get_rank() == 0:
        import wandb
        wandb.run = None

These both seem to only kind of work and result in multiple independent calls to wandb.init. I think the ideal solution is that the experiment is only ever initialized on rank zero. However, doing this means that wandb cannot be initialized in the master thread at all.

Anything better than this probably requires some changes to the wandb API.

@rmrao (Contributor, Author) commented Feb 29, 2020

Following up slightly: my hacky solution doesn't really work. It's easy enough, though, to get the rank-zero-only solution to work. If this seems like a reasonable approach, let me know and I'll submit a PR.

@Borda (Member) commented Feb 29, 2020

Well, we have observed some issues with wandb earlier in #906. Could you check it?

@rmrao (Contributor, Author) commented Mar 4, 2020

Hmm, I think this is a slightly different issue (I'm running on Ubuntu, so I don't think that's the problem). Pickling also works correctly.

@rmrao (Contributor, Author) commented Mar 4, 2020

I think this particular problem stems from this part of the wandb.init(...) code:

def init(...):
    ...
    # If a thread calls wandb.init() it will get the same Run object as
    # the parent. If a child process with distinct memory space calls
    # wandb.init(), it won't get an error, but it will get a result of
    # None.
    # This check ensures that a child process can safely call wandb.init()
    # after a parent has (only the parent will create the Run object).
    # This doesn't protect against the case where the parent doesn't call
    # wandb.init but two children do.
    if run or os.getenv(env.INITED):
        return run

Child processes end up getting None for the wandb run object, which causes logging to fail. There are probably two reasonable and complementary solutions:

  1. The main thread should avoid creating a wandb experiment unless absolutely necessary.

Right now, this is the only part of the logging code that the parent thread calls (I assume it's called when pickling):

def __getstate__(self):
    state = self.__dict__.copy()
    # cannot be pickled
    state['_experiment'] = None
    # args needed to reload correct experiment
    state['_id'] = self.experiment.id
    return state

If this is changed to:

def __getstate__(self):
    state = self.__dict__.copy()
    # args needed to reload correct experiment
    if self._experiment is not None:
        state['_id'] = self._experiment.id
    else:
        state['_id'] = None

    # cannot be pickled
    state['_experiment'] = None
    return state

This ensures that, unless the user explicitly logs something or creates the wandb experiment first, the main thread will not try to create an experiment. Since subsequent logging/saving code is wrapped with the @rank_zero_only decorator, this will generally solve the issue in the base case.

It's also possible that these properties are called by the master process. Ideally they would be wrapped so they do not create the experiment unless it has already been created (i.e. the experiment should only be created by a function wrapped with the @rank_zero_only decorator); a sketch combining this with the re-init idea is shown after the second item below.

  2. If the main thread has created an experiment, rank zero should be passed the re-init argument.

wandb does allow you to reinitialize the experiment. I tried to play around with this a little bit and got some errors, but in theory adding this:

wandb.init(..., reinit=dist.is_available() and dist.is_initialized() and dist.get_rank() == 0)

should force a re-initialization when wandb is already initialized for rank zero.
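
For illustration, a minimal sketch of how both ideas could fit together inside the logger (a sketch only, not the actual WandbLogger implementation; the _experiment and _name attributes follow the snippets above):

import torch.distributed as dist
import wandb

@property
def experiment(self):
    # Lazily create the run so that code paths reached by the master process
    # (pickling, name/version lookups) never trigger wandb.init() on their own.
    if self._experiment is None:
        # If the master process already called wandb.init(), force rank zero to
        # start its own run instead of receiving None back.
        reinit = dist.is_available() and dist.is_initialized() and dist.get_rank() == 0
        self._experiment = wandb.init(name=self._name, reinit=reinit)
    return self._experiment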

@vanpelt (Contributor) commented Mar 14, 2020

@rmrao I just made a PR. It should be safe to always set reinit=True. I wasn't able to reproduce the specific environment you mentioned. Can you share some code or a Colab with me to be sure this works ok?

@rmrao (Contributor, Author) commented Mar 14, 2020

I think I had tried setting reinit without fixing other calls to avoid creating the experiment unless necessary. I don’t think this will produce errors.

I haven’t experimented with reinit significantly and I’m not sure what it does in terms of creating a new run in wandb, etc. but I don’t think I have a setup that explicitly causes an error with reinit.

@Borda (Member) commented Mar 15, 2020

It would also be good to add a test for the logger in ddp mode.
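
Something along these lines might be a starting point (a sketch under assumptions: it only checks that a Trainer carrying a WandbLogger survives pickling, which is what ddp spawning requires, rather than running a full multi-process job; the import path and the offline argument may differ between releases):

import pickle

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger  # pytorch_lightning.logging on 0.6

def test_wandb_logger_pickles_for_ddp():
    # ddp spawns workers by pickling the Trainer, so the logger must pickle
    # without forcing wandb.init() in the parent process.
    logger = WandbLogger(name='ddp-test', offline=True)
    trainer = Trainer(max_epochs=1, logger=logger)
    pickle.loads(pickle.dumps(trainer))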

@VitorGuizilini (Contributor) commented

Has there been any progress on this? What is the fix to enable WandbLogger on 'ddp'?

@julianmack commented Apr 28, 2020

Is there any update on this? I'm also hoping to use wandb with ddp. I've upgraded to the most recent release, 0.7.5, but the problem persists.

@williamFalcon (Contributor) commented

@vanpelt @shawnlewis

@vanpelt (Contributor) commented Apr 29, 2020

Can someone provide a small script to reproduce this? We can work on a fix and include it in the next release.

@stas6626 (Contributor) commented May 1, 2020

@julianmack try 0.7.3

@parasj commented May 7, 2020

I found a similar bug with the WandB logger, where the checkpointing callback expects the logger.name property not to be None:

https://github.com/PyTorchLightning/pytorch-lightning/blob/3a642601e84c3abf1f1b438f9acc932a1f150f7f/pytorch_lightning/trainer/callback_config.py#L54-L59

However, WandB will set the logger.name and logger.version properties to None if experiments are not being used.

https://github.com/PyTorchLightning/pytorch-lightning/blob/3a642601e84c3abf1f1b438f9acc932a1f150f7f/pytorch_lightning/loggers/wandb.py#L124-L133

@parasj commented May 7, 2020

I worked around this issue by adding the following line to the top of my training job:

setattr(WandbLogger, 'name', property(lambda self: self._name))

@sooheon commented May 16, 2020

Just to clarify @parasj's solution: presumably you also have to pass some name as a keyword argument when constructing the logger.

Unfortunately, this means you're now responsible for generating unique names for subsequent runs.
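
Putting the two together might look something like this (a sketch; the project name and the timestamp-based naming scheme are just placeholders):

import time

from pytorch_lightning.loggers import WandbLogger

# Override the property so the checkpoint callback sees the constructor name
# instead of None when no experiment has been created yet.
setattr(WandbLogger, 'name', property(lambda self: self._name))

# You now have to supply a unique name yourself for each run.
logger = WandbLogger(name='run-{}'.format(int(time.time())), project='my-project')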

@kaoutar55 commented

I am getting a similar issue when using wandb with DDP and the Hugging Face implementation of BERT. I am running on one node with multiple GPUs. When I switched to torch.distributed, I started getting the error below (the same error is raised in each worker process, so the raw output is interleaved). I am not sure if this is related to what I see in this thread.

File "/home/kelmaghr/benchmarks/transformers/src/transformers/trainer_callback.py", line 387, in call_event
File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
wandb.log({})
File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
**kwargs, raise wandb.Error("You must call wandb.init() before {}()".format(name))**kwargs,

wandb.log({})
File "/home/kelmaghr/benchmarks/transformers/src/transformers/integrations.py", line 427, in on_train_end
File "/home/kelmaghr/benchmarks/transformers/src/transformers/integrations.py", line 427, in on_train_end
raise wandb.Error("You must call wandb.init() before {}()".format(name))wandb.errors.error File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper

. wandb.log({})Error**kwargs,
wandb.errors.error:
You must call wandb.init() before wandb.log() File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
.raise wandb.Error("You must call wandb.init() before {}()".format(name)) File "/home/kelmaghr/benchmarks/transformers/src/transformers/integrations.py", line 427, in on_train_end

Error
: You must call wandb.init() before wandb.log()
wandb.errors.error. Errorwandb.log({}):
You must call wandb.init() before wandb.log()wandb.log({}) File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
raise wandb.Error("You must call wandb.init() before {}()".format(name))

File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
wandb.errors.error.Error: You must call wandb.init() before wandb.log()
raise wandb.Error("You must call wandb.init() before {}()".format(name))
raise wandb.Error("You must call wandb.init() before {}()".format(name))wandb.log({})

@turian (Contributor) commented Apr 15, 2021

I have the same error, trying to get WandbLogger working on grid.ai.

@roman-bushuiev commented

It is solved here #13166.
