WandbLogger disables cloud checkpointing in Trainer default_root_dir #16195

turian opened this issue Dec 25, 2022 · 1 comment

Labels: bug, checkpointing, logger: wandb


turian commented Dec 25, 2022

Bug description

Cloud checkpoints are cool! But once you use the WandbLogger, no cloud checkpoints (or anything else, really) are saved to trainer.default_root_dir. The model is checkpointed as a Wandb artifact, which is cool, but I also want it in the S3 bucket that trainer.default_root_dir points to.

The reason I want this:

  • wandb checkpoints are good if you want to go back and find something from six months ago.
  • However, they are a pain to use in a back-to-back experimental cycle, compared with just remembering the S3 location and using it. Additionally, they are incompatible with @skypilot-org storage, which is a much cleaner idiom / pattern.

Related bug: #16196. See 'More info' at the bottom of this issue.

There are some related issues, but I haven't found this one specifically.

How to reproduce the bug

Here is a Google Colab that replicates this and a related bug. I share the code for both together because that makes it easier to configure the AWS credentials and see both bugs simultaneously.

Copying and pasting the most important bit (but see the Colab for a full minimal replication):

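from torch.utils.data import DataLoader
from pytorch_lightning import Trainer
# Assumed import path; the Colab may define these two classes inline instead.
from pytorch_lightning.demos.boring_classes import BoringModel, RandomDataset
from pytorch_lightning.loggers import WandbLogger

# BORING_BUCKET (an S3 URL) is defined in the Colab.

def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    # The WandbLogger arguments are missing from this excerpt; see the Colab.
    logger = WandbLogger()

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=f"{BORING_BUCKET}/wandbtest/",
        logger=logger,  # assumed: this argument is missing from the excerpt
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)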


Error messages and logs

There is no error message, but `{BORING_BUCKET}/wandbtest/` (an S3 location) is empty, and the checkpoint is only in Wandb.
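
One possible explanation, not confirmed anywhere in this thread: when no explicit `dirpath` is given, `ModelCheckpoint` appears to derive its directory from the first logger's `save_dir` and only falls back to `trainer.default_root_dir` when that is unset. The function below is a simplified, hypothetical model of that rule. The name `resolve_ckpt_dir`, its parameters, and the bucket URL are illustrative assumptions rather than Lightning's API, and the real behavior may differ between versions.

```python
import posixpath
from typing import Optional


def resolve_ckpt_dir(
    default_root_dir: str,
    dirpath: Optional[str] = None,
    logger_save_dir: Optional[str] = None,
    logger_name: Optional[str] = None,
    logger_version: Optional[str] = None,
) -> str:
    """Hypothetical, simplified model of where checkpoints end up."""
    # 1. An explicit ModelCheckpoint(dirpath=...) always wins.
    if dirpath is not None:
        return dirpath
    # 2. No logger attached: fall back to default_root_dir.
    if logger_name is None:
        return posixpath.join(default_root_dir, "checkpoints")
    # 3. Logger attached: its save_dir takes precedence over default_root_dir.
    base = logger_save_dir if logger_save_dir is not None else default_root_dir
    return posixpath.join(base, logger_name, str(logger_version), "checkpoints")


# Placeholder standing in for f"{BORING_BUCKET}/wandbtest/".
bucket = "s3://my-bucket/wandbtest"

# Without a logger, the checkpoint directory lives under the bucket.
print(resolve_ckpt_dir(bucket))
# With a logger whose save_dir is a local path, the bucket is never consulted.
print(resolve_ckpt_dir(bucket, logger_save_dir=".", logger_name="proj", logger_version="abc123"))
```

If this model is right, the empty bucket follows from the logger supplying a local `save_dir`, so `default_root_dir` is never used for checkpoints.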

Environment

  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.8.16
    • version: #1 SMP Fri Aug 26 08:44:51 UTC 2022

More info

What I really want for Christmas this year, all packaged together:
  • I have a CSVLogger that persists to S3.
  • I have a WandbLogger that saves checkpoints to Wandb.
  • I have an S3 `trainer.default_root_dir` that also saves checkpoints to S3.
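
A possible workaround for the checkpoint part of this wish list, sketched under assumptions: pass a `ModelCheckpoint` with an explicit `dirpath` pointing at the bucket, on the assumption that an explicit `dirpath` takes precedence over any logger-derived directory. The helper names `s3_checkpoint_dir` and `build_trainer`, the project name, and the bucket layout are hypothetical. Writing to S3 this way also assumes an fsspec-compatible S3 backend such as `s3fs` is installed, and this sketch has not been verified against the Colab.

```python
def s3_checkpoint_dir(bucket: str, run_name: str) -> str:
    """Join a bucket URL and a run name without doubling slashes."""
    return f"{bucket.rstrip('/')}/{run_name.strip('/')}/checkpoints"


def build_trainer(bucket: str, run_name: str = "wandbtest"):
    # Imported lazily so the path helper above works without Lightning installed.
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint
    from pytorch_lightning.loggers import WandbLogger

    # log_model=True asks the logger to upload checkpoints as W&B artifacts.
    # The project name is a placeholder.
    logger = WandbLogger(project="boring-wandb", log_model=True)
    # Explicit dirpath: assumed to override the logger-derived checkpoint directory.
    checkpoint = ModelCheckpoint(dirpath=s3_checkpoint_dir(bucket, run_name))
    return Trainer(
        default_root_dir=f"{bucket.rstrip('/')}/{run_name}/",
        logger=logger,
        callbacks=[checkpoint],
        max_epochs=1,
    )
```

Calling `build_trainer(BORING_BUCKET)` and then `trainer.fit(...)` as in the reproduction above would, if the assumption holds, leave checkpoints both in Wandb and under the bucket. It does not address the CSVLogger item.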

cc @awaelchli @morganmcg1 @borisdayma @scottire @parambharat @manangoel99
awaelchli added the bug, checkpointing, and logger: wandb labels and removed the needs triage label on Jan 12, 2023
