### Bug description
Cloud checkpoints are cool! But once you use the `WandbLogger`, no cloud checkpoints (or anything else, really) are saved to `trainer.default_root_dir`. The model is checkpointed as a Wandb artifact, which is cool, but I also want it in `trainer.default_root_dir`'s S3 bucket.
The reason I want this:
* Wandb checkpoints are good if you want to go back and find something from six months ago.
* However, they are a pain to use if you are in a back-to-back experimental cycle, compared with just remembering the S3 location and using it. Additionally, they are incompatible with @skypilot-org storage, which is a much cleaner idiom / pattern.
Related bug: #16196. See 'More info' at the bottom of this issue.
### How to reproduce the bug
Here is a Google Colab that replicates this and a related bug. I share the code for both because it's easier to configure the AWS credentials and see both bugs simultaneously.
Copying and pasting the most important bit (but see the colab for a full minimal replication):
### Error messages and logs
There is no error message, but `{BORING_BUCKET}/wandbtest/` (an S3 location) is empty, and the checkpoint is only in Wandb.
### Environment
### More info
What I really want for Christmas this year, all packaged together:
* I have a CSVLogger that persists to s3.
* I have a WandbLogger that saves checkpoints to Wandb.
* I have an S3 `trainer.default_root_dir` that also saves checkpoints to s3.
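For concreteness, the three wishes above might be wired up roughly as follows. This is an untested sketch of the desired setup, not a confirmed workaround: the bucket URL, the `checkpoint_dir` and `build_trainer` helper names, and the `wandbtest` project name are all made up for illustration, and passing an explicit `dirpath` to `ModelCheckpoint` is only an assumption about how to pin the checkpoint location regardless of which loggers are attached.

```python
import posixpath

# Hypothetical bucket URL; the issue only refers to it as BORING_BUCKET.
BORING_BUCKET = "s3://my-bucket"


def checkpoint_dir(root_dir: str) -> str:
    """Build an explicit checkpoint directory under the given root URL."""
    return posixpath.join(root_dir.rstrip("/"), "checkpoints")


def build_trainer(root_dir: str):
    """Assemble a Trainer with both loggers and an explicit checkpoint path."""
    # Imported lazily so the path helper above works without Lightning installed.
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint
    from pytorch_lightning.loggers import CSVLogger, WandbLogger

    # Wish 1: CSV logs persisted under the S3 root.
    csv_logger = CSVLogger(save_dir=root_dir)
    # Wish 2: checkpoints uploaded to Wandb as artifacts.
    wandb_logger = WandbLogger(project="wandbtest", log_model=True)
    # Wish 3: checkpoints also written under the S3 root. The explicit
    # dirpath is an assumption, not a verified fix for this issue.
    checkpoint = ModelCheckpoint(dirpath=checkpoint_dir(root_dir))

    return Trainer(
        default_root_dir=root_dir,
        logger=[csv_logger, wandb_logger],
        callbacks=[checkpoint],
        max_epochs=1,
    )
```

Calling `build_trainer(f"{BORING_BUCKET}/wandbtest/")` would then be the starting point for checking whether checkpoints land in both places.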
cc @awaelchli @morganmcg1 @borisdayma @scottire @parambharat @manangoel99
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
There are some related issues:
#14325
#5935
#11769
https://github.com/Lightning-AI/lightning/issues/15539
#2318
#2161
but I haven't found this specifically.