WandbLogger disables cloud checkpointing in Trainer default_root_dir #16195

Open
turian opened this issue Dec 25, 2022 · 1 comment
Labels
bug Something isn't working checkpointing Related to checkpointing logger: wandb Weights & Biases

Comments


turian commented Dec 25, 2022

Bug description

Cloud checkpoints are cool! But once you use the WandbLogger, no cloud checkpoints (or anything else, really) are saved to trainer.default_root_dir. The model is checkpointed as a Wandb artifact, which is cool, but I also want it in trainer.default_root_dir's S3 bucket.

The reason I want this:

  • wandb checkpoints are good if you want to go back and find something from six months ago.
  • However, they are a pain to use if you are in a back-to-back experimental cycle, compared with just remembering the S3 location and using it. Additionally, they are incompatible with @skypilot-org storage, which is a much cleaner idiom / pattern.

Related bug: #16196. See 'More info' at the bottom of this issue.

There are some related issues:
#14325
#5935
#11769
https://github.com/Lightning-AI/lightning/issues/15539
#2318
#2161
but I haven't found this specifically.

How to reproduce the bug

Here is a Google Colab that replicates this and a related bug. I share the code for both because it's easier to configure the AWS credentials and see both bugs simultaneously.

Copying and pasting the most important bit (but see the colab for a full minimal replication):

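# Imports assumed for this excerpt; the colab has the full setup.
from pytorch_lightning import Trainer
from pytorch_lightning.demos.boring_classes import BoringModel, RandomDataset
from pytorch_lightning.loggers import WandbLogger
from torch.utils.data import DataLoader

# BORING_BUCKET (an S3 location) is defined in the colab.

def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    # log_model="all" uploads every checkpoint to Wandb as an artifact.
    logger = WandbLogger(
        project="boringbug",
        log_model="all",
    )

    model = BoringModel()
    trainer = Trainer(
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        logger=logger,
        # Expected checkpoint location; it stays empty when WandbLogger is used.
        default_root_dir=f"{BORING_BUCKET}/wandbtest/",
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)

run()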


Error messages and logs

There is no error message, but `{BORING_BUCKET}/wandbtest/` (an S3 location) is empty, and the checkpoint is only in Wandb.
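
One way to confirm the empty bucket is to list the files under the root directory after training. The following is a minimal sketch, assuming `fsspec` and `s3fs` (both listed in the environment below) and valid AWS credentials; the helper names `checkpoint_files` and `list_checkpoints` are hypothetical and not part of Lightning.

```python
from typing import Iterable, List


def checkpoint_files(paths: Iterable[str]) -> List[str]:
    """Keep only paths that look like Lightning checkpoints (*.ckpt)."""
    return sorted(p for p in paths if p.endswith(".ckpt"))


def list_checkpoints(root: str) -> List[str]:
    """List checkpoint files under `root`, which may be a local path or an s3:// URI."""
    # Imported lazily so the filter above works without fsspec installed.
    from fsspec.core import url_to_fs

    fs, path = url_to_fs(root)
    # find() walks the prefix recursively and returns file paths.
    return checkpoint_files(fs.find(path))
```

With the reproduction above, `list_checkpoints(f"{BORING_BUCKET}/wandbtest/")` would be expected to return an empty list while the bug is present.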

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 11.6
  • Lightning:
    • lightning-utilities: 0.5.0
    • pytorch-lightning: 1.8.6
    • torch: 1.13.0+cu116
    • torchaudio: 0.13.0+cu116
    • torchmetrics: 0.11.0
    • torchsummary: 1.5.1
    • torchtext: 0.14.0
    • torchvision: 0.14.0+cu116
  • Packages:
    • absl-py: 1.3.0
    • aeppl: 0.0.33
    • aesara: 2.7.9
    • aiobotocore: 2.4.2
    • aiohttp: 3.8.3
    • aioitertools: 0.11.0
    • aiosignal: 1.3.1
    • alabaster: 0.7.12
    • albumentations: 1.2.1
    • altair: 4.2.0
    • appdirs: 1.4.4
    • arviz: 0.12.1
    • astor: 0.8.1
    • astropy: 4.3.1
    • astunparse: 1.6.3
    • async-timeout: 4.0.2
    • atari-py: 0.2.9
    • atomicwrites: 1.4.1
    • attrs: 22.1.0
    • audioread: 3.0.0
    • autograd: 1.5
    • awscli: 1.25.60
    • babel: 2.11.0
    • backcall: 0.2.0
    • beautifulsoup4: 4.6.3
    • bleach: 5.0.1
    • blis: 0.7.9
    • bokeh: 2.3.3
    • boto3: 1.24.59
    • botocore: 1.27.59
    • branca: 0.6.0
    • bs4: 0.0.1
    • cachecontrol: 0.12.11
    • cachetools: 5.2.0
    • catalogue: 2.0.8
    • certifi: 2022.12.7
    • cffi: 1.15.1
    • cftime: 1.6.2
    • chardet: 3.0.4
    • charset-normalizer: 2.1.1
    • click: 7.1.2
    • clikit: 0.6.2
    • cloudpickle: 1.5.0
    • cmake: 3.22.6
    • cmdstanpy: 1.0.8
    • colorama: 0.3.7
    • colorcet: 3.0.1
    • colorlover: 0.3.0
    • community: 1.0.0b1
    • confection: 0.0.3
    • cons: 0.4.5
    • contextlib2: 0.5.5
    • convertdate: 2.4.0
    • crashtest: 0.3.1
    • crcmod: 1.7
    • cryptography: 38.0.4
    • cufflinks: 0.17.3
    • cupy-cuda11x: 11.0.0
    • cvxopt: 1.3.0
    • cvxpy: 1.2.2
    • cycler: 0.11.0
    • cymem: 2.0.7
    • cython: 0.29.32
    • daft: 0.0.4
    • dask: 2022.2.1
    • datascience: 0.17.5
    • db-dtypes: 1.0.5
    • debugpy: 1.0.0
    • decorator: 4.4.2
    • defusedxml: 0.7.1
    • descartes: 1.1.0
    • dill: 0.3.6
    • distributed: 2022.2.1
    • dlib: 19.24.0
    • dm-tree: 0.1.7
    • dnspython: 2.2.1
    • docker-pycreds: 0.4.0
    • docutils: 0.16
    • dopamine-rl: 1.0.5
    • earthengine-api: 0.1.335
    • easydict: 1.10
    • ecos: 2.0.10
    • editdistance: 0.5.3
    • en-core-web-sm: 3.4.1
    • entrypoints: 0.4
    • ephem: 4.1.3
    • et-xmlfile: 1.1.0
    • etils: 0.9.0
    • etuples: 0.3.8
    • fa2: 0.3.5
    • fastai: 2.7.10
    • fastcore: 1.5.27
    • fastdownload: 0.0.7
    • fastdtw: 0.3.4
    • fastjsonschema: 2.16.2
    • fastprogress: 1.0.3
    • fastrlock: 0.8.1
    • feather-format: 0.4.1
    • filelock: 3.8.2
    • firebase-admin: 5.3.0
    • fix-yahoo-finance: 0.0.22
    • flask: 1.1.4
    • flatbuffers: 1.12
    • folium: 0.12.1.post1
    • frozenlist: 1.3.3
    • fsspec: 2022.11.0
    • future: 0.16.0
    • gast: 0.4.0
    • gdal: 2.2.2
    • gdown: 4.4.0
    • gensim: 3.6.0
    • geographiclib: 1.52
    • geopy: 1.17.0
    • gin-config: 0.5.0
    • gitdb: 4.0.10
    • gitpython: 3.1.29
    • glob2: 0.7
    • google: 2.0.3
    • google-api-core: 2.8.2
    • google-api-python-client: 1.12.11
    • google-auth: 2.15.0
    • google-auth-httplib2: 0.0.4
    • google-auth-oauthlib: 0.4.6
    • google-cloud-bigquery: 3.3.6
    • google-cloud-bigquery-storage: 2.16.2
    • google-cloud-core: 2.3.2
    • google-cloud-datastore: 2.9.0
    • google-cloud-firestore: 2.7.2
    • google-cloud-language: 2.6.1
    • google-cloud-storage: 2.5.0
    • google-cloud-translate: 3.8.4
    • google-colab: 1.0.0
    • google-crc32c: 1.5.0
    • google-pasta: 0.2.0
    • google-resumable-media: 2.4.0
    • googleapis-common-protos: 1.57.0
    • googledrivedownloader: 0.4
    • graphviz: 0.10.1
    • greenlet: 2.0.1
    • grpcio: 1.51.1
    • grpcio-status: 1.48.2
    • gspread: 3.4.2
    • gspread-dataframe: 3.0.8
    • gym: 0.25.2
    • gym-notices: 0.0.8
    • h5py: 3.1.0
    • heapdict: 1.0.1
    • hijri-converter: 2.2.4
    • holidays: 0.17.2
    • holoviews: 1.14.9
    • html5lib: 1.0.1
    • httpimport: 0.5.18
    • httplib2: 0.17.4
    • httpstan: 4.6.1
    • humanize: 0.5.1
    • hyperopt: 0.1.2
    • idna: 2.10
    • imageio: 2.9.0
    • imagesize: 1.4.1
    • imbalanced-learn: 0.8.1
    • imblearn: 0.0
    • imgaug: 0.4.0
    • importlib-metadata: 5.1.0
    • importlib-resources: 5.10.1
    • imutils: 0.5.4
    • inflect: 2.1.0
    • intel-openmp: 2022.2.1
    • intervaltree: 2.1.0
    • ipykernel: 5.3.4
    • ipython: 7.9.0
    • ipython-genutils: 0.2.0
    • ipython-sql: 0.3.9
    • ipywidgets: 7.7.1
    • itsdangerous: 1.1.0
    • jax: 0.3.25
    • jaxlib: 0.3.25+cuda11.cudnn805
    • jieba: 0.42.1
    • jinja2: 2.11.3
    • jmespath: 0.9.3
    • joblib: 1.2.0
    • jpeg4py: 0.1.4
    • jsonschema: 4.3.3
    • jupyter-client: 6.1.12
    • jupyter-console: 6.1.0
    • jupyter-core: 5.1.0
    • jupyterlab-widgets: 3.0.4
    • kaggle: 1.5.12
    • kapre: 0.3.7
    • keras: 2.9.0
    • keras-preprocessing: 1.1.2
    • keras-vis: 0.4.1
    • kiwisolver: 1.4.4
    • korean-lunar-calendar: 0.3.1
    • langcodes: 3.3.0
    • libclang: 14.0.6
    • librosa: 0.8.1
    • lightgbm: 2.2.3
    • lightning-utilities: 0.5.0
    • llvmlite: 0.39.1
    • lmdb: 0.99
    • locket: 1.0.0
    • logical-unification: 0.4.5
    • lunarcalendar: 0.0.9
    • lxml: 4.9.2
    • markdown: 3.4.1
    • markupsafe: 2.0.1
    • marshmallow: 3.19.0
    • matplotlib: 3.2.2
    • matplotlib-venn: 0.11.7
    • minikanren: 1.0.3
    • missingno: 0.5.1
    • mistune: 0.8.4
    • mizani: 0.7.3
    • mkl: 2019.0
    • mlxtend: 0.14.0
    • more-itertools: 9.0.0
    • moviepy: 0.2.3.5
    • mpmath: 1.2.1
    • msgpack: 1.0.4
    • multidict: 6.0.3
    • multipledispatch: 0.6.0
    • multitasking: 0.0.11
    • murmurhash: 1.0.9
    • music21: 5.5.0
    • natsort: 5.5.0
    • nbconvert: 5.6.1
    • nbformat: 5.7.0
    • netcdf4: 1.6.2
    • networkx: 2.8.8
    • nibabel: 3.0.2
    • nltk: 3.7
    • notebook: 5.7.16
    • numba: 0.56.4
    • numexpr: 2.8.4
    • numpy: 1.21.6
    • oauth2client: 4.1.3
    • oauthlib: 3.2.2
    • okgrade: 0.4.3
    • olefile: 0.45.1
    • opencv-contrib-python: 4.6.0.66
    • opencv-python: 4.6.0.66
    • opencv-python-headless: 4.6.0.66
    • openpyxl: 3.0.10
    • opt-einsum: 3.3.0
    • osqp: 0.6.2.post0
    • packaging: 21.3
    • palettable: 3.3.0
    • pandas: 1.3.5
    • pandas-datareader: 0.9.0
    • pandas-gbq: 0.17.9
    • pandas-profiling: 1.4.1
    • pandocfilters: 1.5.0
    • panel: 0.12.1
    • param: 1.12.3
    • parso: 0.8.3
    • partd: 1.3.0
    • pastel: 0.2.1
    • pathlib: 1.0.1
    • pathtools: 0.1.2
    • pathy: 0.10.1
    • patsy: 0.5.3
    • pep517: 0.13.0
    • pexpect: 4.8.0
    • pickleshare: 0.7.5
    • pillow: 7.1.2
    • pip: 21.1.3
    • pip-tools: 6.2.0
    • platformdirs: 2.6.0
    • plotly: 5.5.0
    • plotnine: 0.8.0
    • pluggy: 0.7.1
    • pooch: 1.6.0
    • portpicker: 1.3.9
    • prefetch-generator: 1.0.3
    • preshed: 3.0.8
    • prettytable: 3.5.0
    • progressbar2: 3.38.0
    • prometheus-client: 0.15.0
    • promise: 2.3
    • prompt-toolkit: 2.0.10
    • prophet: 1.1.1
    • proto-plus: 1.22.1
    • protobuf: 3.19.6
    • psutil: 5.4.8
    • psycopg2: 2.9.5
    • ptyprocess: 0.7.0
    • py: 1.11.0
    • pyarrow: 9.0.0
    • pyasn1: 0.4.8
    • pyasn1-modules: 0.2.8
    • pycocotools: 2.0.6
    • pycparser: 2.21
    • pyct: 0.4.8
    • pydantic: 1.10.2
    • pydata-google-auth: 1.4.0
    • pydot: 1.3.0
    • pydot-ng: 2.0.0
    • pydotplus: 2.0.2
    • pydrive: 1.3.1
    • pyemd: 0.5.1
    • pyerfa: 2.0.0.1
    • pygments: 2.6.1
    • pygobject: 3.26.1
    • pylev: 1.4.0
    • pymc: 4.1.4
    • pymeeus: 0.5.12
    • pymongo: 4.3.3
    • pymystem3: 0.2.0
    • pyopengl: 3.1.6
    • pyopenssl: 22.1.0
    • pyparsing: 3.0.9
    • pyrsistent: 0.19.2
    • pysimdjson: 3.2.0
    • pysndfile: 1.3.8
    • pysocks: 1.7.1
    • pystan: 3.3.0
    • pytest: 3.6.4
    • python-apt: 0.0.0
    • python-dateutil: 2.8.2
    • python-louvain: 0.16
    • python-slugify: 7.0.0
    • python-utils: 3.4.5
    • pytorch-lightning: 1.8.6
    • pytz: 2022.6
    • pyviz-comms: 2.2.1
    • pywavelets: 1.4.1
    • pyyaml: 5.4.1
    • pyzmq: 23.2.1
    • qdldl: 0.1.5.post2
    • qudida: 0.0.4
    • regex: 2022.6.2
    • requests: 2.23.0
    • requests-oauthlib: 1.3.1
    • resampy: 0.4.2
    • roman: 2.0.0
    • rpy2: 3.5.5
    • rsa: 4.7.2
    • s3fs: 2022.11.0
    • s3transfer: 0.6.0
    • scikit-image: 0.18.3
    • scikit-learn: 1.0.2
    • scipy: 1.7.3
    • screen-resolution-extra: 0.0.0
    • scs: 3.2.2
    • seaborn: 0.11.2
    • send2trash: 1.8.0
    • sentry-sdk: 1.9.0
    • setproctitle: 1.3.2
    • setuptools: 57.4.0
    • setuptools-git: 1.2
    • shapely: 2.0.0
    • shortuuid: 1.0.11
    • six: 1.15.0
    • sklearn-pandas: 1.8.0
    • smart-open: 6.3.0
    • smmap: 5.0.0
    • snowballstemmer: 2.2.0
    • sortedcontainers: 2.4.0
    • soundfile: 0.11.0
    • spacy: 3.4.4
    • spacy-legacy: 3.0.10
    • spacy-loggers: 1.0.4
    • sphinx: 1.8.6
    • sphinxcontrib-serializinghtml: 1.1.5
    • sphinxcontrib-websupport: 1.2.4
    • sqlalchemy: 1.4.45
    • sqlparse: 0.4.3
    • srsly: 2.4.5
    • statsmodels: 0.12.2
    • sympy: 1.7.1
    • tables: 3.7.0
    • tabulate: 0.8.10
    • tblib: 1.7.0
    • tenacity: 8.1.0
    • tensorboard: 2.9.1
    • tensorboard-data-server: 0.6.1
    • tensorboard-plugin-wit: 1.8.1
    • tensorboardx: 2.5.1
    • tensorflow: 2.9.2
    • tensorflow-datasets: 4.6.0
    • tensorflow-estimator: 2.9.0
    • tensorflow-gcs-config: 2.9.1
    • tensorflow-hub: 0.12.0
    • tensorflow-io-gcs-filesystem: 0.28.0
    • tensorflow-metadata: 1.12.0
    • tensorflow-probability: 0.17.0
    • termcolor: 2.1.1
    • terminado: 0.13.3
    • testpath: 0.6.0
    • text-unidecode: 1.3
    • textblob: 0.15.3
    • thinc: 8.1.5
    • threadpoolctl: 3.1.0
    • tifffile: 2022.10.10
    • toml: 0.10.2
    • tomli: 2.0.1
    • toolz: 0.12.0
    • torch: 1.13.0+cu116
    • torchaudio: 0.13.0+cu116
    • torchmetrics: 0.11.0
    • torchsummary: 1.5.1
    • torchtext: 0.14.0
    • torchvision: 0.14.0+cu116
    • tornado: 6.0.4
    • tqdm: 4.64.1
    • traitlets: 5.7.1
    • tweepy: 3.10.0
    • typeguard: 2.7.1
    • typer: 0.7.0
    • typing-extensions: 4.4.0
    • tzlocal: 1.5.1
    • uritemplate: 3.0.1
    • urllib3: 1.25.11
    • vega-datasets: 0.9.0
    • wandb: 0.13.7
    • wasabi: 0.10.1
    • wcwidth: 0.2.5
    • webargs: 8.2.0
    • webencodings: 0.5.1
    • werkzeug: 1.0.1
    • wheel: 0.38.4
    • widgetsnbextension: 3.6.1
    • wordcloud: 1.8.2.2
    • wrapt: 1.14.1
    • xarray: 2022.12.0
    • xarray-einstats: 0.4.0
    • xgboost: 0.90
    • xkit: 0.0.0
    • xlrd: 1.2.0
    • xlwt: 1.3.0
    • yarl: 1.8.2
    • yellowbrick: 1.5
    • zict: 2.2.0
    • zipp: 3.11.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.8.16
    • version: #1 SMP Fri Aug 26 08:44:51 UTC 2022

More info


What I really want for Christmas this year, all packaged together:
  • I have a CSVLogger that persists to S3.
  • I have a WandbLogger that saves checkpoints to Wandb.
  • I have an S3 `trainer.default_root_dir` that also saves checkpoints to S3.
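
Until all three work together, a possible stopgap is to copy the locally written checkpoints to the bucket after training. This is only a sketch under stated assumptions, not a fix confirmed by the maintainers: it assumes `fsspec` and `s3fs` are installed, and `remote_target` and `mirror_checkpoints` are hypothetical helper names.

```python
def remote_target(root: str, run_name: str) -> str:
    """Build the destination URI <root>/<run_name>/checkpoints without doubled slashes."""
    return f"{root.rstrip('/')}/{run_name.strip('/')}/checkpoints"


def mirror_checkpoints(local_dir: str, root: str, run_name: str) -> str:
    """Copy a local checkpoint directory to remote storage and return the remote path."""
    # Imported lazily so the path helper above works without fsspec installed.
    from fsspec.core import url_to_fs

    fs, remote_path = url_to_fs(remote_target(root, run_name))
    # put() uploads local files; recursive=True copies the whole directory.
    fs.put(local_dir, remote_path, recursive=True)
    return remote_path
```

After `trainer.fit(...)`, this could be called as `mirror_checkpoints(trainer.checkpoint_callback.dirpath, f"{BORING_BUCKET}/wandbtest", logger.experiment.id)`, assuming the checkpoint callback wrote to a local directory. It copies once at the end of training, so it does not replace per-epoch cloud checkpointing.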

cc @awaelchli @morganmcg1 @borisdayma @scottire @parambharat @manangoel99
@turian turian added the needs triage Waiting to be triaged by maintainers label Dec 25, 2022
@awaelchli awaelchli added bug Something isn't working checkpointing Related to checkpointing logger: wandb Weights & Biases and removed needs triage Waiting to be triaged by maintainers labels Jan 12, 2023

stale bot commented Apr 14, 2023

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 14, 2023
@stale stale bot removed the won't fix This will not be worked on label Jul 6, 2024