
[checkpoint] Resolve 2 different checkpoint loading paths across fit vs validate/test/predict #9405

Closed
ananthsub opened this issue Sep 9, 2021 · 2 comments · Fixed by #10061
Labels: checkpointing (related to the checkpointing feature) · feature (improvement or enhancement) · let's do it! (approved to implement) · refactor

Comments

@ananthsub (Contributor)

Proposed refactoring or deprecation

Consolidate checkpoint loading code across fit and validate/test/predict

Motivation

We have two different code paths for checkpoint loading:

  1. The resume_from_checkpoint constructor argument, consumed by Trainer.fit
  2. The ckpt_path argument passed to Trainer.validate/test/predict
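To make the asymmetry concrete, here is a minimal stub (not the real pytorch_lightning.Trainer, only a sketch of the shape of its 1.4-era signatures) showing where each entry point receives its checkpoint path:

```python
# Hypothetical stub illustrating the two entry points described above.
# This mirrors the API shape only; the real Trainer does far more.
class Trainer:
    def __init__(self, resume_from_checkpoint=None):
        # path 1: fit picks up the checkpoint from the constructor
        self.resume_from_checkpoint = resume_from_checkpoint

    def fit(self, model):
        return ("fit", self.resume_from_checkpoint)

    def validate(self, model, ckpt_path=None):
        # path 2: evaluation entry points take the path per call
        return ("validate", ckpt_path)


trainer = Trainer(resume_from_checkpoint="last.ckpt")
print(trainer.fit(model=None))                              # ('fit', 'last.ckpt')
print(trainer.validate(model=None, ckpt_path="best.ckpt"))  # ('validate', 'best.ckpt')
```

The same logical operation, "restore from this checkpoint before running", is thus configured in two unrelated places depending on the entry point.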

Offering multiple code paths here risks divergence. Lightning must ensure a consistent experience for checkpoint loading across these different entry points.

Background

These are the paths today: Trainer.fit restores through CheckpointConnector.restore_model (along with the rest of the trainer state), while Trainer.validate/test/predict restore only the module weights through CheckpointConnector.restore_model_weights.

Pitch

  1. Unify in CheckpointConnector
    CheckpointConnector.restore_model is a superset of CheckpointConnector.restore_model_weights, which suggests we don't need both.

  2. Unify in trainer.py

The overall sequence in _run can be shared:

  • Load & validate the checkpoint dict through the training type plugin
  • Load the datamodule state
  • Load the LightningModule state through the training type plugin
  • Load the callback states
  • Load the loop states
  • If fitting, load the precision, optimizer, and LR scheduler states
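The steps above can be sketched as a single shared function. This is a hypothetical sketch under assumed checkpoint keys (the function name and the keys such as "datamodule_state" and "loops" are illustrative, not the actual Lightning internals):

```python
# Hypothetical sketch of the shared restore order proposed above.
def restore_from_checkpoint(checkpoint: dict, fitting: bool) -> dict:
    """Apply the proposed shared restore sequence and report what was restored."""
    restored = {
        # the training type plugin is assumed to have already loaded
        # and validated `checkpoint`
        "datamodule": checkpoint.get("datamodule_state", {}),
        "module": checkpoint["state_dict"],       # LightningModule state
        "callbacks": checkpoint.get("callbacks", {}),
        "loops": checkpoint.get("loops", {}),
    }
    if fitting:
        # optimizer / LR scheduler restores only apply when fitting
        restored["optimizers"] = checkpoint.get("optimizer_states", [])
        restored["lr_schedulers"] = checkpoint.get("lr_schedulers", [])
    return restored


ckpt = {"state_dict": {"layer.weight": [0.0]}, "optimizer_states": [{"lr": 0.1}]}
print(sorted(restore_from_checkpoint(ckpt, fitting=False)))
# ['callbacks', 'datamodule', 'loops', 'module']
```

With one function and a single `fitting` flag, fit and validate/test/predict would differ only in the trailing trainer-state restores rather than in two divergent code paths.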

We have 5 different properties exposed for the checkpoint path to resume from (excluding the HPC logic):

It's unclear what the lifecycle of these properties should be. Do successive calls to validate/test/predict end up relying on this?

Proposal:

  • Pass the checkpoint path to resume from as an argument to Trainer._run, avoiding the dependency on these properties
  • Deprecate these properties

  3. Unify in the Trainer API
    Proposal: deprecate resume_from_checkpoint in the Trainer constructor and add a new ckpt_path argument to Trainer.fit. This provides API consistency with validate/test/predict.
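A hedged sketch of what that deprecation could look like (the shim below only mirrors the proposal; the real Trainer signature is far larger, and the attribute name _legacy_ckpt_path is invented for illustration):

```python
import warnings

# Hypothetical shim sketching the proposed deprecation path; not the real Trainer.
class Trainer:
    def __init__(self, resume_from_checkpoint=None):
        if resume_from_checkpoint is not None:
            warnings.warn(
                "`resume_from_checkpoint` is deprecated; pass `ckpt_path` to "
                "`Trainer.fit` instead.",
                DeprecationWarning,
            )
        self._legacy_ckpt_path = resume_from_checkpoint

    def fit(self, model=None, ckpt_path=None):
        # the new per-call argument wins over the deprecated constructor one
        ckpt_path = ckpt_path or self._legacy_ckpt_path
        return ckpt_path  # stand-in for the actual restore + fit loop
```

During the deprecation window both spellings keep working, with the per-call ckpt_path taking precedence, which matches how validate/test/predict already accept their checkpoint.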

Additional context



@ananthsub (Contributor, Author)

Paraphrasing @awaelchli:

Removing the trainer argument will greatly impact usability and configurability through the CLI. To that you will counter-argue that the Trainer should not be a configuration system. Still, it is sad to see it simply taken away.

But how does the CLI work today for passing ckpt_path to validate, test, or predict? Why is this any different from that?

Related to that I was hoping we could take this one up some day: #5339

I commented there with what could be viewed as an extension of the trainer-argument discussion here.

@carmocca (Contributor)

But how does the CLI work today for passing ckpt_path to validate, test, or predict? Why is this any different from that?

In the CLI you specify the fit checkpoint with

python script.py fit --trainer.resume_from_checkpoint=...

and the test checkpoint with:

python script.py test --ckpt_path=...

This proposal would make the two consistent.


I personally think there is a lot of value in doing this; if it isn't pursued, I'd suggest renaming resume_from_checkpoint to fit_ckpt_path to address the confusion.
