Not-yet-existing resume_from_checkpoint for auto-resubmit #4366

Closed
tarepan opened this issue Oct 26, 2020 · 4 comments
Labels
checkpointing Related to checkpointing feature Is an improvement or enhancement help wanted Open to be worked on

Comments

@tarepan
Contributor

tarepan commented Oct 26, 2020

🚀 Feature

Accept a not-yet-existing resume_from_checkpoint in Trainer to enable automatic training resume / auto-resubmit.

Motivation

Cloud ML training services (e.g. Google AI Platform Training, AWS SageMaker, AWS Batch) offer a job auto-retry feature.
If we could specify a checkpoint path up front, job auto-retry could be used to resume / resubmit training.
Unfortunately, PyTorch Lightning does not accept a non-(yet-)existing file as the resume_from_checkpoint argument of Trainer; it simply raises an error.
The motivation for this feature request is to enable training resume through a not-yet-existing resume_from_checkpoint.
(This feature looks similar to the auto-resubmit of pl's SLURM support, but I am a complete newbie about it, so this could be nonsense.)

Pitch

Current checkpoint restore process:
https://github.com/PyTorchLightning/pytorch-lightning/blob/3abfec896212ea85e45d6ac3ccb323ef242d16de/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L57-L60
It uses an (existing) resume_from_checkpoint.
If there is no file at the specified path, it raises an error.

What I hope for is:

         1. Restore HPC weights, if any.
         2. If there are no HPC weights, "try to" restore checkpoint_path weights.
         3. Otherwise, don't restore weights.

This means that if the checkpoint_path (resume_from_checkpoint) file does not exist, pl simply ignores it and starts training from scratch.
In that case, training starts normally from scratch, and pl then saves checkpoints.
If we set the save path equal to resume_from_checkpoint, the latest checkpoint file will exist at the resume_from_checkpoint path.
When job auto-retry is triggered, the checkpoint file now exists at resume_from_checkpoint, so the retried job loads it and training resumes properly.
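
The three-step restore order above can be sketched as follows. This is only a minimal illustration under stated assumptions, not PyTorch Lightning's actual code: the function name `choose_restore_source` and its parameters are hypothetical, and it checks local files with `os.path.isfile` only (remote paths are out of scope here).

```python
import os
from typing import Optional


def choose_restore_source(
    hpc_ckpt_path: Optional[str],
    resume_from_checkpoint: Optional[str],
) -> Optional[str]:
    """Return the checkpoint file to restore from, or None to train from scratch.

    Hypothetical helper for illustration; not part of pytorch_lightning.
    """
    # 1. HPC weights take priority when they exist.
    if hpc_ckpt_path is not None and os.path.isfile(hpc_ckpt_path):
        return hpc_ckpt_path
    # 2. Otherwise "try to" use resume_from_checkpoint.
    #    A missing file is not an error; it just means "first run".
    if resume_from_checkpoint is not None and os.path.isfile(resume_from_checkpoint):
        return resume_from_checkpoint
    # 3. Nothing to restore: start from scratch.
    return None
```

On the first run the file is absent, so the function returns None and training starts fresh; on an auto-retried run the same call returns the path that the previous run saved to.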

Alternatives

Use the resume system of hpc_save & hpc_load for normal training.
As far as I can tell from the code, "HPC weights load" (for SLURM...?) enables auto-resubmit based on a directory (not a file) plus a file naming rule (hpc_ckpt_{ckpt_number}.ckpt).
If Trainer accepted a checkpoint directory (e.g. resume_from_checkpoint_dir), the same mechanism could be used for resume/resubmit.
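
A directory-based lookup of this kind could look roughly like the sketch below. It assumes only the naming rule quoted above (hpc_ckpt_{ckpt_number}.ckpt) and that the highest number is the newest checkpoint; the function name `latest_hpc_checkpoint` is hypothetical and this is not the library's own implementation.

```python
import os
import re
from typing import Optional

# File naming rule from the issue text: hpc_ckpt_{ckpt_number}.ckpt
_HPC_CKPT_PATTERN = re.compile(r"^hpc_ckpt_(\d+)\.ckpt$")


def latest_hpc_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the highest-numbered hpc_ckpt_*.ckpt in ckpt_dir, or None.

    Hypothetical helper for illustration; not part of pytorch_lightning.
    """
    # A missing directory means there is nothing to resume from.
    if not os.path.isdir(ckpt_dir):
        return None
    numbered = []
    for name in os.listdir(ckpt_dir):
        match = _HPC_CKPT_PATTERN.match(name)
        if match:
            # Compare numerically so that 10 sorts after 2.
            numbered.append((int(match.group(1)), name))
    if not numbered:
        return None
    _, newest = max(numbered)
    return os.path.join(ckpt_dir, newest)
```

Because the lookup works on a directory rather than a single file, the first run (empty or missing directory) and every retried run can share the same configuration.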

@tarepan tarepan added feature Is an improvement or enhancement help wanted Open to be worked on labels Oct 26, 2020
@SeanNaren SeanNaren added the checkpointing Related to checkpointing label Oct 26, 2020
tarepan added a commit to tarepan/pytorch-lightning that referenced this issue Oct 27, 2020
tarepan added a commit to tarepan/pytorch-lightning that referenced this issue Oct 27, 2020
tarepan added a commit to tarepan/pytorch-lightning that referenced this issue Oct 27, 2020
tarepan added a commit to tarepan/pytorch-lightning that referenced this issue Oct 27, 2020
@tarepan
Contributor Author

tarepan commented Oct 28, 2020

Added a draft pull request for discussion.

tarepan added a commit to tarepan/pytorch-lightning that referenced this issue Oct 30, 2020
@stale

stale bot commented Nov 27, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Nov 27, 2020
@stale stale bot closed this as completed Dec 4, 2020
@tarepan
Contributor Author

tarepan commented Dec 4, 2020

Waiting for merge.

@SeanNaren SeanNaren reopened this Dec 9, 2020
@stale stale bot removed the won't fix This will not be worked on label Dec 9, 2020
Borda added a commit that referenced this issue Jan 5, 2021
…4402)

* Add empty resume_from_checkpoint acceptance #4366

* Fix general error catch with focused file check

* Add fsspec HTTP extras

Add fsspec's HTTPFileSystem  support through http extras.
pl has supported remote http file (e.g. #2925),
so this commit do not add new functionality.

* Fix potential too much logging in DDP

* Add PR changelog

* Add well-written argument explanation

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix DDP-compatible restore logging

Notify from where the states are restored.
This feature temporally deleted as a result of PR review.
With succeeding review, added with DDP compatibility.

* Fix utility import pathes

* Refactor load step commentaries

* Refactor hpc ckpt suffix acquisition

* Refactor restore/hpc_load match

* Refactor hpc load trial

* Refactor checkpoint dir check

* Refactor unneeded function nest

* Refactor nested If

* Refactor duplicated cache clear

* Refactor attempt flow with if/elif

* Fix pip8

* Refactor hook commentary

Co-authored-by: chaton <thomas@grid.ai>

* Fix pep8

* Refactor hpc load checkpoint path acquisition

* Fix pip8

* Fix typo

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix typo

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix doc

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Refactor None Union type with Optional

* Fix build-doc CI failure debuged in #5329

* Fix fsspec import during build-doc #5329

* Fix test epoch

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix test with latest test models

* .

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>
@tarepan
Contributor Author

tarepan commented Jan 5, 2021

The PR is merged.
Thanks to all contributors for their efforts.

@tarepan tarepan closed this as completed Jan 5, 2021
Borda pushed a commit that referenced this issue Jan 6, 2021
…4402)


(cherry picked from commit b0051e8)