Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add non-existing resume_from_checkpoint acceptance for auto-resubmit #4402

Merged

Conversation

tarepan
Copy link
Contributor

@tarepan tarepan commented Oct 28, 2020

What does this PR do?

Enable resume_from_checkpoint file == empty in Trainer for auto-resume/resubmit.
Feature #4366

Before submitting

  • proposed Not-yet-existing resume_from_checkpoint for auto-resubmit #4366 <= Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 👍

@tarepan tarepan changed the title Add empty resume_from_checkpoint acceptance for auto-resubmit Add non-existing resume_from_checkpoint acceptance for auto-resubmit Oct 28, 2020
@codecov
Copy link

codecov bot commented Oct 28, 2020

Codecov Report

Merging #4402 (743fe31) into master (ab7512d) will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4402   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         134     134           
  Lines        9950    9955    +5     
======================================
+ Hits         9261    9266    +5     
  Misses        689     689           

@awaelchli awaelchli added checkpointing Related to checkpointing feature Is an improvement or enhancement labels Oct 30, 2020
Add fsspec's HTTPFileSystem  support through http extras.
pl has supported remote http file (e.g. Lightning-AI#2925),
so this commit do not add new functionality.
@tarepan tarepan force-pushed the feature/4366_non_existing_checkpoint branch from a752190 to 1b29438 Compare October 30, 2020 13:04
Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition !

@tarepan
Copy link
Contributor Author

tarepan commented Nov 5, 2020

@justusschock Very sorry, I mistakenly make re-review request to you...

@Borda Borda enabled auto-merge (squash) December 23, 2020 16:22
@tarepan
Copy link
Contributor Author

tarepan commented Dec 23, 2020

To anyone:
I encountered weird build-Docs CI error.
I failed to fix this error, so we cannot merge this PR.
Please help.

Summary

build-Docs failed with import error.
This PR did not touch any import statement or directory structure.
I cannot find the cause of this error.

Error

CircleCI build-Docs

[autosummary] generating autosummary for: BECOMING_A_CORE_CONTRIBUTOR.md, ...(※ abbreviated by me)
Failed to import 'pytorch_lightning.profiler': no module named pytorch_lightning.profiler

Warning, treated as error:
[autosummary] failed to import 'pytorch_lightning.callbacks.Callback': no module named pytorch_lightning.callbacks.Callback
Makefile:19: recipe for target 'html' failed
make: *** [html] Error 2

Exited with code exit status 2
CircleCI received exit code 2

Servey

This PR follow the master branch, and do not touch any module import matters.
I guess that Sphinx through Make could have some relationship with this error.
For example, rebuild of long-untouch-file could reveal embedded/hidden bugs (because Sphinx through Make reuse unchanged files).

@s-rog s-rog disabled auto-merge December 28, 2020 03:22
@tarepan tarepan mentioned this pull request Jan 2, 2021
1 task
@tarepan
Copy link
Contributor Author

tarepan commented Jan 2, 2021

The Build-Docs CI bug is fixed.

Copy link
Member

@awaelchli awaelchli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Borda it goes to 1.2 branch yes?

tests/models/test_restore.py Outdated Show resolved Hide resolved
requirements/docs.txt Outdated Show resolved Hide resolved
tarepan and others added 3 commits January 3, 2021 06:15
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
@Borda Borda enabled auto-merge (squash) January 4, 2021 19:55
@tarepan
Copy link
Contributor Author

tarepan commented Jan 4, 2021

Ready to Merge.

2 Red CIs are not from this PR.

  • CI build Docker / build-conda (3.8, 1.6)
  • CI build Docker / build-conda (3.7, 1.7)

The cause of both CI is Horovod build with GPU support was requested but this PyTorch installation does not support CUDA.

@Borda Borda merged commit b0051e8 into Lightning-AI:master Jan 5, 2021
@tarepan tarepan deleted the feature/4366_non_existing_checkpoint branch January 5, 2021 02:38
Borda pushed a commit that referenced this pull request Jan 6, 2021
…4402)

* Add empty resume_from_checkpoint acceptance #4366

* Fix general error catch with focused file check

* Add fsspec HTTP extras

Add fsspec's HTTPFileSystem  support through http extras.
pl has supported remote http file (e.g. #2925),
so this commit do not add new functionality.

* Fix potential too much logging in DDP

* Add PR changelog

* Add well-written argument explanation

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix DDP-compatible restore logging

Notify from where the states are restored.
This feature temporally deleted as a result of PR review.
With succeeding review, added with DDP compatibility.

* Fix utility import pathes

* Refactor load step commentaries

* Refactor hpc ckpt suffix acquisition

* Refactor restore/hpc_load match

* Refactor hpc load trial

* Refactor checkpoint dir check

* Refactor unneeded function nest

* Refactor nested If

* Refactor duplicated cache clear

* Refactor attempt flow with if/elif

* Fix pip8

* Refactor hook commentary

Co-authored-by: chaton <thomas@grid.ai>

* Fix pep8

* Refactor hpc load checkpoint path acquisition

* Fix pip8

* Fix typo

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix typo

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix doc

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Refactor None Union type with Optional

* Fix build-doc CI failure debuged in #5329

* Fix fsspec import during build-doc #5329

* Fix test epoch

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Fix test with latest test models

* .

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Roger Shieh <sh.rog@protonmail.ch>

(cherry picked from commit b0051e8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
checkpointing Related to checkpointing design Includes a design discussion feature Is an improvement or enhancement ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants