
Changes resume_from_checkpoint warning to error #7075

Merged
3 commits merged into Lightning-AI:master on Apr 28, 2021

Conversation

@vballoli (Contributor) commented Apr 17, 2021

What does this PR do?

Raises a FileNotFoundError if an invalid checkpoint path is passed.

Fixes #7072
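
For context, a minimal sketch of the behaviour this PR introduces (not the actual diff; the helper name and error message are illustrative assumptions):

import os

def _check_resume_path(checkpoint_path: str) -> str:
    # Sketch: an explicitly passed resume_from_checkpoint path that does not
    # exist should raise instead of warning and silently training from scratch.
    if not os.path.isfile(checkpoint_path):
        raise FileNotFoundError(f"Checkpoint file not found: {checkpoint_path}")
    return checkpoint_path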

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@vballoli changed the title from "Fix resume_from_checkpoint" to "Changes resume_from_checkpoint warning to error" on Apr 17, 2021
codecov bot commented Apr 17, 2021

Codecov Report

Merging #7075 (cb021e4) into master (7b0b0d2) will decrease coverage by 7%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master   #7075     +/-   ##
========================================
- Coverage      92%     86%     -7%     
========================================
  Files         196     199      +3     
  Lines       12569   15504   +2935     
========================================
+ Hits        11590   13257   +1667     
- Misses        979    2247   +1268     

@Borda added the checkpointing (Related to checkpointing) and feature (Is an improvement or enhancement) labels on Apr 18, 2021
@Borda added this to the 1.3 milestone on Apr 18, 2021
@ananthsub (Contributor) left a comment

Thanks for the fix! Mind adding a simple test for this?

@vballoli (Contributor, Author) commented Apr 18, 2021

@ananthsub I've fixed the test for this in tests/models/test_restore.py, at L134: def test_try_resume_from_non_existing_checkpoint(tmpdir).

@vballoli requested a review from @ananthsub on April 18, 2021 at 16:55
@awaelchli (Member) commented Apr 18, 2021

It actually looks like the previous behavior was expected: there is even a test for it, and the test's docstring explicitly says that no error should occur :)
So either keep the current behavior or change the test.

@vballoli (Contributor, Author) commented

@awaelchli I have changed the test to expect a FileNotFoundError.
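
A rough sketch of what such a test could look like (pytest.raises around the fit call; the model import path and Trainer arguments are assumptions, not the exact test from this PR):

import pytest
from pytorch_lightning import Trainer
# BoringModel is the minimal LightningModule used throughout the Lightning test
# suite; its import path may differ between versions.
from tests.helpers import BoringModel

def test_try_resume_from_non_existing_checkpoint(tmpdir):
    """Resuming from a checkpoint path that does not exist should raise."""
    trainer = Trainer(
        default_root_dir=str(tmpdir),
        max_steps=1,
        resume_from_checkpoint=str(tmpdir.join("does-not-exist.ckpt")),
    )
    with pytest.raises(FileNotFoundError):
        trainer.fit(BoringModel())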

@carmocca (Contributor) left a comment

cc: @tarepan who contributed this in #4402

This would break your workflow, correct?

Should we just try to make the warning more visible instead?

(A review suggestion on tests/models/test_restore.py was applied and resolved; co-authored by Carlos Mocholí <carlossmocholi@gmail.com>.)
@tarepan (Contributor) commented Apr 22, 2021

@carmocca, sorry for the late reply.
As you mentioned, this behavior (starting from scratch when the ckpt dir is empty) is intended for auto-resume.
A more explicit warning will be needed (sorry @vballoli, I should have documented this behavior explicitly before you hit this problem).

@awaelchli (Member) commented Apr 22, 2021

We could put an additional warning/info into the docs for more visibility.

@carmocca (Contributor) left a comment

As discussed, please instead update the warning and/or docs as you see fit for better visibility 👍

cc: @dsquareindia who created the issue and might have an idea of how this could be clearer

@vballoli (Contributor, Author) commented

I'm not sure how to make this more visible, since I believe the previous message was clear enough if that was the expected behavior. IMO, if the checkpoint isn't found it should be an error instead of a warning: an explicitly passed argument that can't be resolved (e.g. because of a mistake in specifying the location) is in direct conflict with the user's intent.

@devashishd12 commented

It's mostly a design decision, but I do agree with @vballoli that if an incorrect argument is passed, there should be an error. That said, it's of course not great to break existing workflows, and I can't fully judge the downstream impact this change would have; that's something the maintainers can do. If the latter approach is preferred, I can't think of anything other than making the warning stand out more, e.g. ###### WARNING: CHECKPOINT FILE NOT FOUND. TRAINING FROM SCRATCH. ######. I only reported this "issue" because I found the behaviour odd: in almost all other libraries there would be a FileNotFoundError. In any case, thanks for the help :D

@tchaton (Contributor) commented Apr 28, 2021

Hey @awaelchli @carmocca,

I tend to agree with @vballoli on this one.

If a user explicitly provides a resume_from_checkpoint to the Trainer and the checkpoint doesn't exist,
the user would expect the training to crash and not just restart from scratch (which has a real cost).
They will get frustrated if they lose money due to this unwanted behaviour.
Furthermore, users who want the old behaviour can opt back into it with the line below (a self-contained version follows this comment):

Trainer(resume_from_checkpoint=checkpoint_path if os.path.exists(checkpoint_path) else None)

Best,
T.C
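
The one-liner above as a self-contained snippet, with the import it needs (an illustration of the opt-out, not part of the original comment; the path is hypothetical):

import os

from pytorch_lightning import Trainer

checkpoint_path = "checkpoints/last.ckpt"  # hypothetical location
# Opt back into "start from scratch if the checkpoint is missing":
trainer = Trainer(
    resume_from_checkpoint=checkpoint_path if os.path.exists(checkpoint_path) else None
)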

@carmocca (Contributor) commented

I agree too. But unfortunately, #4402 was merged and we overlooked this, which means changing it now has backwards-compatibility (BC) implications for users. So we have 3 options going forward:

  1. Keep the behaviour as is.
  2. Revert to an error (the current changes in this PR).
  3. Show a (deprecation) warning when this condition happens and switch the behaviour to an error in the future (sketched after this comment).

What do you prefer?
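
For reference, a rough sketch of what option 3 could look like, i.e. a deprecation period before switching to an error (the helper name and message are assumptions, not actual Lightning code):

import os
import warnings

def resolve_resume_path(checkpoint_path):
    # Hypothetical option 3: warn (as a deprecation) today, and switch the
    # behaviour to FileNotFoundError in a future release.
    if checkpoint_path is not None and not os.path.exists(checkpoint_path):
        warnings.warn(
            f"resume_from_checkpoint={checkpoint_path!r} does not exist. Starting from scratch "
            "is deprecated and will raise a FileNotFoundError in a future release.",
            DeprecationWarning,
        )
        return None
    return checkpoint_path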

@tarepan (Contributor) commented Apr 28, 2021

I basically agree with the discussion about the BC-breaking change.
In that case, where do we move the auto-resume feature?

Even before the #4402 PR, PyTorch Lightning already had an auto-resume feature for SLURM (docs).
I made the #4402 PR in order to extend this SLURM-specific feature to the general situation.

If we totally remove the auto-resume feature, it will break a long-used (and of course useful) feature.
In my opinion, if we remove auto-resume from resume_from_checkpoint, we need another argument (e.g. auto_resume_from).

@williamFalcon (Contributor) commented

(Quoting @tchaton's comment above in full, including the Trainer(resume_from_checkpoint=...) workaround.)

yes... 100% agree here.

@williamFalcon (Contributor) commented

(Quoting @carmocca's three options above.)

i

@williamFalcon (Contributor) commented

Sorry, I meant to cancel a comment, not close it.

I was going to agree with @carmocca's option 2.

@carmocca (Contributor) commented

(Quoting @tarepan's point above about needing a separate auto_resume_from argument if auto-resume is removed.)

@tarepan, can't you rely on the check Thomas proposed?:

Trainer(resume_from_checkpoint=checkpoint_path if os.path.exists(checkpoint_path) else None)

@carmocca (Contributor) left a comment

Approving as we are going with option (2)

@carmocca added the design (Includes a design discussion) label on Apr 28, 2021
@tarepan (Contributor) commented Apr 28, 2021

@carmocca I got it! Is my understanding below correct?

# Import paths as of Lightning ~1.3 (added for completeness):
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.utilities.cloud_io import get_filesystem

CKPT_PATH = "/~~~~"
ckpt_callback = ModelCheckpoint(dirpath=CKPT_PATH, ...)
nullable_ckpt = CKPT_PATH if get_filesystem(CKPT_PATH).exists(CKPT_PATH) else None
Trainer(callbacks=[ckpt_callback], resume_from_checkpoint=nullable_ckpt)

With saving to CKPT_PATH plus this nullable resume path, we can still auto-resume. It's cool, thanks @tchaton.
I agree with option 2.
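
One subtlety worth noting: ModelCheckpoint's dirpath is a directory, while resume_from_checkpoint expects a checkpoint file, so an auto-resume wrapper would typically resolve the newest .ckpt inside that directory first. A hedged sketch (plain Python, not a Lightning API):

import glob
import os

def latest_checkpoint(ckpt_dir: str):
    # Return the most recently modified .ckpt file in ckpt_dir, or None if
    # none exists yet, so resume_from_checkpoint is always a file path or None.
    candidates = glob.glob(os.path.join(ckpt_dir, "*.ckpt"))
    return max(candidates, key=os.path.getmtime) if candidates else None

# Usage (paths hypothetical):
# trainer = Trainer(callbacks=[ckpt_callback], resume_from_checkpoint=latest_checkpoint(CKPT_PATH))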

@carmocca merged commit ccd87ca into Lightning-AI:master on Apr 28, 2021
@awaelchli (Member) commented

Hi, the test can be updated, and because this change is backward incompatible, we should not forget the changelog entry. I will send a PR.

Labels: checkpointing (Related to checkpointing), design (Includes a design discussion), feature (Is an improvement or enhancement)
Linked issue: resume_from_checkpoint should not start from scratch if ckpt is not found (#7072)
9 participants