Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ModelCheckpoints name formatting #3163

Merged
merged 19 commits into from
Sep 18, 2020
Merged

Fix ModelCheckpoints name formatting #3163

merged 19 commits into from
Sep 18, 2020

Conversation

carmocca
Copy link
Contributor

@carmocca carmocca commented Aug 25, 2020

What does this PR do?

  • Fixes several typos.
  • Adds CHECKPOINT_JOIN_CHAR, CHECKPOINT_SUFFIX variables for customization.
  • Allows using templates in the last checkpoint saved.
  • Adds a test.

Fixes #3125
Fixes #2583
Closes #2584

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

@pep8speaks
Copy link

pep8speaks commented Aug 25, 2020

Hello @carmocca! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-18 13:13:23 UTC

@mergify mergify bot requested a review from a team August 25, 2020 15:41
@Borda Borda added bug Something isn't working feature Is an improvement or enhancement labels Aug 25, 2020
@Borda Borda added this to the 0.9.x milestone Aug 25, 2020
@mergify mergify bot requested a review from a team August 25, 2020 15:53
@carmocca
Copy link
Contributor Author

Should we add the possibility of modifying these variables to the doc-string?

@codecov
Copy link

codecov bot commented Aug 25, 2020

Codecov Report

Merging #3163 into master will decrease coverage by 0%.
The diff coverage is 97%.

@@          Coverage Diff           @@
##           master   #3163   +/-   ##
======================================
- Coverage      91%     91%   -0%     
======================================
  Files         109     109           
  Lines        8031    8066   +35     
======================================
+ Hits         7291    7311   +20     
- Misses        740     755   +15     

@Borda
Copy link
Member

Borda commented Aug 25, 2020

Should we add the possibility of modifying these variables to the doc-string?

which variables do you mean?

@carmocca
Copy link
Contributor Author

carmocca commented Aug 25, 2020

CHECKPOINT_JOIN_CHAR, CHECKPOINT_NAME_LAST, and CHECKPOINT_SUFFIX

@carmocca
Copy link
Contributor Author

I believe this would also solve #2583 #2584 if merged

@mergify mergify bot requested a review from a team August 25, 2020 18:12
@rohitgr7
Copy link
Contributor

I believe this would also solve #2583 #2584 if merged

First, we need to clear up whether we want a separate filename parameter or not, because #2583 is a different issue. I don't think this PR solves that problem.

@carmocca
Copy link
Contributor Author

carmocca commented Aug 25, 2020

First, we need to clear up whether we want a separate filename parameter or not, because #2583 is a different issue. I don't think this PR solves that problem.

With the changes in this PR, one can do:

checkpoint_callback = pl.callbacks.ModelCheckpoint(filepath="my/file/path/name")
# Generates checkpoints called "name.ckpt" in my/file/path

But if filepath is an existing directory:

checkpoint_callback = pl.callbacks.ModelCheckpoint(filepath="my/file/path")
# Generates checkpoints called "epoch={epoch}.ckpt" in my/file/path
# note: this is the previous default behaviour

So there is no need of a filename parameter.

@ydcjeff
Copy link
Contributor

ydcjeff commented Aug 26, 2020

can we have a test for this r = ModelCheckpoint._format_checkpoint_name('path/to/ckpt', 3, {}, prefix='test') as well?

@carmocca
Copy link
Contributor Author

carmocca commented Aug 27, 2020

can we have a test for this r = ModelCheckpoint._format_checkpoint_name('path/to/ckpt', 3, {}, prefix='test') as well?

_format_checkpoint_name's first parameter is assumed to be a filename. Your example passes a filepath so the result would be r == "test-path/to/ckpt". I added tests using filepaths with format_checkpoint_name(no leading _).

Comment on lines 406 to 394
self._del_model(self.last_model_path)
self._save_model(filepath, trainer, pl_module)
self.last_model_path = filepath
Copy link
Member

@awaelchli awaelchli Aug 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we (in a new PR pls) first rename the model to "oldname.ckpt.backup" before saving the model, and then delete the backup file after save has completed? Otherwise, if a crash happens during save_model, we will have lost the old checkpoint and are only left with a corrupted file for the current.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new PR based on master or this branch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, just a small suggestion after this PR is merged. but it will touch other locations as well, for example topk saving.

@mergify mergify bot requested a review from a team August 29, 2020 12:19
@awaelchli
Copy link
Member

I prefer this PR over the one that adds an extra filename parameter.

@rohitgr7
Copy link
Contributor

So there is no need of a filename parameter.

The reason I suggested separate dirpath and filename parameter is because it gives the functionality to get the default versioning of checkpoints out-of-the-box as if dirpath is set to None and filename is defined then ModelCheckpoint will use the versioning functionality of checkpoints in lightning_logs/version_num. Also, resolve some issues that are handled by his PR and is a bit easy to use and manipulate checkpoints for users.

@ananthsub
Copy link
Contributor

ananthsub commented Aug 31, 2020

Regarding this:

# Generates checkpoints called "name.ckpt" in my/file/path
checkpoint_callback = pl.callbacks.ModelCheckpoint(filepath="my/file/path/name")

This may not be the only intention for the filepath. For instance, the user could also expect the callback to save checkpoints to directory my/file/path/name with name {epoch}.ckpt. This is hard to distinguish with filepath being overloaded. It's also confusing because you could pass a yet non-existent directory path as filepath to the checkpoint callback and expect Lightning to make the directory to save checkpoints to. Parsing whether the last component should be part of the directory or the checkpoint name is impossible to do right, and suggests better support from the API arguments. IMO splitting filepath into 2 args, dirpath and file_template_name (with default value = "{epoch}"), would resolve this nicely.

@carmocca
Copy link
Contributor Author

I am not against the idea of splitting filepath into dirpath and filename, however, that discussion is out of scope of this PR.

Considering that this PR does not change the API, we could merge this and then (possibly?) work on the filepath improvement.

@carmocca
Copy link
Contributor Author

carmocca commented Sep 3, 2020

Anything holding this back?

There will be some conflicts with #3320. Should be easy to fix

@Borda
Copy link
Member

Borda commented Sep 3, 2020

Let's process it...

@ydcjeff
Copy link
Contributor

ydcjeff commented Sep 3, 2020

Just found out this
Shall we fix this typo in this PR too?
https://github.com/PyTorchLightning/pytorch-lightning/blob/83e51def23ecf98c7c4d98a212b52af4c0a40515/pytorch_lightning/callbacks/early_stopping.py#L112

Should be validation_epoch_end

@Borda
Copy link
Member

Borda commented Sep 3, 2020

Shall we fix this typo in this PR too?

yes, pls

@carmocca
Copy link
Contributor Author

carmocca commented Sep 3, 2020

@ydcjeff @Borda done!

@mergify
Copy link
Contributor

mergify bot commented Sep 3, 2020

This pull request is now in conflict... :(

tests/callbacks/test_model_checkpoint.py Outdated Show resolved Hide resolved
tests/callbacks/test_model_checkpoint.py Outdated Show resolved Hide resolved
tests/callbacks/test_model_checkpoint.py Outdated Show resolved Hide resolved
tests/trainer/test_trainer.py Outdated Show resolved Hide resolved
@mergify mergify bot requested a review from a team September 17, 2020 19:51
@mergify mergify bot requested a review from a team September 17, 2020 20:07
Copy link
Member

@awaelchli awaelchli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great improvement. Thanks for your effort.
Please note for the future, this PR has some independent changes that, although they are good to have, they could have been split over multiple PR's.

pytorch_lightning/callbacks/model_checkpoint.py Outdated Show resolved Hide resolved
@Borda Borda added the ready PRs ready to be merged label Sep 18, 2020
@Borda Borda merged commit 580b04b into Lightning-AI:master Sep 18, 2020
@awaelchli
Copy link
Member

awaelchli commented Sep 21, 2020

Parsing whether the last component should be part of the directory or the checkpoint name is impossible to do right, and suggests better support from the API arguments. IMO splitting filepath into 2 args, dirpath and file_template_name (with default value = "{epoch}"), would resolve this nicely.

as mentioned by @ananthsub it is not possible to distinguish between filename and folder, this bug: #3001
My suggestion was to force user to choose a file extension so that we know when it is a file. But perhaps a separate arg is nicer.
Also there is a 2nd bug which complicates things, self._fs.isdir checks for a real existing dir, but we want to accept a folder name that does not exist yet.

Either way, to fix this I don't see a good way that keeps back compatibility. If back compatibility is unavoidable, I feel like now with the v1 release would be a good time to introduce this breaking change. If not, I think we should at least update docs.
@PyTorchLightning/core-contributors

@carmocca
Copy link
Contributor Author

carmocca commented Sep 21, 2020

Either way, to fix this I don't see a good way that keeps back compatibility.

Cant we keep back compat by having filepath (old parameter) and dirpath, filename (new parameters)?

  • if filepath is set: old logic, show deprecated warning.
  • if dirpath and filename are set: new logic.
  • if filepath, dirpath, and filename are set: raise MisconfigurationException or use new logic

BTW this discussion should continue elsewhere. Maybe in the issue you linked.

@Borda
Copy link
Member

Borda commented Sep 21, 2020

Also there is a 2nd bug which complicates things, self._fs.isdir checks for a real existing dir, but we want to accept a folder name that does not exist yet.

then we can do some bubble up testing to find the existing folder in upstream and create the remaining?

Either way, to fix this I don't see a good way that keeps back compatibility. If back compatibility is unavoidable, I feel like now with the v1 release would be a good time to introduce this breaking change. If not, I think we should at least update docs.

yes, 1.0 is a good place to do it...

@Borda
Copy link
Member

Borda commented Sep 21, 2020

BTW this discussion should continue elsewhere. Maybe in the issue you linked.

@carmocca mind creating a new issue for this discussion and link it here...

@carmocca carmocca deleted the modelcheckpoint-naming branch September 22, 2020 18:33
@carmocca carmocca self-assigned this Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working feature Is an improvement or enhancement ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Model checkpoint naming consistency Change checkpoint filename without overwriting full filepath
8 participants