add current_epoch to dumped_params #3261

Merged: 6 commits into master on Oct 6, 2020

Conversation

maxjeblick (Contributor):

What does this PR do?

Fixes #3260 by restoring the current_epoch after finishing the batch size finder.
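For context, the batch size finder temporarily overrides some Trainer settings, saves the originals, and restores them afterwards; this PR adds current_epoch to that saved set. Below is a minimal sketch of the pattern; the helper names and the exact attribute list are illustrative assumptions, not the actual Lightning code.

```python
# Illustrative sketch only, not the real Lightning implementation.
# The batch size finder runs short trial fits, which advance Trainer state,
# so the attributes it touches are snapshotted first and restored afterwards.

def dump_params(trainer):
    # Hypothetical helper: snapshot the attributes the finder will mutate.
    return {
        "max_steps": trainer.max_steps,
        "current_epoch": trainer.current_epoch,  # added to the dump by this PR
    }

def restore_params(trainer, dumped):
    # Hypothetical helper: put the snapshot back once the finder is done.
    trainer.max_steps = dumped["max_steps"]
    trainer.current_epoch = dumped["current_epoch"]
```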

codecov bot commented Aug 29, 2020:

Codecov Report

Merging #3261 into master will increase coverage by 1%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #3261    +/-   ##
=======================================
+ Coverage      86%     87%    +1%     
=======================================
  Files         117     117            
  Lines        9353    9618   +265     
=======================================
+ Hits         8074    8357   +283     
+ Misses       1279    1261    -18     

rohitgr7 (Contributor) commented Aug 30, 2020:

global_step, total_batch_idx, and possibly more attributes need to be reset too. I believe a better solution would be to add a pre_fit method (or something similar) to Trainer and have .fit call it every time, so these attributes are reassigned to their default values on each .fit invocation.
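A rough sketch of this idea follows; _pre_fit and the attribute list are hypothetical, not an existing Trainer method.

```python
# Hypothetical sketch of the "reset at the start of every fit" idea.
# Neither _pre_fit nor this exact attribute list exists in Lightning; the
# point is that .fit() itself would reset progress counters, so a tuner run
# could never leak current_epoch / global_step into the real training run.

class Trainer:
    def _pre_fit(self):
        # Reset progress-tracking attributes to their defaults.
        self.current_epoch = 0
        self.global_step = 0
        self.total_batch_idx = 0

    def fit(self, model):
        self._pre_fit()  # called on every .fit() invocation
        ...              # the real training loop would run here
```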

awaelchli (Member):

@rohitgr7 There is a Tuner class coming (#3160), in which I believe the resetting problem will be tackled.

rohitgr7 (Contributor) commented Aug 30, 2020:

Nice!! @awaelchli will it support user-defined callbacks for lr_find and auto_scale_batch_size?

awaelchli (Member):

I'm not sure, but we recently added support for persisting callback state, so we should be able to dump and restore their state easily if that's the concern. For further questions, better to ask directly in #3160.

awaelchli added the bug (Something isn't working) label on Aug 30, 2020
maxjeblick (Contributor, Author):

The Tuner class looks really promising!

SkafteNicki (Member):

Just to update: the HyperTuner class is not going to happen at the moment. Instead, the tuner algorithms will be moved to a separate trainer.tune() method rather than being called by fit (work on this has already started in #3293). So we will need to fix this problem once that refactor is done.
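For readers unfamiliar with that refactor, the resulting usage split looks roughly like the following. This is only a sketch: the Trainer flag and the tune() call reflect the post-#3293 API as I understand it, and model stands in for a user-defined LightningModule.

```python
# Rough usage sketch of the tune/fit split (post-#3293), not taken from this PR.
from pytorch_lightning import Trainer

trainer = Trainer(auto_scale_batch_size=True, max_epochs=10)
trainer.tune(model)  # model: your LightningModule; runs the tuner algorithms
trainer.fit(model)   # the actual training run, which is why any Trainer state
                     # mutated during tuning (e.g. current_epoch) must be restored
```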

mergify bot commented Sep 7, 2020:

This pull request is now in conflict... :(

maxjeblick closed this on Sep 7, 2020
Borda reopened this on Oct 2, 2020
Borda (Member) commented Oct 2, 2020:

@maxjeblick mind adding a test to check that it is fixed?

Borda added this to the 1.0 milestone on Oct 5, 2020
SkafteNicki (Member) left a comment:

LGTM

SkafteNicki added the ready (PRs ready to be merged) label on Oct 6, 2020
awaelchli (Member) left a comment:

There are probably lots of other trainer attributes that need to be dumped, for example global_step?

Review comment on CHANGELOG.md (outdated, resolved)
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
SkafteNicki (Member):

@awaelchli you are probably right. I think the tuner algorithms need a refactoring after v1.0.

rohitgr7 (Contributor) commented Oct 6, 2020:

Shouldn't the tuner be refactored so that it doesn't call .fit again? Then we wouldn't need to dump and restore anything. It would just suggest values, and the user would update them and rerun.

SkafteNicki (Member):

@rohitgr7 we use fit internally in these algorithms so we don't have to redefine the training loop needed to run them. However, IMO moving forward we should refactor it so that the tuner creates a new trainer that is used during tuning and destroyed afterwards.
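A sketch of that temporary-trainer idea is below. The helper function is hypothetical, and the tuner call assumes the trainer.tuner API mentioned elsewhere in this thread.

```python
# Hypothetical sketch of the "throwaway trainer" approach: tuning runs in a
# Trainer that is discarded afterwards, so the user's Trainer never has its
# current_epoch / global_step mutated and nothing needs to be dumped/restored.
from pytorch_lightning import Trainer


def find_batch_size(model, max_trials=25, **trainer_kwargs):
    tmp_trainer = Trainer(logger=False, checkpoint_callback=False, **trainer_kwargs)
    new_size = tmp_trainer.tuner.scale_batch_size(model, max_trials=max_trials)
    del tmp_trainer  # drop the tuning trainer together with its mutated state
    return new_size
```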

rohitgr7 (Contributor) commented Oct 6, 2020:

@SkafteNicki I am not suggesting to rewrite anything. The reason we need to dump and restore is that these attributes are not re-initialized, and the current workflow is something like:

trainer.tuner.lr_find()
# suggest
trainer.fit()

What I am suggesting is:

trainer.tuner.lr_find()
# suggest
# and tell the user to reinitialize the trainer, LM, and LDM (if any) with the updated hparams from the suggestions.

> IMO moving forward we should refactor it so that the tuner creates a new trainer that is used during tuning and destroyed afterwards.

This is a good alternative, but it will only work for the Trainer and not for the LM and LDM, since some of their attributes might change during .fit().

Borda merged commit 39b3704 into Lightning-AI:master on Oct 6, 2020
Labels: bug (Something isn't working), ready (PRs ready to be merged)
Linked issue: auto_scale_batch_size won't reset current_epoch (#3260)
6 participants