
Start accumulate gradients schedule at epoch 0 #2490

Closed
wants to merge 2,642 commits into from

Conversation

HHousen
Contributor

@HHousen HHousen commented Jul 3, 2020

What does this PR do?

Fixes #2480. If the pl.Trainer option accumulate_grad_batches was an integer then the first epoch (epoch 0) would use accumulate_grad_batches=1 while the remaining epochs would use accumulate_grad_batches=<user_value>. This was caused by #2289.

Before: [screenshot: wrong]

After: [screenshot: correct]
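
In essence, when accumulate_grad_batches is an integer, the internal schedule has to start at epoch 0 instead of epoch 1. A minimal sketch of the idea, with a hypothetical helper name, not the literal diff in this PR:

def build_accumulation_schedule(accumulate_grad_batches):
    # Sketch only: illustrates the schedule change, not the actual Lightning code.
    if isinstance(accumulate_grad_batches, dict):
        return accumulate_grad_batches
    if isinstance(accumulate_grad_batches, int):
        # Before the fix: {1: accumulate_grad_batches} left epoch 0 undefined,
        # so the scheduler fell back to accumulating a single batch.
        # After the fix: start the schedule at epoch 0.
        return {0: accumulate_grad_batches}
    raise TypeError("accumulate_grad_batches must be an int or a dict")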

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

lezwon and others added 30 commits May 17, 2020 16:30
* added tpu_id

added tpu_id to mixins

* train on individual tpu

* parallel loader if tpu_id is None

* removed progress_bar_refresh_rate

* chlog

* replaced num_tpu_cores with tpu_cores

* set tpu_id to None if int

* changed num_tpu_cores to tpu_cores in docs

* updated docs

* updated __init__.py
removed self.tpu_id for ParallelLoader

* Update pytorch_lightning/trainer/__init__.py

* check if tpu_cores is a list

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* xla device conditional

* num_tpu_cores deprecation

* removed duplicate warning

* fixed pep8 error

* Revert "removed duplicate warning"

This reverts commit 8adb0a9

* deprecated api update

* fixed recursion error

* fixed tests

* fixed flake errors

* removed current_tpu_index

* Update CHANGELOG.md

* Update trainer.py

Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
* fix codecov

* upgrade codecov

* upgrade codecov
* Remove NaNs from loss in LRFinder

* np.isfinite

* chlog

* add test

* chlog

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
* New metric classes (#1326)

* Create metrics package

* Create metric.py

* Create utils.py

* Create __init__.py

* add tests for metric utils

* add docstrings for metrics utils

* add function to recursively apply other function to collection

* add tests for this function

* update test

* Update pytorch_lightning/metrics/metric.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* update metric name

* remove example docs

* fix tests

* add metric tests

* fix to tensor conversion

* fix apply to collection

* Update CHANGELOG.md

* Update pytorch_lightning/metrics/metric.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* remove tests from init

* add missing type annotations

* rename utils to convertors

* Create metrics.rst

* Update index.rst

* Update index.rst

* Update pytorch_lightning/metrics/convertors.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/metrics/convertors.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/metrics/convertors.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/metrics/metric.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/utilities/test_apply_to_collection.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/utilities/test_apply_to_collection.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Update tests/metrics/convertors.py

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Apply suggestions from code review

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* add doctest example

* rename file and fix imports

* added parametrized test

* replace lambda with inlined function

* rename apply_to_collection to apply_func

* Separated class description from init args

* Apply suggestions from code review

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* adjust random values

* suppress output when seeding

* remove gpu from doctest

* Add requested changes and add ellipsis for doctest

* forgot to push these files...

* add explicit check for dtype to convert to

* fix ddp tests

* remove explicit ddp destruction

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* move dtype device mixin to more general place

* refactor to general device dtype mixin

* add initial metric package description

* change default to none for mac os

* pep8

* fix import

* Update index.rst

* Update ci-testing.yml

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update CHANGELOG.md

* Update pytorch_lightning/metrics/converters.py

* readme

* Update metric.py

* Update pytorch_lightning/metrics/converters.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
* Allow dataloaders without sampler field present

Sometimes we have a custom dataloader that doesn't have a sampler, better to check that the field is there before reading it.

* chlog

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
* suppress epub warn

* fail on warn

* suppress epub warn

* fail on warn

* disable epub

* typo

* remove obsolete suppress
* set min PT 1.3

* circleCI

* mergify

* min

* chlog

* skip
* fix user error produced by apex + scheduler combination

* add changelog

* added reinit to every configure_apex call

* fix styling

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
* readme

* add governance.rst

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
* remove the need for hparams

* remove the need for hparams

* remove the need for hparams

* remove the need for hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* replace self.hparams

* fixed

* fixed

* fixed

* fixed

* fixed

* fixed

* fixed

* fixed

* fixed

* fixed

* fixed

* fixed

* fixed

* fixed

* finished moco

* basic

* testing

* todo

* recurse

* hparams

* persist

* hparams

* chlog

* tests

* tests

* tests

* tests

* tests

* tests

* review

* saving

* tests

* tests

* tests

* docs

* finished moco

* hparams

* review

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* hparams

* overwrite

* transform

* transform

* transform

* transform

* cleaning

* cleaning

* tests

* examples

* examples

* examples

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chp key

* tests

* Apply suggestions from code review

* class

* updated docs

* updated docs

* updated docs

* updated docs

* save

* wip

* fix

* flake8

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
* update logger imports

* pep8 fixes

* pep8
* tests, fix logger bug and prepare data bug

* add CHANGELOG.md

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
* saves model every epoch

* implement test for save_last

* Update CHANGELOG.md

* Update CHANGELOG.md

* changes test description

Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com>

Co-authored-by: Jeremy Jordan <13970565+jeremyjordan@users.noreply.github.com>
* wip protected progress bar settings

* remove callback attr from LRfinder

* whitespace

* changelog
* Fixes #490

`EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. 
In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`.

* Highlighted that ES callback runs on val epochs in docstring

* Updated EarlyStopping in rst doc

* Update early_stopping.py

* Update early_stopping.rst

* Update early_stopping.rst

* Update early_stopping.rst

* Update early_stopping.rst

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update docs/source/early_stopping.rst

* fix doctest indentation warning

* Train loop calls early_stop.on_validation_end

* chlog

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
* filter valid args

* error on unknown manual args

* added test

* changelog

* update docs and doctest

* simplify

* doctest

* doctest

* doctest

* better test with mock check for init call

* fstring

* extend test

* skip test on 3.6 not working

Co-authored-by: William Falcon <waf2107@columbia.edu>
* updated docs

* added mixed

* added mixed
* fix default

* formatting errors

* update

* flake8
* Removing unecessary early stopping calls

* Update CHANGELOG.md

Co-authored-by: Mateusz Pieniak <mateusz.pieniak@evidenceprime.com>
Co-authored-by: William Falcon <waf2107@columbia.edu>
* fixed undesired behaviour due to dict.fromkeys

* a test for log length consistency

* runtime-warn if no schedulers are configured

* chlog

* move

Co-authored-by: Jirka <jirka@pytorchlightning.ai>
* unify tests

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
@awaelchli
Contributor

Hi, thanks for the PR. The test that you changed is for LRFinder, it looks ok, but do you have an idea why the other test test_gradient_accumulation_scheduling in test_trainer.py did not fail? It seems like an important test but it is invariant to this change. It might be because this test has a very specific example in which this is not detected. I wonder if it could be parameterized to cover several cases.

Comment on lines 157 to 158
assert lrfinder._total_batch_idx == 100 * 2, \
'Accumulation parameter did not work'
Contributor

@awaelchli awaelchli Jul 3, 2020

Suggested change (keep the existing assertion and add one more):
assert lrfinder._total_batch_idx == 100 * 2, \
    'Accumulation parameter did not work'
assert trainer.total_batch_idx == 0

What do you think? Should we also add this assertion?

Contributor Author

Does this essentially ensure that the trainer never starts to train the model?

Contributor

Yes, since there is accumulation happening in both the trainer and the lr_finder, I would just include this assert for clarity.

@mergify mergify bot requested a review from a team July 3, 2020 23:46
@Borda
Member

Borda commented Jul 3, 2020

The tests/trainer/test_lr_finder.py::test_accumulation_and_early_stopping test failed because the _total_batch_idx assert value was changed in #2375 from 100 * 2 to 190. I think this change was a mistake. The _total_batch_idx changed to 190 because the first epoch had accumulate_grad_batches set to 1. Since this pull request fixes the accumulate_grad_batches bug for the first epoch, I have changed this test back to the previous value.

This test has a bug, it should be 200, please correct it :]
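
For reference, the expected value follows directly from the numbers in the assertion itself (assuming the 100 is the learning-rate finder's number of training steps):

num_training = 100           # lr_finder training steps implied by the assertion
accumulate_grad_batches = 2  # accumulation factor used by the test
# With accumulation applied from epoch 0, each finder step consumes 2 batches,
# so _total_batch_idx should end at 100 * 2 = 200 rather than 190.
assert num_training * accumulate_grad_batches == 200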

@Borda Borda added the "bug" (Something isn't working) label Jul 3, 2020
@mergify mergify bot requested a review from a team July 3, 2020 23:56
@Borda Borda added the "ready" (PRs ready to be merged) label Jul 3, 2020
@Borda Borda added this to the 0.8.x milestone Jul 3, 2020
@HHousen
Contributor Author

HHousen commented Jul 4, 2020

Hi, thanks for the PR. The test that you changed is for LRFinder, it looks ok, but do you have an idea why the other test test_gradient_accumulation_scheduling in test_trainer.py did not fail? It seems like an important test but it is invariant to this change. It might be because this test has a very specific example in which this is not detected. I wonder if it could be parameterized to cover several cases.

Prior to this pull request, the test_gradient_accumulation_scheduling test did not fail because its schedule is {1: 2, 3: 4}, which falls under the dict branch of the if...else block in configure_accumulated_gradients(). In the GradientAccumulationScheduler callback, the line scheduling.update({0: 1}) sets accumulate_grad_batches to 1 for the first epoch, and the test explicitly expects that via assert self.trainer.accumulate_grad_batches == 1, so it passes. No error was raised because epoch 0 is not defined in the {1: 2, 3: 4} schedule and because the test covers a dict, not an int. If an int had been tested (say 3) and the asserts updated to always expect 3, the test would have failed, because epoch zero would still have had accumulate_grad_batches=1: the line schedule = {1: accumulate_grad_batches} in configure_accumulated_gradients() does not define the first epoch, which scheduling.update({0: 1}) in GradientAccumulationScheduler then sets to 1.

After this pull request, the test_gradient_accumulation_scheduling test still does not fail, because I only modified the int branch of the if...else block in configure_accumulated_gradients(), while this test exercises a dict.

The test_gradient_accumulation_scheduling test should check an int as the value for accumulate_grad_batches. This addition to the test would have caught this error.
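
To make the interaction described above concrete, here is a minimal sketch of the pre-fix behavior. The names configure_accumulated_gradients, GradientAccumulationScheduler, and scheduling.update({0: 1}) come from the discussion, while the bodies and the accumulation_for_epoch helper are simplified assumptions, not the actual Lightning implementation:

class GradientAccumulationScheduler:
    def __init__(self, scheduling):
        # The callback inserts a default of 1 for epoch 0 when it is not given.
        if 0 not in scheduling:
            scheduling.update({0: 1})
        self.scheduling = scheduling

    def accumulation_for_epoch(self, epoch):
        # Use the latest scheduled epoch that is <= the current epoch.
        applicable = [e for e in sorted(self.scheduling) if e <= epoch]
        return self.scheduling[applicable[-1]]

def configure_accumulated_gradients(accumulate_grad_batches):
    if isinstance(accumulate_grad_batches, dict):
        schedule = accumulate_grad_batches
    else:
        # Buggy int branch: {1: accumulate_grad_batches} leaves epoch 0 undefined,
        # so the callback's default of 1 applies to the first epoch.
        schedule = {1: accumulate_grad_batches}
    return GradientAccumulationScheduler(schedule)

scheduler = configure_accumulated_gradients(3)
assert scheduler.accumulation_for_epoch(0) == 1  # the bug: epoch 0 accumulates 1 batch
assert scheduler.accumulation_for_epoch(1) == 3  # later epochs use the user value

With the dict schedule {1: 2, 3: 4} used by the existing test, the same epoch-0 fallback happens, but it is exactly what the test asserts, which is why the bug went unnoticed.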

@awaelchli
Copy link
Contributor

awaelchli commented Jul 4, 2020

@HHousen would you like to modify that test so that it would fail on master and pass here? Otherwise the error could sneak back in the future without us noticing :) I can help if you need

@HHousen
Contributor Author

HHousen commented Jul 4, 2020

@HHousen would you like to modify that test so that it would fail on master and pass here? Otherwise the error could sneak back in the future without us noticing :) I can help if you need

Sure. I'll parametrize the test so an int can be tested as well as a dict.

@HHousen
Contributor Author

HHousen commented Jul 4, 2020

In the else block of _optimizer_step in the current test, why is 3 added to self.prev_called_batch_idx? Shouldn't 4 be added, since accumulate_grad_batches is 4?

@pep8speaks

pep8speaks commented Jul 4, 2020

Hello @HHousen! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-05 00:05:47 UTC

@HHousen
Contributor Author

HHousen commented Jul 4, 2020

In the else block of _optimizer_step in the current test, why is 3 added to self.prev_called_batch_idx? Shouldn't 4 be added, since accumulate_grad_batches is 4?

Nevermind, it should be 4.

@HHousen
Contributor Author

HHousen commented Jul 4, 2020

I updated the test_trainer.py::test_gradient_accumulation_scheduling test. It had a few issues that prevented it from properly testing what it was designed to test: the number of batches per epoch was too low, and only two epochs were run even though some assertions were written for more than 3 epochs. I have updated and parametrized the test. On master, 2 of the 3 cases should now fail and 1 should succeed, whereas on this pull request all 3 should succeed.
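
A hedged sketch of what such a parametrized test could look like; the test name, the expected schedules, and the accumulation_scheduler attribute access are illustrative assumptions, not the exact assertions added in this pull request:

import pytest
from pytorch_lightning import Trainer

@pytest.mark.parametrize("accumulate_grad_batches, expected_schedule", [
    ({1: 2, 3: 4}, {0: 1, 1: 2, 3: 4}),  # dict schedule: epoch 0 defaults to 1
    (3, {0: 3}),                          # int: with this fix, starts at epoch 0
])
def test_gradient_accumulation_schedule(accumulate_grad_batches, expected_schedule):
    trainer = Trainer(accumulate_grad_batches=accumulate_grad_batches, max_epochs=4)
    assert trainer.accumulation_scheduler.scheduling == expected_schedule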

# for the test
trainer.optimizer_step = _optimizer_step
Contributor

ok, that explains a lot :) so the test was never really running the function?
Good that you saw this!

Contributor Author

Yep.

@mergify mergify bot requested a review from a team July 4, 2020 23:36
@HHousen HHousen closed this Jul 5, 2020
@williamFalcon
Contributor

@HHousen solid catch!!

@HHousen
Contributor Author

HHousen commented Jul 5, 2020

Sorry. I tried to remove the two dataset files that were downloaded when I ran the tests since I accidentally committed them. This did not work since force-pushing closed this pull request. I undid the force-push by reverting back to e0cb028, but I still can't reopen since GitHub says the branch was "force-pushed or recreated."

git reflog show bugfix/2480_grad-accumulate-0-epoch:

e0cb0285 (HEAD -> bugfix/2480_grad-accumulate-0-epoch, origin/bugfix/2480_grad-accumulate-0-epoch, refs/original/refs/heads/bugfix/2480_grad-accumulate-0-epoch) bugfix/2480_grad-accumulate-0-epoch@{0}: reset: moving to origin/bugfix/2480_grad-accumulate-0-epoch@{1}
c3b8cca4 bugfix/2480_grad-accumulate-0-epoch@{1}: filter-branch: rewrite
e0cb0285 (HEAD -> bugfix/2480_grad-accumulate-0-epoch, origin/bugfix/2480_grad-accumulate-0-epoch, refs/original/refs/heads/bugfix/2480_grad-accumulate-0-epoch) bugfix/2480_grad-accumulate-0-epoch@{2}: commit: Fix pep8 formatting
6420265c bugfix/2480_grad-accumulate-0-epoch@{3}: commit: Update test_trainer.py::test_gradient_accumulation_scheduling
f1f9b6f8 bugfix/2480_grad-accumulate-0-epoch@{4}: commit: Undo change in #2375
2e199c84 bugfix/2480_grad-accumulate-0-epoch@{5}: commit: Start accumulate gradients schedule at epoch 0
e5a97999 (origin/master, origin/HEAD, master) bugfix/2480_grad-accumulate-0-epoch@{6}: branch: Created from HEAD

It looks like everything should have been reverted (the GitHub compare looks good). Any suggestions? @awaelchli

@awaelchli
Contributor

awaelchli commented Jul 5, 2020

I suggest you simply submit a new PR. You can set the remote branch to a new name and push your commits there.

@Borda
Member

Borda commented Jul 5, 2020

Can you reset your branch to master and commit the change again... Then we can reopen this PR (it would be sad to lose all this valuable discussion and contribution):

git checkout bugfix/2480_grad-accumulate-0-epoch
git reset upstream/master
# ... re-add the changes ...
git commit ...
git push -f

When any change appears, simply reopen this PR.

@HHousen
Contributor Author

HHousen commented Jul 5, 2020

I have to open a new pull request since GitHub will not let me reopen this one even if the branch is at the state right before the request was automatically closed. Moved to #2513. It is strange though since it's the same exact branch in both requests.

@Borda
Member

Borda commented Jul 5, 2020

I have to open a new pull request since GitHub will not let me reopen this one even if the branch is at the state right before the request was automatically closed.

It was closed because the PR state was identical with the target branch... So if there is any difference, it allows you to reopen the PR...

@HHousen
Contributor Author

HHousen commented Jul 5, 2020

I have to open a new pull request since GitHub will not let me reopen this one even if the branch is at the state right before the request was automatically closed.

It was closed because the PR state was identical with the target branch... So if there is any difference, it allows you to reopen the PR...

I rewrote history by accident, which closed the PR because there were no commits in common with master. I then reset back to the commit directly before the rewrite, so everything was in the same state as before the rewrite. It still would not let me reopen after this. There are currently 5 additional commits on this branch, which #2513 recognizes but this PR will not. GitHub should let me reopen a PR if the branch to be merged is in the same state it was in directly before the closure. Since it would not let me do this, I added a commit to remove the accidentally committed files and opened #2513. Both PRs are for the same branch, which to my understanding should not be possible; the only reason it is possible is that this one was closed and is no longer tracking changes.

Essentially, this PR would not recognize that the state of the branch was reverted and thus would not let me reopen.

I did try resetting to master and redoing the commits, but that produced identical commits and resulted in the same state as doing a git reset HEAD@{1} after my mistaken push.

Labels: bug (Something isn't working), ready (PRs ready to be merged)
Development

Successfully merging this pull request may close these issues.

For versions >0.8.2 learning rate is zero for last epoch (potentially a logging bug)