
fix Misstep #2478

Closed
wants to merge 101 commits into from

Conversation

@ameliatqy commented Jul 3, 2020

What does this PR do?

Fixes #2455. This used to be #2475 but I messed up on rebasing and to be safe, created a new pull request. This is a Draft PR.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Yes, I am new to contributing to open-source and actually find it quite fun. May be time to consider a new hobby...

@ameliatqy ameliatqy mentioned this pull request Jul 3, 2020
@mergify mergify bot requested a review from a team July 3, 2020 00:20
@@ -504,6 +505,9 @@ def run_training_epoch(self):
# epoch end hook
self.run_on_epoch_end_hook(model)

# increment global step by one to progress to the next epoch
self.global_step += 1
@awaelchli (Member) commented Jul 3, 2020

I think you need to do self.increment_accumulated_grad_global_step() here, otherwise the total_train_batch_idx does not get updated. I think that's why the tests are failing.
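As a rough illustration of the suggested change (a minimal sketch, not the actual Trainer code; the helper's body here is an assumption based on this comment):

```python
# Sketch of the relevant part of run_training_epoch. The body of
# increment_accumulated_grad_global_step is assumed from the review
# comment: it advances total_train_batch_idx together with global_step.

class TrainerLoopSketch:
    def __init__(self):
        self.global_step = 0
        self.total_train_batch_idx = 0

    def increment_accumulated_grad_global_step(self):
        # Advance both counters together, so total_train_batch_idx
        # stays in sync with global_step (the point of this comment).
        self.global_step += 1
        self.total_train_batch_idx += 1

    def run_on_epoch_end_hook(self, model):
        pass  # placeholder for the real epoch-end hook

    def run_training_epoch_end(self, model=None):
        # epoch end hook
        self.run_on_epoch_end_hook(model)
        # instead of `self.global_step += 1`, call the helper:
        self.increment_accumulated_grad_global_step()

trainer = TrainerLoopSketch()
trainer.run_training_epoch_end()
print(trainer.global_step, trainer.total_train_batch_idx)  # 1 1
```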

@ameliatqy (Author) replied:

I made the change and it works as expected. So I have changed it.

However, the tests are still failing. Actually, I think I am confused about the test. I ran the local test on my pull request and got this total summary result at the end:
[screenshot: test summary from the local run on this branch]

I also ran the test over the original Pytorch Lightning master branch and also got this total summary result at the end:
[screenshot: test summary from the run on the master branch]

Am I making a mistake somewhere? Should I have based my branch off something other than the master branch?

@mergify mergify bot requested a review from a team July 3, 2020 06:14
Comment on lines +105 to +110
# check if dictionary keys are unique
agg_keys = set([key for met in self._metrics_to_agg for key in met.keys()])
num_keys = sum([len(met) for met in self._metrics_to_agg])

# exclude 'epoch' because it is a metric automatically added in by log_metrics and will count as a
# duplicate. If you want to get rid of this, I would suggest you should get rid of `scalar_metrics[
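Read as a self-contained sketch, the check in this hunk works as follows (the data and the 'epoch' exclusion are illustrative, based on the truncated comment above):

```python
# Keys are unique across all pending metric dicts iff the number of
# distinct keys equals the total key count. 'epoch' is excluded because
# log_metrics adds it to every dict, so it would always look duplicated.
# The metric values below are made up for illustration.

metrics_to_agg = [
    {"loss": 0.5, "epoch": 3},
    {"val_acc": 0.9, "epoch": 3},
]

agg_keys = {key for met in metrics_to_agg for key in met if key != "epoch"}
num_keys = sum(len(met) - int("epoch" in met) for met in metrics_to_agg)

keys_are_unique = len(agg_keys) == num_keys
print(keys_are_unique)  # True here: only 'epoch' repeats, and it is excluded
```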
A core team member commented:

Are the changes here really related to the global step issue?
If possible, could we make a separate PR for this? We only want to fix one thing at a time.

@ameliatqy (Author) replied:

I can see it being a separate issue, but this piece of code is necessary for the fix to work. Otherwise, when we fix the step issue (i.e. combining the last training batch, training_epoch_end, and validation_epoch_end metrics in the same step), merging the three sets of metrics fails, because the original merge_dict() in _reduce_agg_metrics() requires all metric dictionaries to have the same keys, which is usually not the case for these three.

I don't mind doing a separate pull request for this though - but it will probably result in most tests failing for this pull request, which we will probably have to make sure to fix in the new pull request.
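A minimal sketch of the failure mode described above (`merge_dict_strict` is a hypothetical stand-in for a reducer that assumes uniform keys, not Lightning's actual merge_dict; the metric names and values are made up):

```python
# Illustrates why reducing train-batch, training_epoch_end and
# validation_epoch_end metrics in one step can fail when the reducer
# assumes every dict carries the same keys.

def merge_dict_strict(dicts):
    """Reduce by averaging per key, assuming all dicts share keys."""
    keys = dicts[0].keys()
    return {k: sum(d[k] for d in dicts) / len(dicts) for k in keys}

batch_metrics = {"loss": 0.50, "epoch": 3}
train_epoch_end = {"train_acc": 0.91, "epoch": 3}
val_epoch_end = {"val_acc": 0.88, "epoch": 3}

try:
    merge_dict_strict([batch_metrics, train_epoch_end, val_epoch_end])
except KeyError as err:
    # Fails: 'loss' is missing from the two epoch-end dicts.
    print(f"merge failed on key {err}")
```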

@mergify mergify bot requested a review from a team July 3, 2020 06:17
@Borda Borda added the bug Something isn't working label Jul 3, 2020
@ameliatqy ameliatqy marked this pull request as ready for review July 9, 2020 23:46
mergify bot commented Jul 20, 2020

This pull request is now in conflict... :(

@Borda Borda added this to the 0.9.0 milestone Aug 6, 2020
@Borda Borda changed the title [Bugfix] Misstep fix Misstep Aug 11, 2020
@Borda (Member) commented Aug 11, 2020

@ameliatqy I have rebased it to master, can we finish it in this release?

pep8speaks commented Aug 11, 2020

Hello @ameliatqy! Thanks for updating this PR.

Line 551:13: E303 too many blank lines (2)

Comment last updated at 2020-08-18 02:14:57 UTC

codecov bot commented Aug 11, 2020

Codecov Report

Merging #2478 into master will decrease coverage by 0%.
The diff coverage is 79%.

@@          Coverage Diff           @@
##           master   #2478   +/-   ##
======================================
- Coverage      90%     90%   -0%     
======================================
  Files          81      81           
  Lines        7672    7689   +17     
======================================
+ Hits         6906    6918   +12     
- Misses        766     771    +5     

@ameliatqy (Author) commented:

> @ameliatqy I have rebased it to master, can we finish it in this release?

I have worked on the pull request and the checks have all passed. How do I get it approved? I see that I need three approving reviews - am I responsible for finding people to review it or will the reviewers be assigned by someone else?

@Borda Borda requested a review from awaelchli August 11, 2020 23:07
@awaelchli (Member) commented Aug 16, 2020

> @ameliatqy I have rebased it to master, can we finish it in this release?
>
> I have worked on the pull request and the checks have all passed. How do I get it approved? I see that I need three approving reviews - am I responsible for finding people to review it or will the reviewers be assigned by someone else?

Sorry for this late response.
Anyone in the community can review, but only the reviews of the core team count towards a merge decision.

I have two questions:

  1. Does the issue with global step still exist on master?
  2. Why do we need a decrement in the global_step? I would prefer a solution where this is not needed. Also, there was recently a fix in accumulate batches for the last batch (Fix accumulate_grad_batches for last batch, #2853), so I have the feeling it relates here.

mergify bot commented Aug 16, 2020

This pull request is now in conflict... :(

@williamFalcon (Contributor) commented:
please rebase master since this code is super old haha

@ameliatqy (Author) replied:

> please rebase master since this code is super old haha

Interesting, I thought I rebased my branch today XD I just rebased it again - does it look right? Rebasing is obviously not my strong point XD

mergify bot commented Aug 20, 2020

This pull request is now in conflict... :(

@edenlightning edenlightning modified the milestones: 0.9.0, 0.9.x Aug 20, 2020
@Borda (Member) commented Sep 18, 2020

@ameliatqy how about this one, mind rebase and resolve conflicts? 🐰

@ameliatqy (Author) commented Sep 23, 2020

> @ameliatqy how about this one, mind rebase and resolve conflicts? 🐰

Sure, I'd be happy to. But to fully resolve this pull request, could someone answer my question in #2478 (comment)? I can't finish up the pull request without an answer to that question.

@@ -73,8 +74,6 @@ def log_metrics(self, metrics, grad_norm_dic, step=None):
# log actual metrics
if self.is_global_zero and self.logger is not None:
self.logger.agg_and_log_metrics(scalar_metrics, step=step)
self.logger.save()
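For context, a toy sketch of the agg-and-log contract touched by this hunk (ToyLogger and its buffering behavior are assumptions for illustration, not the real Lightning logger API):

```python
# agg_and_log_metrics is expected to buffer metrics per step and only
# flush when the step advances; save() flushes whatever remains. This
# toy logger mimics that contract to show why removing an extra save()
# can matter: a premature flush would split metrics logged at one step.

class ToyLogger:
    def __init__(self):
        self._agg_step = None
        self._agg = []
        self.logged = []  # list of (step, merged_metrics)

    def agg_and_log_metrics(self, metrics, step):
        if self._agg_step is not None and step != self._agg_step:
            self._flush()
        self._agg_step = step
        self._agg.append(metrics)

    def _flush(self):
        merged = {}
        for m in self._agg:
            merged.update(m)
        self.logged.append((self._agg_step, merged))
        self._agg = []

    def save(self):
        self._flush()

logger = ToyLogger()
logger.agg_and_log_metrics({"loss": 0.5}, step=10)
logger.agg_and_log_metrics({"val_acc": 0.9}, step=10)  # same step: buffered
logger.agg_and_log_metrics({"loss": 0.4}, step=11)     # new step: flush 10
logger.save()
print(logger.logged)
# [(10, {'loss': 0.5, 'val_acc': 0.9}), (11, {'loss': 0.4})]
```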
A contributor commented:

How is this related to #3398 (comment)?
@SkafteNicki Maybe you know something about this also?

@edenlightning edenlightning modified the milestones: 0.9.x, 1.0 Oct 4, 2020
@awaelchli (Member) commented:

This is now fixed on master. There is no offset anymore and training epoch end logs at the correct step.

@awaelchli awaelchli closed this Oct 4, 2020
@Borda Borda modified the milestones: 1.0, 0.10.0 Oct 7, 2020
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

training_epoch_end log output gets combined with next epoch training
7 participants