
Callback Metric Dict getting overwritten by Log and Progress Bar Dict #1800

Conversation

@olineumann (Contributor) commented May 12, 2020

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)

  • Did you read the contributor guideline, Pull Request section?

  • Did you make sure to update the docs?

  • Did you write any new necessary tests?

  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1727.

Dict values passed to the progress bar or log overwrite callback values. See the example in the issue.

There are several options to solve it. This PR simply stops adding progress bar and log values to the callback dict. Tests pass on my machine.

But this will affect user code, e.g. when a log metric was used as the early stopping metric.

PR review

Opinions, other solutions, recommendations, ... are welcome! Help with updating the docs is also appreciated.

Did you have fun?

🙃

mergify bot requested a review from a team May 12, 2020 14:31

mergify bot commented May 12, 2020

This pull request is now in conflict... :(

@awaelchli (Member) commented:

@olineumann Thanks for the PR. Is it correct that the bug only exists for training_epoch_end, not for the valid/test_epoch_end? In that case, could you check that your change brings it in line with validation_step/epoch_end?

But this will affect user code, e.g. when a log metric was used as the early stopping metric.

I think the consensus is that we want to do early stopping only on validation metrics, and no longer on training metrics as is currently the case. #1458 is dealing with this.

@awaelchli added the "bug" (Something isn't working) label May 12, 2020
@olineumann (Contributor, Author) commented May 12, 2020

@awaelchli No, it affects train, validation, and test epoch end. See my changes to validation_epoch_end of the base model in the tests.

@@ -43,5 +43,7 @@ def _mean(res, key):
     val_acc_mean /= len(outputs)

     metrics_dict = {'val_loss': val_loss_mean.item(), 'val_acc': val_acc_mean.item()}
-    results = {'progress_bar': metrics_dict, 'log': metrics_dict}
-    return results
+    result = metrics_dict.copy()
Review comment (Member):

why the copy here?

Reply from olineumann (Contributor, Author):

Without copying the metric dict, result and the metric dict reference the same object, so adding the metric dict to result['progress_bar'] would also change metric_dict. Then, when adding the metric dict to result['log'], result['log']['progress_bar'] would exist and cause test errors on my machine.

First I reused the metric_dict by

metric_dict['progress_bar'] = metric_dict
metric_dict['log'] = metric_dict
return metric_dict

But this is wrong and leads to the same error.
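
For illustration, a minimal standalone sketch of the aliasing problem described above and how copying avoids it (illustrative values only, not code from this PR):

# Wrong: result and metrics_dict are the same object, so the dict ends up
# holding a reference to itself under 'progress_bar' and 'log'.
metrics_dict = {'val_loss': 0.25, 'val_acc': 0.9}
result = metrics_dict
result['progress_bar'] = metrics_dict
result['log'] = metrics_dict
assert result['log'] is result              # self-referencing dict

# Correct: copy first, then attach the original metrics under the sub-keys.
metrics_dict = {'val_loss': 0.25, 'val_acc': 0.9}
result = metrics_dict.copy()
result['progress_bar'] = metrics_dict
result['log'] = metrics_dict
assert 'progress_bar' not in result['log']  # no unexpected nested keys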

mergify bot requested a review from a team May 12, 2020 21:12
@@ -168,10 +168,6 @@ def process_output(self, output, train=False):
# ---------------
hiddens = output.get('hiddens')

# use every metric passed in as a candidate for callback
callback_metrics.update(progress_bar_metrics)
Review comment (Contributor):

Why do we need to remove this? Without it, log metrics and progress bar metrics won't be candidates for the callbacks.

Reply from olineumann (Contributor, Author):

In #1727, @kessido had the issue that a progress bar or log metric overwrites the callback metric in the top-level dict. An example was also given by @kessido, see COLAB.

I don't know if this needs to be fixed; that's why I asked for more opinions in the issue. Only @awaelchli responded and said he thinks this should also be fixed.

Because no one had started a PR, I did, to initiate a discussion. I have several ideas on how this could be fixed and mentioned some in the issue above, but this was the easiest and quickest solution. I didn't want to spend too much effort on a solution that might then be discarded.
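
For readers following along, a minimal standalone sketch of the overwrite reported in #1727 (hypothetical key and values, not code from the PR or the issue):

# A step returns the same key at the top level (intended callback value) and
# inside 'progress_bar'; the unconditional update() lets the nested value win.
output = {
    'some_metric': 1.0,                      # intended callback value
    'progress_bar': {'some_metric': 2.0},    # value meant only for the progress bar
}
callback_metrics = {'some_metric': output['some_metric']}
callback_metrics.update(output['progress_bar'])  # old behaviour
print(callback_metrics['some_metric'])           # 2.0 -- callback value overwritten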

Review comment:

@williamFalcon @olineumann In the current update, when that line is removed and we use the Result object, we cannot save model checkpoints with a filename like {val_loss}; it results in epoch=1-val_loss=0, because val_loss cannot be resolved since the filename parameters are based on callback_metrics. Is there another way to assign callback_metrics when using a Result/TrainResult/EvalResult object?
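
To make that failure mode concrete, a rough plain-Python sketch (not the actual ModelCheckpoint code) of how a filename template falls back to 0 when 'val_loss' is missing from callback_metrics:

template = 'epoch={epoch}-val_loss={val_loss}'
callback_metrics = {'epoch': 1}  # 'val_loss' never made it into callback_metrics
filename = template.format(
    epoch=callback_metrics.get('epoch', 0),
    val_loss=callback_metrics.get('val_loss', 0),
)
print(filename)  # epoch=1-val_loss=0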

mergify bot requested a review from a team May 17, 2020 13:04

@Borda force-pushed the issue/callback_metric_overwritten branch from 4e24924 to dde55a8 on May 26, 2020 17:32
codecov bot commented May 26, 2020

Codecov Report

Merging #1800 into master will decrease coverage by 3%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master   #1800     +/-   ##
========================================
- Coverage      89%     86%     -3%     
========================================
  Files          79      78      -1     
  Lines        7302    4919   -2383     
========================================
- Hits         6514    4231   -2283     
+ Misses        788     688    -100     


@Borda added the "waiting on author" (Waiting on user action, correction, or update) label Jun 8, 2020
@Borda (Member) commented Jun 11, 2020

@olineumann mind checking the last comments? It would be great to get this done 🐰

@olineumann (Contributor, Author) commented:

@olineumann mind checking the last comments? It would be great to get this done 🐰

Hey Borda,

thanks for replying.

I responded to the last comments in the code reviews. I'm still not sure what the best way to solve the problem would be, because the current fix would affect many users, which I think would lead to many issues from users complaining that their logging or early stopping no longer works.

I could implement it so that metric values from the progress bar or logging are only written to the top-level dict if the key doesn't exist already. That wouldn't affect as many users, I think.

I had hoped there would be more opinions on this. I could implement the solution above, rebase, and push so it can be merged.

@olineumann force-pushed the issue/callback_metric_overwritten branch from dde55a8 to 1c6bfaa on June 11, 2020 10:40
pep8speaks commented Jun 11, 2020

Hello @olineumann! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-06 12:01:29 UTC

@olineumann (Contributor, Author) commented:

@Borda I just rebased onto master, implemented and pushed the solution, and all tests pass 🍻.

Now the logging and progress bar metric values are only written to the top-level callback metric dict if the key doesn't already exist. The logging values are written first, then the progress bar values (so logging metric values have higher priority if both contain the same key). This shouldn't affect other users' code as long as they don't use the same key in different metric dicts.
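
A rough sketch of the merge order described above (merge_callback_metrics is a hypothetical helper for illustration, not the actual PR diff):

def merge_callback_metrics(callback_metrics, log_metrics, progress_bar_metrics):
    # Existing callback metrics win; then log metrics; then progress bar metrics.
    for source in (log_metrics, progress_bar_metrics):
        for key, value in source.items():
            callback_metrics.setdefault(key, value)  # only fill in missing keys
    return callback_metrics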

@Borda (Member) commented Jun 11, 2020

Just thinking that it may also be solved by #1989, what do you think?

@olineumann (Contributor, Author) commented Jun 11, 2020

Just thinking that it may also be solved by #1989, what do you think?

I hadn't seen that PR before. Currently I don't have much time to follow pytorch_lightning... but it looks like a nice new feature!

I think that when the new way of passing a Result() object is used, the problem is already solved. But that PR isn't done yet, so the old way will still be used and, as far as I understand, should still be supported. So I think this PR could be merged into master to fix #1727 (which wouldn't be fixed by #1989 unless the user switches to the new result object).


@Borda (Member) commented Aug 6, 2020

@olineumann how is it going? Can we finish it soon?

@Borda force-pushed the issue/callback_metric_overwritten branch from f14fb59 to a1c0da6 on August 6, 2020 12:01
@williamFalcon (Contributor) commented:

this was solved in the structured results refactors

Labels: bug (Something isn't working), waiting on author (Waiting on user action, correction, or update)

Successfully merging this pull request may close these issues.

Progress bar \ log dict items added to outputs in training_epoch_end
6 participants