
Collect all the training_step outputs for training epoch end #2354

Conversation


@mmiakashs mmiakashs commented Jun 25, 2020

Fixes #2320
Bug fixes for #2320 after the #2328 commits

@@ -690,7 +694,7 @@ def run_training_batch(self, batch, batch_idx):
     signal=0,
     grad_norm_dic=grad_norm_dic,
     batch_log_metrics=batch_log_metrics,
-    training_step_output_for_epoch_end=opt_closure_result.training_step_output_for_epoch_end
+    training_step_output_for_epoch_end=all_training_step_output_for_epoch_end
Contributor
I believe the function that calls this method also needs to be fixed to take the list of outputs into account.

Author
> training_step_output_for_epoch_end=opt_closure_result.training_step_output_for_epoch_end

This line only captures the outputs of the last iteration and discards those of the previous iterations.
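
A minimal, self-contained sketch of the change being described here; run_training_batch below is a stand-in toy function with made-up inputs, not Lightning's actual trainer method:

# Toy sketch only -- not the real Lightning trainer code.
from types import SimpleNamespace


def run_training_batch(split_batches):
    """Collect the training_step output of every split instead of
    keeping only the last closure result."""
    all_training_step_output_for_epoch_end = []

    for split_batch in split_batches:
        # stand-in for the optimizer closure that runs training_step
        opt_closure_result = SimpleNamespace(
            training_step_output_for_epoch_end={"loss": split_batch}
        )
        # append every output; the old code overwrote this value each iteration
        all_training_step_output_for_epoch_end.append(
            opt_closure_result.training_step_output_for_epoch_end
        )

    return SimpleNamespace(
        signal=0,
        training_step_output_for_epoch_end=all_training_step_output_for_epoch_end,
    )


result = run_training_batch([0.9, 0.5, 0.3])
print(result.training_step_output_for_epoch_end)  # three outputs, not just the last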

@@ -690,7 +694,7 @@ def run_training_batch(self, batch, batch_idx):
     signal=0,
     grad_norm_dic=grad_norm_dic,
     batch_log_metrics=batch_log_metrics,
-    training_step_output_for_epoch_end=opt_closure_result.training_step_output_for_epoch_end
+    training_step_output_for_epoch_end=training_step_output_for_epoch_end_list
Contributor

This is now a list of outputs, so we need to make sure the method which called this processes it accordingly.

Author

sounds good 😄
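
For reference, a hedged sketch of what processing the list in the calling method could look like; per_batch_outputs and epoch_outputs are illustrative names, not Lightning's real epoch-loop variables:

# Illustrative only -- not Lightning's actual epoch loop.
# Each batch now yields a list of per-split outputs instead of a single dict.
per_batch_outputs = [
    [{"loss": 0.9}, {"loss": 0.8}],  # batch 0: two tbptt splits
    [{"loss": 0.7}],                 # batch 1: a single split
]

epoch_outputs = []
for batch_output in per_batch_outputs:
    # keep the per-batch grouping; use extend() instead to flatten all splits
    epoch_outputs.append(batch_output)

# epoch_outputs is what a user-defined training_epoch_end(outputs) would then receive
print(epoch_outputs)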

@williamFalcon
Contributor

Wait... I was thinking about this. I still think this PR is not correct.

We don't want the output from EVERY tbptt split... only the last one, which is what this code does today.

@mmiakashs
Author

> Wait... I was thinking about this. I still think this PR is not correct.
>
> We don't want the output from EVERY tbptt split... only the last one, which is what this code does today.

In that case we will miss some outputs, which will lead to incorrect metric calculations at the end of the epoch, won't it?
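
A toy example with assumed numbers to illustrate the concern: if only the last tbptt split's output is kept, an epoch-level metric such as the mean loss is computed over a different set of values than the user intended:

# Assumed numbers, for illustration only.
split_losses = [1.2, 0.9, 0.3]  # losses from three tbptt splits of one batch

last_only = split_losses[-1]                      # what the current code keeps
mean_all = sum(split_losses) / len(split_losses)  # possible once all outputs are collected

print(last_only)  # 0.3
print(mean_all)   # ~0.8 -- a different, more representative epoch metric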

@Borda added the bug (Something isn't working) and feature (Is an improvement or enhancement) labels on Jun 25, 2020
@Borda
Member

Borda commented Jul 1, 2020

How is it going here?

@mmiakashs
Author

> How is it going here?

I haven't been able to figure out why the tests are failing. However, I am using this PR's changes on my local machine and they are working perfectly.

@mergify
Contributor

mergify bot commented Jul 20, 2020

This pull request is now in conflict... :(

@mmiakashs
Author

@Borda and @williamFalcon, did you get a chance to look into this issue? I have been using this PR's version and it works properly on my end. I couldn't find the cause of the merge conflict.

@Borda
Member

Borda commented Jul 23, 2020

Is the issue still present, given that #2320 is closed?

@williamFalcon
Contributor

Finished in #2890

Labels
bug (Something isn't working), feature (Is an improvement or enhancement)
Successfully merging this pull request may close these issues.

Partially missing training_step outputs in training_epoch_end
3 participants