
Collect all the training_step outputs for training epoch end #2354

Conversation


@mmiakashs mmiakashs commented Jun 25, 2020

Fixes #2320
Bug fixes for #2320 after the #2328 commits

@@ -690,7 +694,7 @@ def run_training_batch(self, batch, batch_idx):
     signal=0,
     grad_norm_dic=grad_norm_dic,
     batch_log_metrics=batch_log_metrics,
-    training_step_output_for_epoch_end=opt_closure_result.training_step_output_for_epoch_end
+    training_step_output_for_epoch_end=all_training_step_output_for_epoch_end
Contributor
I believe the function that calls this method also needs to be fixed to take the list of outputs into account.

Author
> training_step_output_for_epoch_end=opt_closure_result.training_step_output_for_epoch_end

This line only captures the outputs of the last iteration and discards those of the previous iterations.
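
A minimal, self-contained sketch of the change being described here; run_training_batch below is a stand-in toy function with made-up inputs, not Lightning's actual trainer method:

# Toy sketch only -- not the real Lightning trainer code.
from types import SimpleNamespace


def run_training_batch(split_batches):
    """Collect the training_step output of every split instead of
    keeping only the last closure result."""
    all_training_step_output_for_epoch_end = []

    for split_batch in split_batches:
        # stand-in for the optimizer closure that runs training_step
        opt_closure_result = SimpleNamespace(
            training_step_output_for_epoch_end={"loss": split_batch}
        )
        # append every output; the old code overwrote this value each iteration
        all_training_step_output_for_epoch_end.append(
            opt_closure_result.training_step_output_for_epoch_end
        )

    return SimpleNamespace(
        signal=0,
        training_step_output_for_epoch_end=all_training_step_output_for_epoch_end,
    )


result = run_training_batch([0.9, 0.5, 0.3])
print(result.training_step_output_for_epoch_end)  # three outputs, not just the last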

@@ -690,7 +694,7 @@ def run_training_batch(self, batch, batch_idx):
     signal=0,
     grad_norm_dic=grad_norm_dic,
     batch_log_metrics=batch_log_metrics,
-    training_step_output_for_epoch_end=opt_closure_result.training_step_output_for_epoch_end
+    training_step_output_for_epoch_end=training_step_output_for_epoch_end_list
Contributor

This is now a list of outputs, so we need to make sure the method which called this processes it accordingly.

Author

sounds good 😄
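
For reference, a hedged sketch of what processing the list in the calling method could look like; per_batch_outputs and epoch_outputs are illustrative names, not Lightning's real epoch-loop variables:

# Illustrative only -- not Lightning's actual epoch loop.
# Each batch now yields a list of per-split outputs instead of a single dict.
per_batch_outputs = [
    [{"loss": 0.9}, {"loss": 0.8}],  # batch 0: two tbptt splits
    [{"loss": 0.7}],                 # batch 1: a single split
]

epoch_outputs = []
for batch_output in per_batch_outputs:
    # keep the per-batch grouping; use extend() instead to flatten all splits
    epoch_outputs.append(batch_output)

# epoch_outputs is what a user-defined training_epoch_end(outputs) would then receive
print(epoch_outputs)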

@williamFalcon
Contributor

Wait... I was thinking about this. I still think this PR is not correct.

We don't want the output from EVERY tbptt split... only the last one, which is what this code does today.

@mmiakashs
Author

> Wait... I was thinking about this. I still think this PR is not correct.
>
> We don't want the output from EVERY tbptt split... only the last one, which is what this code does today.

In that case we will miss some outputs, which will lead to incorrect metric calculations at the end of the epoch, won't it?
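
A toy example with assumed numbers to illustrate the concern: if only the last tbptt split's output is kept, an epoch-level metric such as the mean loss is computed over a different set of values than the user intended:

# Assumed numbers, for illustration only.
split_losses = [1.2, 0.9, 0.3]  # losses from three tbptt splits of one batch

last_only = split_losses[-1]                      # what the current code keeps
mean_all = sum(split_losses) / len(split_losses)  # possible once all outputs are collected

print(last_only)  # 0.3
print(mean_all)   # ~0.8 -- a different, more representative epoch metric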

@Borda added the bug (Something isn't working) and feature (Is an improvement or enhancement) labels on Jun 25, 2020
@Borda
Member

Borda commented Jul 1, 2020

How is it going here?

@mmiakashs
Author

> How is it going here?

I haven't been able to figure out why the tests are failing. However, I am using this PR's changes on my local machine and they are working perfectly.

@mergify
Contributor

mergify bot commented Jul 20, 2020

This pull request is now in conflict... :(

@mmiakashs
Author

@Borda and @williamFalcon, did you get a chance to look into this issue? I have been using this PR's version and it works properly on my end. I couldn't find the cause of the merge conflict.

@Borda
Member

Borda commented Jul 23, 2020

Is the issue still present, given that #2320 is closed?

@williamFalcon
Contributor

Finished in #2890

Labels
bug (Something isn't working), feature (Is an improvement or enhancement)
Successfully merging this pull request may close these issues.

Partially missing training_step outputs in training_epoch_end
3 participants