
Partially missing training_step outputs in training_epoch_end #2320

Closed
mmiakashs opened this issue Jun 22, 2020 · 4 comments · Fixed by #2328 or #2890
Labels
bug Something isn't working question Further information is requested

Comments


mmiakashs commented Jun 22, 2020

Learning model: the model consists of two branches, teacher and student. Two losses and two optimizers are used to train the two branches.

ISSUE: In training_step, I produce two sets of metric outputs, one for the teacher branch (optimizer_idx=0) and one for the student branch (optimizer_idx=1): teacher: {loss, log: {acc, f1}} and student: {loss, log: {acc, f1, precision}}. But in training_epoch_end, I only get the combined outputs for the student branch (optimizer_idx=1); all the teacher outputs (optimizer_idx=0) are missing.

I also looked at the PL training loop and didn't observe any issue where it combines the training_step outputs. I am not sure what I am missing.

  • Version: 0.8.2.dev0
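The setup described above can be sketched as follows. This is a minimal, torch-free schematic with hypothetical metric values; in a real LightningModule, training_step receives the actual batch and returns tensors:

```python
# Schematic of the multi-optimizer training_step described in the issue.
# Metric values are hypothetical placeholders, not real results.
def training_step(batch, batch_idx, optimizer_idx):
    if optimizer_idx == 0:  # teacher branch
        return {"loss": 0.5, "log": {"acc": 0.9, "f1": 0.88}}
    else:  # optimizer_idx == 1, student branch
        return {"loss": 0.7, "log": {"acc": 0.85, "f1": 0.8, "precision": 0.82}}

teacher_out = training_step(None, 0, optimizer_idx=0)
student_out = training_step(None, 0, optimizer_idx=1)
```

Per the report, only dicts shaped like `student_out` reach training_epoch_end; the `teacher_out`-shaped dicts are lost.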
@mmiakashs mmiakashs added the question Further information is requested label Jun 22, 2020
mmiakashs (Author)

I dug into the PL training loop and may have found the issue. In the run_training_batch method in training_loop.py, see the following two lines:

Line 639: loss, batch_output = optimizer_closure()
and
Line 691: return 0, grad_norm_dic, all_log_metrics, batch_output

It seems to me that only the batch_output from the last split_batch and last opt_idx iteration is returned. Shouldn't we collect the batch_output from every split batch and every optimizer iteration, just as all_log_metrics does?
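The suspected behaviour can be illustrated with a torch-free sketch of the inner loops (a hypothetical simplification, not Lightning's actual code): batch_output is overwritten on every (split, optimizer) iteration, while all_log_metrics accumulates across them.

```python
# Sketch of the suspected bug: batch_output keeps only the result of the
# last (split, optimizer) iteration, while all_log_metrics collects everything.
def run_training_batch(splits, optimizer_indices, step_fn):
    all_log_metrics = []
    batch_output = None
    for split in splits:
        for opt_idx in optimizer_indices:
            batch_output = step_fn(split, opt_idx)   # overwritten each iteration
            all_log_metrics.append(batch_output["log"])
    return all_log_metrics, batch_output             # only the last output survives

step = lambda split, opt_idx: {"loss": opt_idx, "log": {"opt": opt_idx}}
logs, last = run_training_batch(["split0"], [0, 1], step)
# logs has one entry per optimizer, but `last` is only the opt_idx=1 output
```

With two optimizers, this matches the symptom: the optimizer_idx=0 (teacher) output is silently discarded.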

Moreover, in the run_training_epoch method, it seems that only that last-iteration batch_output is passed on to training_epoch_end; see the following lines in training_loop.py:

Line 459: batch_result, grad_norm_dic, batch_step_metrics, batch_output = _outputs
Line 464: outputs.append(batch_output)
Line 526: epoch_output = model.training_epoch_end(outputs)
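If every per-optimizer output were collected as suggested, each element of `outputs` would be a list of per-optimizer dicts, and training_epoch_end could recover both branches. A schematic with hypothetical values and shapes:

```python
# Schematic: what `outputs` would look like if every (split, optimizer)
# result were kept, and how training_epoch_end could separate the branches.
outputs = [
    [  # batch 0: one dict per optimizer_idx
        {"loss": 0.5, "log": {"acc": 0.9, "f1": 0.88}},                      # teacher
        {"loss": 0.7, "log": {"acc": 0.85, "f1": 0.80, "precision": 0.82}},  # student
    ],
    [  # batch 1
        {"loss": 0.4, "log": {"acc": 0.92, "f1": 0.90}},
        {"loss": 0.6, "log": {"acc": 0.87, "f1": 0.83, "precision": 0.84}},
    ],
]

def training_epoch_end(outputs):
    # Average each branch's loss across batches.
    teacher_loss = sum(batch[0]["loss"] for batch in outputs) / len(outputs)
    student_loss = sum(batch[1]["loss"] for batch in outputs) / len(outputs)
    return {"teacher_loss": teacher_loss, "student_loss": student_loss}

result = training_epoch_end(outputs)
```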

@williamFalcon @Borda could you please take a look at this?

mmiakashs (Author) commented Jun 22, 2020


I debugged a bit more, and I think the problem is that with multiple optimizers, only the last iteration's (split-batch and optimizer iteration) output is passed on. If you can confirm this is the problem, I can fix it and open a pull request. Thanks.

williamFalcon (Contributor)

@mmiakashs can you check that this works for you now on master?

mmiakashs (Author)

@williamFalcon Thanks for the update. Sorry, I missed that notification. I will check this after my deadline next week :)
Just a follow-up question: do we need to turn on any flag for this, or does it merge and give the combined results in the epoch_end results object by default?
