Checkpoint callback not called when training GAN from documentation #969

Closed
PasqualeZingo opened this issue Feb 27, 2020 · 8 comments
Labels
question Further information is requested

Comments


PasqualeZingo commented Feb 27, 2020

I tried to add a model checkpoint callback to the GAN example from the documentation (https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/domain_templates/gan.py), but it fails silently: no checkpoints are written. I can't find anything in the checkpoint docs that indicates why it isn't working.

What's your environment?

  • OS: Linux
  • Packaging: Conda
  • Version: 0.6, PyTorch 1.4
@github-actions
Contributor

Hey, thanks for your contribution! Great first issue!

@jeremyjordan
Contributor

@PasqualeZingo could you reproduce the error in a Colab notebook and share the link?

@PasqualeZingo
Author

@jeremyjordan
Contributor

Ah ok I think I know what's going on here.

If you look at the checkpointing callback, it's invoked with on_validation_end.

However, because you don't have a validation_step defined, the trainer is skipping validation. As a result, the callback to save the model checkpoints is never actually called.
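
The mechanism described above can be illustrated with a toy trainer. This is a minimal sketch, not PyTorch Lightning's actual code: `ToyTrainer`, `ToyCheckpoint`, `GanLikeModel`, `ModelWithValidation`, and `count_saves` are hypothetical names, and the only behavior taken from this thread is that the checkpoint callback hangs off `on_validation_end` and that validation is skipped when no `validation_step` is defined.

```python
class ToyCheckpoint:
    """Stand-in for a checkpoint callback; counts saves instead of writing files."""

    def __init__(self):
        self.saves = 0

    def on_validation_end(self):
        self.saves += 1


class ToyTrainer:
    """Hypothetical trainer: runs validation only if the model defines validation_step."""

    def __init__(self, callback, max_epochs=3):
        self.callback = callback
        self.max_epochs = max_epochs

    def fit(self, model):
        for _ in range(self.max_epochs):
            model.training_step()
            # Validation, and with it the end-of-validation hook, is skipped
            # when the model has no validation_step.
            if callable(getattr(model, "validation_step", None)):
                model.validation_step()
                self.callback.on_validation_end()


class GanLikeModel:
    """Like the GAN example in this issue: training only, no validation_step."""

    def training_step(self):
        pass


class ModelWithValidation(GanLikeModel):
    """Same model with a no-op validation_step added."""

    def validation_step(self):
        pass


def count_saves(model, max_epochs=3):
    callback = ToyCheckpoint()
    ToyTrainer(callback, max_epochs=max_epochs).fit(model)
    return callback.saves


print(count_saves(GanLikeModel()))         # 0
print(count_saves(ModelWithValidation()))  # 3
```

In this toy, a no-op `validation_step` is enough to make the hook fire once per epoch; without it the callback is never reached and nothing reports an error, which matches the silent failure reported here.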

@williamFalcon
Contributor

Should we enable the checkpoint callback to use any metric from either the training or the validation step?

@PasqualeZingo
Author

That would be ideal. In the meantime, I can fool the validation loop into running for my use case. Would it be worth updating the docs in the meantime to clarify when the checkpoint callback runs?

@jeremyjordan
Contributor

Yes, we should be running this in on_epoch_end, after both training and validation are complete.
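
As a rough illustration of that suggestion, the sketch below moves the save to an end-of-epoch hook so it no longer depends on validation having run. It is a toy under stated assumptions, not the library's implementation: `EpochEndCheckpoint`, `TrainOnlyModel`, and `fit` are hypothetical names, and the thread does not say how the eventual change was actually implemented.

```python
class EpochEndCheckpoint:
    """Hypothetical callback that records the epochs at which it would save."""

    def __init__(self):
        self.saved_epochs = []

    def on_epoch_end(self, epoch):
        self.saved_epochs.append(epoch)


class TrainOnlyModel:
    """Model with a training step only, as in the GAN example."""

    def training_step(self):
        pass


def fit(model, callback, max_epochs=3):
    """Toy training loop: validation is optional, the epoch-end hook is not."""
    for epoch in range(max_epochs):
        model.training_step()
        if callable(getattr(model, "validation_step", None)):
            model.validation_step()
        # Fires once per epoch whether or not validation ran.
        callback.on_epoch_end(epoch)
    return callback.saved_epochs


print(fit(TrainOnlyModel(), EpochEndCheckpoint()))  # [0, 1, 2]
```

Because the hook sits after both the training and the optional validation phase, a model without a `validation_step` still triggers a save every epoch.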

@williamFalcon
Contributor

merged #1043
