
Add useful errors when model is not configured correctly #1199

Merged: 21 commits merged into Lightning-AI:master on Apr 2, 2020

Conversation

@SkafteNicki (Member) commented Mar 20, 2020

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1171 and #508

This PR adds checks that ensure the user's model is correctly configured before training. This includes:

  • Checking that training_step has been overridden
  • Checking that training_dataloader has been overridden
  • Warning if configure_optimizers has not been overridden, telling the user that the program is using the default optimizer (Adam with lr=0.0001)
  • Error if a validation_dataloader is overridden but no validation_step is defined (and vice versa)
  • Error if a test_dataloader is overridden but no test_step is defined (and vice versa)
  • Warning if validation_dataloader and validation_step are overridden but validation_epoch_end is not
  • Warning if test_dataloader and test_step are overridden but test_epoch_end is not

The most fundamental change is the requirement of validation_step and test_step when their respective dataloaders are defined. This will probably not be backward compatible with some users' code.
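For illustration, here is a minimal, self-contained sketch of how such override checks can be implemented. Everything below (is_overridden, check_model_configuration, the exception class, and the method names) is a hypothetical reconstruction, not the exact code merged in this PR:

```python
import warnings


class MisconfigurationException(Exception):
    """Raised when the model is not configured for use with the Trainer."""


def is_overridden(method_name, model, base_class):
    # A method counts as overridden when the model's class resolves the
    # attribute to a different function object than the base class does.
    return getattr(type(model), method_name, None) is not getattr(base_class, method_name, None)


def check_model_configuration(model, base_class):
    # Hard requirement: training cannot run without a training_step.
    if not is_overridden('training_step', model, base_class):
        raise MisconfigurationException(
            'No `training_step()` defined; the Trainer cannot train without one.')
    # Paired requirement: a validation dataloader without a validation_step
    # is an error (and vice versa).
    if is_overridden('validation_dataloader', model, base_class) and \
            not is_overridden('validation_step', model, base_class):
        raise MisconfigurationException(
            'You defined `validation_dataloader` but no `validation_step`.')
    # Soft requirement: a missing configure_optimizers only warns.
    if not is_overridden('configure_optimizers', model, base_class):
        warnings.warn('`configure_optimizers` not overridden; '
                      'falling back to the default optimizer.')
```

The identity comparison works because a subclass that does not override a method resolves the attribute to the very same function object as the base class.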

@pep8speaks commented Mar 20, 2020

Hello @SkafteNicki! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-02 14:38:44 UTC

@SkafteNicki SkafteNicki marked this pull request as ready for review March 23, 2020 14:03
@SkafteNicki SkafteNicki changed the title add check_model_configuration method Add useful errors when model is not configured correctly Mar 24, 2020
@mergify[bot] (Contributor) commented Mar 24, 2020

This pull request is now in conflict... :(

@Borda Borda added the feature Is an improvement or enhancement label Mar 25, 2020
@Borda Borda added this to the 0.7.2 milestone Mar 25, 2020
@Borda Borda requested review from tullie and a team March 25, 2020 08:52
@Borda (Member) left a comment

I like this warning/check grouping VERY MUCH, just please have a look at my formatting proposals
(I made a suggestion only on the first one, but it applies to all :])

Review suggestions on pytorch_lightning/trainer/trainer.py (outdated, resolved)
@Borda (Member) commented Mar 25, 2020

@neggert I guess you were already solving the DDP pickle issue, right?
tests/test_gpu_models.py::test_ddp_all_dataloaders_passed_to_fit reports:

obj = <SpawnProcess(SpawnProcess-9, initial)>
file = <_io.BytesIO object at 0x7fa73c5d26d0>, protocol = None

def dump(obj, file, protocol=None):
    '''Replacement for pickle.dump() using ForkingPickler.'''
>   ForkingPickler(file, protocol).dump(obj)
E   TypeError: can't pickle code objects

@SkafteNicki may you pls resolve the conflict in the changelog?

@SkafteNicki (Member, Author)
@Borda I have updated the code with your recommendations and resolved the conflict. Only the pickle issue seems to be blocking a merge.

@jeremyjordan (Contributor) left a comment

this is great @SkafteNicki, thanks for your work on this! Can you add tests to ensure that the exceptions are raised when the model is misconfigured?

@SkafteNicki (Member, Author)
@jeremyjordan I have added the test that you requested

@ethanwharris (Member) left a comment

This looks good :) - minor weirdness in the CHANGELOG (see comment)

@SkafteNicki @Borda @PyTorchLightning/core-contributors Wondering if the required methods should be made proper abstract methods in LightningModule so that IDEs will prompt to implement them?
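
For context, the abstract-method idea above would look roughly like the following hypothetical sketch (this is not how LightningModule is actually defined):

```python
from abc import ABC, abstractmethod

import torch


class AbstractLightningModule(torch.nn.Module, ABC):
    # An abstract training_step makes IDEs prompt subclasses to implement
    # it, and instantiating a subclass without it raises a TypeError.
    @abstractmethod
    def training_step(self, batch, batch_idx):
        raise NotImplementedError
```

One trade-off: an abstract method would prevent instantiating a module that omits it, which works against the goal raised later in this thread of letting a LightningModule behave like a plain torch.nn.Module.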

Review suggestion on CHANGELOG.md (outdated, resolved)
@jeremyjordan (Contributor)

some related work going on in #1279

@mergify[bot] (Contributor) commented Mar 29, 2020

This pull request is now in conflict... :(

@mergify mergify bot requested a review from a team March 30, 2020 22:33
@mergify[bot] (Contributor) commented Mar 30, 2020

This pull request is now in conflict... :(

@awaelchli (Member)
Awesome!

@Borda (Member) commented Mar 31, 2020

@SkafteNicki nice job, could you pls resolve conflicts...

@SkafteNicki (Member, Author)
@Borda conflicts are solved now, some (unrelated) tests are still failing...

@Borda (Member) commented Mar 31, 2020

@SkafteNicki I have restarted the CI on GitHub, but the GPU tests are failing because of

def dump(obj, file, protocol=None):
    '''Replacement for pickle.dump() using ForkingPickler.'''
>   ForkingPickler(file, protocol).dump(obj)
E   TypeError: can't pickle code objects

I thought we were fixing it... @neggert?

@mergify[bot] (Contributor) commented Mar 31, 2020

This pull request is now in conflict... :(

@williamFalcon (Contributor) left a comment

Awesome, apparently we duplicated effort with #1317.

We still need to allow a LightningModule to function like a regular PyTorch module. So, instead of raising errors, we need to give warnings.

Want to merge my changes into this PR instead?

The main other thing in my PR is that configure_optimizers no longer returns a default Adam.

@Borda (Member) commented Mar 31, 2020

Could we merge this one and then Will's as a follow-up?

@SkafteNicki (Member, Author)

As the checks implemented in this PR are only called within Trainer.fit(), I think they should indeed throw an error instead of a warning. This means the user can freely use a LightningModule like any other PyTorch module until they try to make use of Lightning's trainer functionality, at which point we enforce some structure.

If you want to merge #1317 into this, that's fine by me. I will update the configure_optimizers error/warning afterwards.
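
A toy illustration of that design point, with hypothetical names: the check lives inside fit(), so nothing changes for a module used outside the trainer.

```python
import torch


class ToyTrainer:
    """Hypothetical trainer whose configuration checks run only in fit()."""

    def fit(self, model):
        # Structure is enforced here, not in the module itself.
        if not callable(getattr(model, 'training_step', None)):
            raise RuntimeError('`fit()` requires the model to define `training_step`.')
        # ... the training loop would follow here ...


# Used as a plain PyTorch module, no check ever fires:
model = torch.nn.Linear(4, 2)
out = model(torch.randn(1, 4))

# Only engaging the trainer triggers the configuration check:
try:
    ToyTrainer().fit(model)
except RuntimeError as err:
    print(err)
```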

@mergify[bot] (Contributor) commented Apr 2, 2020

This pull request is now in conflict... :(

@williamFalcon (Contributor)

@SkafteNicki awesome!

@williamFalcon williamFalcon merged commit 2912239 into Lightning-AI:master Apr 2, 2020
@Borda (Member) commented Apr 2, 2020

@williamFalcon @SkafteNicki there is still a bug, and now it is in master:

def dump(obj, file, protocol=None):
    '''Replacement for pickle.dump() using ForkingPickler.'''
>   ForkingPickler(file, protocol).dump(obj)
E   TypeError: can't pickle code objects

http://35.192.60.23/PyTorchLightning/pytorch-lightning/954/1/2

@williamFalcon (Contributor)

probably that code call

@neggert (Contributor) commented Apr 2, 2020

Yeah, test_ddp_all_dataloaders_passed_to_fit is failing. I'd imagine it's this line:

self.__code__ = self.__call__.__code__

It seems that code objects aren't pickleable, and we need everything to be pickleable for DDP to work.
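
The underlying limitation is easy to reproduce outside Lightning, since CPython's pickle has never supported code objects (a tiny, hypothetical repro):

```python
import pickle


def f():
    pass


try:
    pickle.dumps(f.__code__)  # function code objects cannot be serialized
except TypeError as err:
    print(err)  # e.g. "cannot pickle code objects"
```

Storing a reference to __code__ as an instance attribute therefore makes the whole object graph un-pickleable, which breaks DDP's spawn-based process start.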

alexeykarnachev pushed a commit to alexeykarnachev/pytorch-lightning that referenced this pull request on Apr 3, 2020:
Add useful errors when model is not configured correctly (Lightning-AI#1199)

* add check_model_configuration method

* trying to fix errors

* trying to fix tests

* added test_epoch_end to lightning template

* fix tests

* fix new test after rebase

* fix spelling

* added more checks

* updated formating

* added tests

* fixed CHANGELOG

* Apply suggestions from code review

* move test to new module

* change check on configure_optimizers

Co-authored-by: Nicki Skafte <nugginea@gmail.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
@SkafteNicki SkafteNicki deleted the add_check_model_config branch April 21, 2020 13:52
@tbachlechner
It appears that training_dataloader does not exist; I think it should read train_dataloader? Another question is naming-convention consistency with training_step.

@Borda Borda modified the milestones: v0.7., v0.7.x Apr 18, 2021
Labels
feature Is an improvement or enhancement
Development

Successfully merging this pull request may close these issues.

Check that model is configured correctly
10 participants