
set logger level for package #1718

Merged

Conversation

jeremyjordan (Contributor)

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1503

If a user isn't setting up any logging config on their own, we still do it for them.
If they want to set up their own logging config, this approach is less obtrusive.
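A minimal sketch of the pattern described above (the logger name and placement are assumptions, not necessarily the exact code in this PR):

```python
import logging

# Configure only the package logger's level.
_logger = logging.getLogger("lightning")
_logger.setLevel(logging.INFO)

# basicConfig() is a no-op when the root logger already has handlers, so a user
# who configured logging themselves keeps their setup; everyone else gets a
# default stderr handler.
logging.basicConfig(level=logging.INFO)
```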

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team May 3, 2020 19:14
@Borda Borda added the "feature" (Is an improvement or enhancement) label May 4, 2020
@@ -165,6 +167,8 @@ def test_advanced_profiler_describe(tmpdir, advanced_profiler):
"""
ensure the profiler won't fail when reporting the summary
"""
trainer = Trainer() # logging is set up when a Trainer object is initialized
Member

What about the model or any other tool besides the Trainer?

@mergify mergify bot requested a review from a team May 5, 2020 13:42
@awaelchli (Member)

@jeremyjordan will logging before the Trainer init still work, say when we set up callbacks, loggers, etc.? We would just need to make sure to import Trainer before all of that, right?
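In code, the scenario in question looks roughly like this (purely illustrative; the "lightning" logger name and the user's objects are assumptions):

```python
import logging

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# These objects exist (and may already emit log records) before any Trainer does,
# so logging configured inside Trainer.__init__ would not yet be in effect here.
tb_logger = TensorBoardLogger("lightning_logs")
logging.getLogger("lightning").info("emitted before the Trainer is created")

trainer = Trainer(logger=tb_logger)  # with the original approach, config happened here
```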

@jeremyjordan (Contributor, Author)

jeremyjordan commented May 5, 2020

The other option, which I presented here, was to only add a stream handler for the lightning logger. I don't have a strong opinion on which way we handle it and am happy to switch this out for the alternative.
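A sketch of that alternative, assuming the package logger is named "lightning" (handler and format details are illustrative):

```python
import logging

lightning_logger = logging.getLogger("lightning")
lightning_logger.setLevel(logging.INFO)

# Attach a handler only to the package's own logger; the root logger and any
# user or third-party configuration are left untouched.
lightning_logger.addHandler(logging.StreamHandler())
lightning_logger.propagate = False  # don't also forward records to the root logger
```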

@mergify (Contributor)

mergify bot commented May 10, 2020

This pull request is now in conflict... :(

@mergify mergify bot requested a review from a team May 11, 2020 19:14
@Borda Borda force-pushed the improve-python-logging-config branch from 41676ef to c09f1f8 Compare May 11, 2020 19:15
@Borda Borda added the "ready" (PRs ready to be merged) label May 11, 2020
@Borda Borda added this to the 0.7.6 milestone May 11, 2020
@Borda Borda changed the title move logging config to trainer class init set logger level for package May 11, 2020
@codecov

codecov bot commented May 11, 2020

Codecov Report

Merging #1718 into master will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #1718   +/-   ##
======================================
  Coverage      88%     88%           
======================================
  Files          69      69           
  Lines        4304    4306    +2     
======================================
+ Hits         3794    3796    +2     
  Misses        510     510           

@Borda (Member)

Borda commented May 11, 2020

@tgaddair this is also just caching/pip issue?

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.7/x64/bin/horovodrun", line 21, in <module>
    run_commandline()
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/run.py", line 876, in run_commandline
    _run(args)
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/run.py", line 844, in _run
    _launch_job(args, remote_host_names, settings, common_intfs, command)
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/run.py", line 867, in _launch_job
    gloo_run(settings, remote_host_names, common_intfs, env, driver_ip, command)
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 287, in gloo_run
    _launch_jobs(settings, env, host_alloc_plan, remote_host_names, run_command)
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 259, in _launch_jobs
    .format(name=name, code=exit_code))
RuntimeError: Gloo job detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 1
Exit code: 1

@tgaddair (Contributor)

tgaddair commented May 11, 2020

No, in this case it looks like something (possibly a race condition) is causing rank 1 to fail:

Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 61, in <module>
Mon May 11 19:20:11 2020[1]<stderr>:    run_test_from_config(json.loads(args.trainer_options))
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 49, in run_test_from_config
Mon May 11 19:20:11 2020[1]<stderr>:    run_model_test(trainer_options, model, on_gpu=args.on_gpu, version=0, with_hpc=False)
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/utils.py", line 78, in run_model_test
Mon May 11 19:20:11 2020[1]<stderr>:    pretrained_model = load_model(logger, trainer.checkpoint_callback.dirpath)
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/utils.py", line 136, in load_model
Mon May 11 19:20:11 2020[1]<stderr>:    tags_csv=tags_path
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/pytorch_lightning/core/lightning.py", line 1527, in load_from_checkpoint
Mon May 11 19:20:11 2020[1]<stderr>:    model = cls._load_model_state(checkpoint, *args, **kwargs)
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/pytorch_lightning/core/lightning.py", line 1558, in _load_model_state
Mon May 11 19:20:11 2020[1]<stderr>:    model = cls(*args, **kwargs)
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/model_template.py", line 50, in __init__
Mon May 11 19:20:11 2020[1]<stderr>:    self.__build_model()
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/model_template.py", line 58, in __build_model
Mon May 11 19:20:11 2020[1]<stderr>:    in_features=self.hparams.in_features,
Mon May 11 19:20:11 2020[1]<stderr>:AttributeError: 'Namespace' object has no attribute 'in_features'

@Borda (Member)

Borda commented May 11, 2020

> No, in this case it looks like something (possibly a race condition) is causing rank 1 to fail:

but in_features is in the defaults and it is failing kind of randomly... any idea what could go wrong?

@tgaddair (Contributor)

> No, in this case it looks like something (possibly a race condition) is causing rank 1 to fail:

> but in_features is in the defaults and it is failing kind of randomly... any idea what could go wrong?

My suspicion is that it's hitting this code path when loading the model: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/core/lightning.py#L1545

That suggests that the hparams haven't been updated in the checkpoint yet, which may be a race condition (rank 1 gets to this point before rank 0 has had a chance to write out the hparams). I'll look into this further and hopefully put together a quick PR.
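If it is that race, the general shape of a fix is to make the other ranks wait until rank 0 has finished writing before they load; a rough sketch (the barrier helper, save step, and names here are hypothetical, not necessarily what #1786 does):

```python
import torch
import horovod.torch as hvd  # assumes hvd.init() was already called by the script

def barrier():
    # Horovod has no dedicated barrier primitive here; a tiny allreduce forces
    # all ranks to synchronize before continuing.
    hvd.allreduce(torch.tensor(0), name="barrier")

# Rank 0 writes the checkpoint (including hparams); every rank waits for it
# before attempting to load.
if hvd.rank() == 0:
    save_checkpoint(checkpoint_path)  # hypothetical save step and path
barrier()
pretrained_model = ModelClass.load_from_checkpoint(checkpoint_path)  # ModelClass is hypothetical
```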

@Borda (Member)

Borda commented May 11, 2020

> That suggests that the hparams haven't been updated in the checkpoint yet, which may be a race condition (rank 1 gets to this point before rank 0 has had a chance to write out the hparams). I'll look into this further and hopefully put together a quick PR.

Cool, thx, the sooner you know the better, because it is causing a difficult time for all PRs :]

@tgaddair (Contributor)

> That suggests that the hparams haven't been updated in the checkpoint yet, which may be a race condition (rank 1 gets to this point before rank 0 has had a chance to write out the hparams). I'll look into this further and hopefully put together a quick PR.

> Cool, thx, the sooner you know the better, because it is causing a difficult time for all PRs :]

I haven't been able to repro the issue locally, but my best guess at this point is still that there's a race condition, in which case #1786 should be able to address it.

@williamFalcon williamFalcon merged commit 1df0d2d into Lightning-AI:master May 12, 2020
@Borda (Member)

Borda commented May 12, 2020

> I haven't been able to repro the issue locally, but my best guess at this point is still that there's a race condition, in which case #1786 should be able to address it.

Neither I nor CircleCI suffers from this issue at all, so I guess it is somehow connected with GH Actions...
