
set logger level for package #1718

Merged

Conversation

jeremyjordan (Contributor)

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1503

If a user isn't setting up any logging config on their own, we still do it for them.
If they want to set up their own logging config, this approach is less obtrusive.
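A minimal sketch of the pattern described above (the logger name and placement are assumptions, not necessarily the exact code in this PR):

```python
import logging

# Configure only the package logger's level.
_logger = logging.getLogger("lightning")
_logger.setLevel(logging.INFO)

# basicConfig() is a no-op when the root logger already has handlers, so a user
# who configured logging themselves keeps their setup; everyone else gets a
# default stderr handler.
logging.basicConfig(level=logging.INFO)
```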

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team May 3, 2020 19:14
@Borda Borda added the "feature" (Is an improvement or enhancement) label May 4, 2020
@@ -165,6 +167,8 @@ def test_advanced_profiler_describe(tmpdir, advanced_profiler):
"""
ensure the profiler won't fail when reporting the summary
"""
trainer = Trainer() # logging is set up when a Trainer object is initialized
Member

What about the model or any other tool besides the Trainer?

@mergify mergify bot requested a review from a team May 5, 2020 13:42
@awaelchli (Member)

@jeremyjordan will logging before the Trainer init still work, say when we set up callbacks, loggers, etc.? We would just need to make sure to import Trainer before all of that, right?
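In code, the scenario in question looks roughly like this (purely illustrative; the "lightning" logger name and the user's objects are assumptions):

```python
import logging

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# These objects exist (and may already emit log records) before any Trainer does,
# so logging configured inside Trainer.__init__ would not yet be in effect here.
tb_logger = TensorBoardLogger("lightning_logs")
logging.getLogger("lightning").info("emitted before the Trainer is created")

trainer = Trainer(logger=tb_logger)  # with the original approach, config happened here
```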

@jeremyjordan (Contributor, Author)

jeremyjordan commented May 5, 2020

The other option, which I presented here, was to only add a stream handler for the lightning logger. I don't have a strong opinion on which way we handle it and am happy to switch this out for the alternative.
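A sketch of that alternative, assuming the package logger is named "lightning" (handler and format details are illustrative):

```python
import logging

lightning_logger = logging.getLogger("lightning")
lightning_logger.setLevel(logging.INFO)

# Attach a handler only to the package's own logger; the root logger and any
# user or third-party configuration are left untouched.
lightning_logger.addHandler(logging.StreamHandler())
lightning_logger.propagate = False  # don't also forward records to the root logger
```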

@mergify (Contributor)

mergify bot commented May 10, 2020

This pull request is now in conflict... :(

@mergify mergify bot requested a review from a team May 11, 2020 19:14
@Borda Borda force-pushed the improve-python-logging-config branch from 41676ef to c09f1f8 Compare May 11, 2020 19:15
@Borda Borda added the "ready" (PRs ready to be merged) label May 11, 2020
@Borda Borda added this to the 0.7.6 milestone May 11, 2020
@Borda Borda changed the title move logging config to trainer class init set logger level for package May 11, 2020
@codecov

codecov bot commented May 11, 2020

Codecov Report

Merging #1718 into master will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #1718   +/-   ##
======================================
  Coverage      88%     88%           
======================================
  Files          69      69           
  Lines        4304    4306    +2     
======================================
+ Hits         3794    3796    +2     
  Misses        510     510           

@Borda (Member)

Borda commented May 11, 2020

@tgaddair this is also just caching/pip issue?

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.7/x64/bin/horovodrun", line 21, in <module>
    run_commandline()
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/run.py", line 876, in run_commandline
    _run(args)
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/run.py", line 844, in _run
    _launch_job(args, remote_host_names, settings, common_intfs, command)
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/run.py", line 867, in _launch_job
    gloo_run(settings, remote_host_names, common_intfs, env, driver_ip, command)
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 287, in gloo_run
    _launch_jobs(settings, env, host_alloc_plan, remote_host_names, run_command)
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 259, in _launch_jobs
    .format(name=name, code=exit_code))
RuntimeError: Gloo job detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 1
Exit code: 1

@tgaddair (Contributor)

tgaddair commented May 11, 2020

No, in this case it looks like something (possibly a race condition) is causing rank 1 to fail:

Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 61, in <module>
Mon May 11 19:20:11 2020[1]<stderr>:    run_test_from_config(json.loads(args.trainer_options))
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 49, in run_test_from_config
Mon May 11 19:20:11 2020[1]<stderr>:    run_model_test(trainer_options, model, on_gpu=args.on_gpu, version=0, with_hpc=False)
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/utils.py", line 78, in run_model_test
Mon May 11 19:20:11 2020[1]<stderr>:    pretrained_model = load_model(logger, trainer.checkpoint_callback.dirpath)
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/utils.py", line 136, in load_model
Mon May 11 19:20:11 2020[1]<stderr>:    tags_csv=tags_path
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/pytorch_lightning/core/lightning.py", line 1527, in load_from_checkpoint
Mon May 11 19:20:11 2020[1]<stderr>:    model = cls._load_model_state(checkpoint, *args, **kwargs)
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/pytorch_lightning/core/lightning.py", line 1558, in _load_model_state
Mon May 11 19:20:11 2020[1]<stderr>:    model = cls(*args, **kwargs)
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/model_template.py", line 50, in __init__
Mon May 11 19:20:11 2020[1]<stderr>:    self.__build_model()
Mon May 11 19:20:11 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/base/model_template.py", line 58, in __build_model
Mon May 11 19:20:11 2020[1]<stderr>:    in_features=self.hparams.in_features,
Mon May 11 19:20:11 2020[1]<stderr>:AttributeError: 'Namespace' object has no attribute 'in_features'

@Borda (Member)

Borda commented May 11, 2020

> No, in this case it looks like something (possibly a race condition) is causing rank 1 to fail:

but in_features is in the defaults and it is failing kind of randomly... any idea what could go wrong?

@tgaddair (Contributor)

> No, in this case it looks like something (possibly a race condition) is causing rank 1 to fail:

> but in_features is in the defaults and it is failing kind of randomly... any idea what could go wrong?

My suspicion is that it's hitting this code path when loading the model: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/core/lightning.py#L1545

That suggests that the hparams haven't been updated in the checkpoint yet, which may be a race condition (rank 1 gets to this point before rank 0 has had a chance to write out the hparams). I'll look into this further and hopefully put together a quick PR.
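If it is that race, the general shape of a fix is to make the other ranks wait until rank 0 has finished writing before they load; a rough sketch (the barrier helper, save step, and names here are hypothetical, not necessarily what #1786 does):

```python
import torch
import horovod.torch as hvd  # assumes hvd.init() was already called by the script

def barrier():
    # Horovod has no dedicated barrier primitive here; a tiny allreduce forces
    # all ranks to synchronize before continuing.
    hvd.allreduce(torch.tensor(0), name="barrier")

# Rank 0 writes the checkpoint (including hparams); every rank waits for it
# before attempting to load.
if hvd.rank() == 0:
    save_checkpoint(checkpoint_path)  # hypothetical save step and path
barrier()
pretrained_model = ModelClass.load_from_checkpoint(checkpoint_path)  # ModelClass is hypothetical
```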

@Borda (Member)

Borda commented May 11, 2020

> That suggests that the hparams haven't been updated in the checkpoint yet, which may be a race condition (rank 1 gets to this point before rank 0 has had a chance to write out the hparams). I'll look into this further and hopefully put together a quick PR.

Cool, thx, the sooner you know the better, because it is causing a difficult time for all PRs :]

@tgaddair (Contributor)

> That suggests that the hparams haven't been updated in the checkpoint yet, which may be a race condition (rank 1 gets to this point before rank 0 has had a chance to write out the hparams). I'll look into this further and hopefully put together a quick PR.

> Cool, thx, the sooner you know the better, because it is causing a difficult time for all PRs :]

I haven't been able to repro the issue locally, but my best guess at this point is still that there's a race condition, in which case #1786 should be able to address it.

@williamFalcon williamFalcon merged commit 1df0d2d into Lightning-AI:master May 12, 2020
@Borda (Member)

Borda commented May 12, 2020

> I haven't been able to repro the issue locally, but my best guess at this point is still that there's a race condition, in which case #1786 should be able to address it.

Neither I nor CircleCI suffers from this issue at all, so I guess it is somehow connected with GH Actions...
