.fit() returns last not best weights in ddp_spawn #2565

williamFalcon · 2020-07-09T10:46:34Z

awaelchli · 2020-07-09T10:51:33Z

pytorch_lightning/trainer/distrib_data_parallel.py

@@ -559,9 +560,13 @@ def ddp_train(self, process_idx, q, model, is_master=False, proc_offset=0):
        torch.cuda.empty_cache()

        if self.global_rank == 0 and q is not None:
+            rank_zero_warn('cleaning up ddp environment...')
            q.put(self.checkpoint_callback.best_model_path)


Could we add a None check here for checkpoint_callback? The user can set it to None if they want.
See #2547
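A minimal sketch of the kind of guard being asked for, assuming the callback is either None or an object exposing best_model_path. The helper name, the stand-in callback object, and the example path are hypothetical and are not taken from the PR:

```python
from types import SimpleNamespace


def best_model_path_or_none(checkpoint_callback):
    """Return the callback's best_model_path, or None when checkpointing is disabled."""
    if checkpoint_callback is None:
        return None
    return checkpoint_callback.best_model_path


# Stand-in for a checkpoint callback; only the attribute read above is modelled.
callback = SimpleNamespace(best_model_path="checkpoints/example.ckpt")

with_callback = best_model_path_or_none(callback)
without_callback = best_model_path_or_none(None)
```

With a guard like this, the value put on the queue would be None whenever the user disabled checkpointing, so the receiving side would need to handle None as well.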

yukw777

Could you explain this PR a bit? I'm just curious, since it touches code I also touched and I want to make sure I understand it. :) Is this only for “testing”, as in unit/integration tests?

williamFalcon · 2020-07-09T11:04:02Z

In ddp_spawn the model is only updated in a subprocess, so when .fit() ends the original model is still untrained.
This PR makes sure we restore the weights to that model from the last state of the model instead of the "best".
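The problem described above can be illustrated with a small stand-alone sketch. It does not launch a real process and it is not Lightning's implementation: a pickle round trip stands in for the process boundary (an assumption made to keep the example deterministic), the model is a plain dict, and the function and file names are hypothetical.

```python
import pickle
import queue
import tempfile
from pathlib import Path


def train_in_child(pickled_model, q, ckpt_dir):
    """Stand-in for the spawned worker: it trains a private copy of the model."""
    model = pickle.loads(pickled_model)      # the worker only ever sees a copy
    model["weight"] += 1.0                   # pretend training step
    last_path = Path(ckpt_dir) / "last.pkl"  # hypothetical file name
    last_path.write_bytes(pickle.dumps(model))
    q.put(str(last_path))                    # report where the final state lives


parent_model = {"weight": 0.0}
q = queue.Queue()  # stands in for the inter-process queue `q` in the diff

with tempfile.TemporaryDirectory() as ckpt_dir:
    train_in_child(pickle.dumps(parent_model), q, ckpt_dir)

    # The parent's object was never touched by the "training" above.
    weight_before_restore = parent_model["weight"]

    # Restore the last state that the worker saved.
    last_path = q.get()
    parent_model.update(pickle.loads(Path(last_path).read_bytes()))

weight_after_restore = parent_model["weight"]
```

The parent's weight only changes once the saved state is loaded back, which is the behaviour the PR description is concerned with.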

yukw777 · 2020-07-09T13:14:13Z

@williamFalcon OK, and you use the queue to send the best weight path back to the main process in case the user wants the best weights for testing. Got it.

pep8speaks · 2020-07-09T14:45:38Z

Hello @williamFalcon! Thanks for updating this PR.

In the file pytorch_lightning/trainer/distrib_data_parallel.py:

Line 569:12: E713 test for membership should be 'not in'

Comment last updated at 2020-07-09 15:22:21 UTC
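For readers unfamiliar with E713: the check flags membership tests written as `not x in y` and asks for `x not in y`. A small illustration with made-up data follows; the variable names are hypothetical and unrelated to the flagged line in the PR.

```python
required_keys = ["best_model_path"]
results = {"last_path": "last.ckpt"}

# Flagged by E713:   [key for key in required_keys if not key in results]
# Preferred spelling uses the `not in` operator directly:
missing = [key for key in required_keys if key not in results]
```

Both spellings evaluate to the same result; the second is the form the style check accepts.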

codecov · 2020-07-09T15:36:01Z

Codecov Report

Merging #2565 into master will increase coverage by 4%.
The diff coverage is 57%.

@@           Coverage Diff           @@
##           master   #2565    +/-   ##
=======================================
+ Coverage      87%     91%    +4%     
=======================================
  Files          70      70            
  Lines        5703    5718    +15     
=======================================
+ Hits         4960    5209   +249     
+ Misses        743     509   -234

williamFalcon added 2 commits July 9, 2020 06:42

added base tests for tpu

19734e6

added base tests for tpu

531b383

mergify bot requested a review from a team July 9, 2020 10:46

williamFalcon changed the title ~~.fit() returns last not best weights~~ .fit() returns last not best weights in ddp_spawn Jul 9, 2020

awaelchli reviewed Jul 9, 2020

mergify bot requested a review from a team July 9, 2020 10:51

yukw777 reviewed Jul 9, 2020

williamFalcon added 2 commits July 9, 2020 07:01

enable none checkpoint

ec62f79

enable none checkpoint

b41f74f

enable none checkpoint

225f93e

Borda added the feature Is an improvement or enhancement label Jul 9, 2020

Borda approved these changes Jul 9, 2020

mergify bot requested a review from a team July 9, 2020 12:40

Borda self-requested a review July 9, 2020 13:24

enable none checkpoint

3551777

williamFalcon added 7 commits July 9, 2020 10:48

enable none checkpoint

c3752ef

enable none checkpoint

9d6b4b2

Merge branch 'master' into last

ada80e7

enable none checkpoint

f6ac3a4

enable none checkpoint

ba6b95a

enable none checkpoint

52299d8

enable none checkpoint

222dd03

williamFalcon merged commit 4bbcfa0 into master Jul 9, 2020

Borda deleted the last branch July 9, 2020 16:25
