
Add ddp_cpu backend for testing ddp without GPUs #1158

Merged: 24 commits merged into Lightning-AI:master from the cpu_ddp branch on Apr 16, 2020

Conversation

@neggert (Contributor) commented Mar 15, 2020

New feature: DDP on CPU. This can be used for distributed training without GPUs, or to test/debug DDP on a single machine without GPUs. Since PyTorch already makes good use of multiple cores under the hood, this will not provide any speedup over normal CPU training if you're only using a single node.

API changes:

  • New distributed backend: distributed_backend="ddp_cpu"
  • New Trainer argument: num_processes. Controls the number of processes per node; it is currently only used by ddp_cpu. See the usage sketch below.
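For illustration, a minimal usage sketch of the two new options (`MyModel` is a placeholder for any LightningModule subclass, not something from this PR's diff):

```python
from pytorch_lightning import Trainer

# MyModel stands in for whatever LightningModule you already have.
model = MyModel()

# Spawn two processes on this node and run DDP over CPU; no GPUs required,
# which makes this handy for testing/debugging DDP logic locally.
trainer = Trainer(distributed_backend="ddp_cpu", num_processes=2, max_epochs=1)
trainer.fit(model)
```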

Still need to update the documentation, but I wanted to get some eyes on this before I got too far. Right now everything is functional as far as I can tell, and I've added a new test that covers this feature.

@pep8speaks commented Mar 15, 2020

Hello @neggert! Thanks for updating this PR.

Line 47:52: W504 line break after binary operator

Line 708:111: E501 line too long (118 > 110 characters)
Line 712:111: E501 line too long (118 > 110 characters)
Line 716:111: E501 line too long (118 > 110 characters)
Line 720:111: E501 line too long (118 > 110 characters)
Line 724:111: E501 line too long (117 > 110 characters)
Line 728:111: E501 line too long (117 > 110 characters)
Line 732:111: E501 line too long (117 > 110 characters)
Line 736:111: E501 line too long (118 > 110 characters)
Line 740:111: E501 line too long (117 > 110 characters)
Line 745:111: E501 line too long (116 > 110 characters)
Line 750:111: E501 line too long (116 > 110 characters)
Line 755:111: E501 line too long (118 > 110 characters)
Line 760:111: E501 line too long (117 > 110 characters)
Line 765:111: E501 line too long (117 > 110 characters)
Line 770:111: E501 line too long (117 > 110 characters)
Line 775:111: E501 line too long (117 > 110 characters)
Line 780:111: E501 line too long (117 > 110 characters)

Comment last updated at 2020-04-11 22:45:26 UTC

@codecov bot commented Mar 16, 2020

Codecov Report

Merging #1158 into master will increase coverage by 0%.
The diff coverage is 95%.

@@          Coverage Diff           @@
##           master   #1158   +/-   ##
======================================
  Coverage      91%     91%           
======================================
  Files          67      67           
  Lines        3742    3760   +18     
======================================
+ Hits         3400    3418   +18     
  Misses        342     342           

@neggert neggert force-pushed the cpu_ddp branch 2 times, most recently from cac041d to 3f1f0e1 on March 18, 2020 14:52
@tullie (Contributor) commented Mar 18, 2020

Have you considered just making ddp work with CPU when gpus=0 or gpus=None? I think if I were a user, that would be the intuitive thing for ddp to do without GPUs.

@@ -855,11 +855,15 @@ def init_ddp_connection(self):
         try:
             root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
         except Exception:
-            root_node = '127.0.0.2'
+            root_node = '127.0.0.1'

Contributor:
why the IP change?

Contributor Author (@neggert):
I wish I knew. 127.0.0.1 works, 127.0.0.2 doesn't.


         root_node = self.trainer.resolve_root_node_address(root_node)
         os.environ['MASTER_ADDR'] = root_node
-        torch_distrib.init_process_group('nccl', rank=proc_rank, world_size=world_size)
+        if self.trainer.on_gpu:

Contributor:
why the different backends?

Contributor Author (@neggert):
NCCL only works for GPU. The PyTorch docs recommend using GLOO for CPU training: https://pytorch.org/docs/stable/distributed.html#backends
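Roughly, the selection amounts to something like this (a sketch only; `proc_rank`, `world_size`, and `self.trainer.on_gpu` are taken from the surrounding diff, and the merged code may arrange it differently):

```python
import torch.distributed as torch_distrib

# NCCL only supports CUDA tensors, so CPU-only runs fall back to Gloo,
# the backend the PyTorch docs recommend for CPU training.
backend = 'nccl' if self.trainer.on_gpu else 'gloo'
torch_distrib.init_process_group(backend, rank=proc_rank, world_size=world_size)
```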

-        output = self.module(*inputs, **kwargs)
+        # output = self.module(*inputs, **kwargs)
+        # lightning (ddp_cpu)
+        if self.module.training:

Contributor:
why this change?

Contributor Author (@neggert):
This is copying the changes you made above for the case where self.device_ids is None. I think previously, we would never hit this code.
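A hedged sketch of that branch, based on the snippet above (attribute names such as `self.module.testing` are assumptions, and the merged code may differ in detail):

```python
# Inside the Lightning DDP wrapper's forward(): with ddp_cpu there are no
# device_ids, so nothing is scattered to a GPU and the Lightning hooks are
# called directly on the wrapped module with the CPU tensors.
if self.device_ids is None:
    if self.module.training:
        output = self.module.training_step(*inputs, **kwargs)
    elif self.module.testing:
        output = self.module.test_step(*inputs, **kwargs)
    else:
        output = self.module.validation_step(*inputs, **kwargs)
```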

@neggert (Contributor Author) commented Mar 18, 2020

@tullie As discussed in Slack, I'll change things so that distributed_backend="ddp", gpus=None also gives DDP-on-CPU behavior. (gpus=0 runs on GPU 0 I believe. Don't think we should change that.)
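Assuming the fallback lands as described, the two configurations below should end up behaving the same (illustrative only, not code from this PR):

```python
from pytorch_lightning import Trainer

# Explicit CPU DDP backend added by this PR.
trainer_a = Trainer(distributed_backend="ddp_cpu", num_processes=2)

# Plain "ddp" with no GPUs requested falls back to the same CPU behavior.
trainer_b = Trainer(distributed_backend="ddp", gpus=None, num_processes=2)
```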

@neggert neggert force-pushed the cpu_ddp branch 2 times, most recently from e610a09 to 58f687a on March 20, 2020 20:17
@mergify bot commented Mar 24, 2020

This pull request is now in conflict... :(

@Borda Borda changed the title from "[WIP] Add ddp_cpu backend for testing ddp without CPUs" to "[WIP] Add ddp_cpu backend for testing ddp without GPUs" Mar 26, 2020
@Borda (Member) commented Mar 26, 2020

@neggert how is it going here? need any help? 🐰

@mergify mergify bot requested a review from a team March 30, 2020 22:33
@williamFalcon (Contributor) commented:
@neggert can we get this merged so it can go into 0.7.2?

@williamFalcon williamFalcon added this to the 0.7.3 milestone Apr 3, 2020
Comment on lines 195 to 196
if self.num_gpus == 0:
    pass
Member:
why these if ...: pass ?

Contributor Author (@neggert):
Just for clarity. I wanted to make clear to the future reader that we're considering this case and intentionally doing nothing.

Member:
I would skip it lol

Member:
Maybe a comment could achieve the same

Contributor Author (@neggert):
Replaced it with a comment

pytorch_lightning/trainer/distrib_data_parallel.py (outdated review thread, resolved)
tests/trainer/test_trainer.py (outdated review thread, resolved)
@neggert neggert changed the title from "[WIP] Add ddp_cpu backend for testing ddp without GPUs" to "Add ddp_cpu backend for testing ddp without GPUs" Apr 7, 2020
@Borda Borda requested review from tullie, williamFalcon and a team April 7, 2020 22:10
@williamFalcon williamFalcon changed the title from "Add ddp_cpu backend for testing ddp without GPUs" to "[WIP] Add ddp_cpu backend for testing ddp without GPUs" Apr 8, 2020
@williamFalcon (Contributor) commented:
Is this ready for 0.7.2 or push to 0.7.3?

@Borda (Member) commented Apr 8, 2020

I would leave it for the next release, 0.7.3. @neggert?

@neggert (Contributor Author) commented Apr 8, 2020

Yeah, let's not rush it into a release. 0.7.3 is fine.

@neggert neggert changed the title from "[WIP] Add ddp_cpu backend for testing ddp without GPUs" to "Add ddp_cpu backend for testing ddp without GPUs" Apr 11, 2020
@Borda Borda added the feature (Is an improvement or enhancement) and ci (Continuous Integration) labels Apr 11, 2020
pytorch_lightning/trainer/trainer.py (outdated review thread, resolved)
pytorch_lightning/core/lightning.py (outdated review thread, resolved)
@mergify mergify bot requested a review from a team April 11, 2020 22:41
@Borda Borda added the ready (PRs ready to be merged) label Apr 14, 2020
@Borda Borda requested a review from jeremyjordan April 14, 2020 09:12
@williamFalcon williamFalcon merged commit e3001a0 into Lightning-AI:master Apr 16, 2020
@neggert neggert deleted the cpu_ddp branch April 16, 2020 14:54
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Jun 7, 2020
* Add tests for distributed backend config

* Refactor set_distributed_mode

* Use gloo backend on cpu

* Use 127.0.0.1 instead of 127.0.0.2

Not totally clear on why this is necessary, but it seemed to work

* Update LightningDDP so that it works with CPU

* Add ddp_cpu backend and num_processes Trainer arg

* PEP8

* Fix test skipping. Inequalities are hard :/

* Skip ddp_cpu test on Windows

* Make a few more cases fall back to ddp_cpu

* New function name

* Flake8

* Don't test distributed on MacOS with torch < 1.3

Support for distributed in MacOS was added in Torch 1.3.0

* Add ddp_cpu and num_processes to docs

* Parametrize trainer config tests

* Tweak warning

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Remove redundant test

* Replace pass branches with comments

* Add missing warnings import

* save_path -> root_dir

* Use new rank_zero_warn

* Whitespace

* Apply suggestions from code review

* formatting

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
@Borda Borda modified the milestones: 0.7.4, v0.7.x Apr 18, 2021
Labels: ci (Continuous Integration), feature (Is an improvement or enhancement), ready (PRs ready to be merged)
Projects: None yet
6 participants