Commit 0f7ed50

Merge branch 'master' into bugfix_6153

tchaton authored Mar 5, 2021
2 parents f5392d4 + b6aa350 commit 0f7ed50
Showing 24 changed files with 339 additions and 109 deletions.
38 changes: 22 additions & 16 deletions CHANGELOG.md
@@ -15,16 +15,13 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](https://github.com/PyTorchLightning/pytorch-lightning/pull/6072))


- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](https://github.com/PyTorchLightning/pytorch-lightning/pull/6274))

- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](https://github.com/PyTorchLightning/pytorch-lightning/pull/5915))

### Changed

- Changed the order of `backward`, `step`, `zero_grad` to `zero_grad`, `backward`, `step` ([#6147](https://github.com/PyTorchLightning/pytorch-lightning/pull/6147))

- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](https://github.com/PyTorchLightning/pytorch-lightning/pull/6274))

- Changed default for DeepSpeed CPU Offload to False, due to prohibitively slow speeds at smaller scale ([#6262](https://github.com/PyTorchLightning/pytorch-lightning/pull/6262))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](https://github.com/PyTorchLightning/pytorch-lightning/pull/6259))

@@ -71,22 +68,19 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Expose DeepSpeed loss parameters to allow users to fix loss instability ([#6115](https://github.com/PyTorchLightning/pytorch-lightning/pull/6115))


- Fixed epoch level schedulers not being called when `val_check_interval < 1.0` ([#6075](https://github.com/PyTorchLightning/pytorch-lightning/pull/6075))


- Fixed multiple early stopping callbacks ([#6197](https://github.com/PyTorchLightning/pytorch-lightning/pull/6197))
- Fixed `ModelPruning(make_pruning_permanent=True)` pruning buffers getting removed when saved during training ([#6073](https://github.com/PyTorchLightning/pytorch-lightning/pull/6073))


- Fixed `ModelPruning(make_pruning_permanent=True)` pruning buffers getting removed when saved during training ([#6073](https://github.com/PyTorchLightning/pytorch-lightning/pull/6073))
- Fixed `trainer.test` from `best_path` hangs after calling `trainer.fit` ([#6272](https://github.com/PyTorchLightning/pytorch-lightning/pull/6272))


- Fixed incorrect usage of `detach()`, `cpu()`, `to()` ([#6216](https://github.com/PyTorchLightning/pytorch-lightning/pull/6216))
- Fixed duplicate logs appearing in console when using the python logging module ([#5509](https://github.com/PyTorchLightning/pytorch-lightning/pull/5509), [#6275](https://github.com/PyTorchLightning/pytorch-lightning/pull/6275))


- Fixed LBFGS optimizer support which didn't converge in automatic optimization ([#6147](https://github.com/PyTorchLightning/pytorch-lightning/pull/6147))
- Fixed `SingleTPU` calling `all_gather` ([#6296](https://github.com/PyTorchLightning/pytorch-lightning/pull/6296))


- Prevent `WandbLogger` from dropping values ([#5931](https://github.com/PyTorchLightning/pytorch-lightning/pull/5931))
- Fixed DP reduction with collection ([#6324](https://github.com/PyTorchLightning/pytorch-lightning/pull/6324))


- Fixed PyTorch Profiler with `emit_nvtx` ([#6260](https://github.com/PyTorchLightning/pytorch-lightning/pull/6260))
@@ -95,12 +89,24 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `trainer.test` from `best_path` hangs after calling `trainer.fit` ([#6272](https://github.com/PyTorchLightning/pytorch-lightning/pull/6272))


- Fixed duplicate logs appearing in console when using the python logging module ([#5509](https://github.com/PyTorchLightning/pytorch-lightning/pull/5509), [#6275](https://github.com/PyTorchLightning/pytorch-lightning/pull/6275))
## [1.2.2] - 2021-03-02

### Added

- Fixed `SingleTPU` calling `all_gather` ([#6296](https://github.com/PyTorchLightning/pytorch-lightning/pull/6296))
- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](https://github.com/PyTorchLightning/pytorch-lightning/pull/6072))

### Changed

- Changed the order of `backward`, `step`, `zero_grad` to `zero_grad`, `backward`, `step` ([#6147](https://github.com/PyTorchLightning/pytorch-lightning/pull/6147))
- Changed default for DeepSpeed CPU Offload to False, due to prohibitively slow speeds at smaller scale ([#6262](https://github.com/PyTorchLightning/pytorch-lightning/pull/6262))

### Fixed

- Fixed epoch level schedulers not being called when `val_check_interval < 1.0` ([#6075](https://github.com/PyTorchLightning/pytorch-lightning/pull/6075))
- Fixed multiple early stopping callbacks ([#6197](https://github.com/PyTorchLightning/pytorch-lightning/pull/6197))
- Fixed incorrect usage of `detach()`, `cpu()`, `to()` ([#6216](https://github.com/PyTorchLightning/pytorch-lightning/pull/6216))
- Fixed LBFGS optimizer support which didn't converge in automatic optimization ([#6147](https://github.com/PyTorchLightning/pytorch-lightning/pull/6147))
- Prevent `WandbLogger` from dropping values ([#5931](https://github.com/PyTorchLightning/pytorch-lightning/pull/5931))
- Fixed error thrown when using valid distributed mode in multi node ([#6297](https://github.com/PyTorchLightning/pytorch-lightning/pull/6297))


1 change: 1 addition & 0 deletions pytorch_lightning/plugins/environments/__init__.py
@@ -12,5 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment # noqa: F401
from pytorch_lightning.plugins.environments.lightning_environment import LightningEnvironment # noqa: F401
from pytorch_lightning.plugins.environments.slurm_environment import SLURMEnvironment # noqa: F401
from pytorch_lightning.plugins.environments.torchelastic_environment import TorchElasticEnvironment # noqa: F401
31 changes: 20 additions & 11 deletions pytorch_lightning/plugins/environments/cluster_environment.py
@@ -11,24 +11,33 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from typing import Optional


class ClusterEnvironment:
class ClusterEnvironment(ABC):
""" Specification of a cluster environment. """

def __init__(self):
self._world_size = None
@abstractmethod
def creates_children(self) -> bool:
""" Whether the environment creates the subprocesses or not. """

def master_address(self):
pass
@abstractmethod
def master_address(self) -> str:
""" The master address through which all processes connect and communicate. """

def master_port(self):
pass
@abstractmethod
def master_port(self) -> int:
""" An open and configured port in the master node through which all processes communicate. """

def world_size(self) -> int:
return self._world_size
@abstractmethod
def world_size(self) -> Optional[int]:
""" The number of processes across all devices and nodes. """

@abstractmethod
def local_rank(self) -> int:
pass
""" The rank (index) of the currently running process inside of the current node. """

@abstractmethod
def node_rank(self) -> int:
pass
""" The rank (index) of the node on which the current process runs. """
71 changes: 71 additions & 0 deletions pytorch_lightning/plugins/environments/lightning_environment.py
@@ -0,0 +1,71 @@
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import socket
from typing import Optional

from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment


class LightningEnvironment(ClusterEnvironment):
    """
    The default environment used by Lightning for a single node or free cluster (not managed).
    The master process must be launched by the user and Lightning will spawn new
    worker processes for distributed training, either in a single node or across multiple nodes.
    If the master address and port are not provided, the default environment will choose them
    automatically. It is recommended to use this default environment for single-node distributed
    training as it provides the most convenient way to launch the training script.
    """

    def __init__(self):
        super().__init__()
        self._master_port = None

    def creates_children(self) -> bool:
        return False

    def master_address(self) -> str:
        return os.environ.get("MASTER_ADDR", "127.0.0.1")

    def master_port(self) -> int:
        if self._master_port is None:
            self._master_port = os.environ.get("MASTER_PORT", find_free_network_port())
        return int(self._master_port)

    def world_size(self) -> Optional[int]:
        return None

    def local_rank(self) -> int:
        return int(os.environ.get("LOCAL_RANK", 0))

    def node_rank(self) -> int:
        group_rank = os.environ.get("GROUP_RANK", 0)
        return int(os.environ.get("NODE_RANK", group_rank))


def find_free_network_port() -> int:
    """
    Finds a free port on localhost.
    It is useful in single-node training when we don't want to connect to a real master node but
    have to set the `MASTER_PORT` environment variable.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", 0))
    s.listen(1)
    port = s.getsockname()[1]
    s.close()
    return port
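A small usage sketch of the fallback behaviour described in the docstring above (illustrative only; in practice the Trainer constructs this environment for you, and the addresses and ports below are made up):

```python
import os

from pytorch_lightning.plugins.environments import LightningEnvironment

# No MASTER_ADDR / MASTER_PORT exported: defaults apply.
os.environ.pop("MASTER_ADDR", None)
os.environ.pop("MASTER_PORT", None)
env = LightningEnvironment()
print(env.master_address())  # "127.0.0.1"
print(env.master_port())     # a random free port, cached so repeated calls agree

# With the variables exported, they take precedence.
os.environ["MASTER_ADDR"] = "10.0.0.5"
os.environ["MASTER_PORT"] = "29500"
env = LightningEnvironment()
print(env.master_address())  # "10.0.0.5"
print(env.master_port())     # 29500
```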
17 changes: 10 additions & 7 deletions pytorch_lightning/plugins/environments/slurm_environment.py
@@ -26,7 +26,10 @@ class SLURMEnvironment(ClusterEnvironment):
def __init__(self):
super().__init__()

def master_address(self):
def creates_children(self) -> bool:
return True

def master_address(self) -> str:
# figure out the root node addr
slurm_nodelist = os.environ.get("SLURM_NODELIST")
if slurm_nodelist:
@@ -39,7 +42,7 @@ def master_address(self):
log.debug(f"MASTER_ADDR: {os.environ['MASTER_ADDR']}")
return root_node

def master_port(self):
def master_port(self) -> int:
# -----------------------
# SLURM JOB = PORT number
# -----------------------
@@ -64,18 +67,18 @@ def master_port(self):

log.debug(f"MASTER_PORT: {os.environ['MASTER_PORT']}")

return default_port
return int(default_port)

def world_size(self):
return self._world_size
return None

def local_rank(self):
def local_rank(self) -> int:
return int(os.environ['SLURM_LOCALID'])

def node_rank(self):
def node_rank(self) -> int:
return int(os.environ['SLURM_NODEID'])

def resolve_root_node_address(self, root_node):
def resolve_root_node_address(self, root_node: str) -> str:
if '[' in root_node:
name, numbers = root_node.split('[', maxsplit=1)
number = numbers.split(',', maxsplit=1)[0]
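To make the SLURM-specific lookups above concrete, here is a short illustration with made-up SLURM variables; only the methods fully visible in this diff are exercised.

```python
import os

from pytorch_lightning.plugins.environments import SLURMEnvironment

# Values of the kind a SLURM job would export (made up for illustration).
os.environ["SLURM_LOCALID"] = "2"
os.environ["SLURM_NODEID"] = "1"

env = SLURMEnvironment()
print(env.creates_children())  # True -- SLURM itself launches one process per task
print(env.local_rank())        # 2, from SLURM_LOCALID
print(env.node_rank())         # 1, from SLURM_NODEID
```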
17 changes: 11 additions & 6 deletions pytorch_lightning/plugins/environments/torchelastic_environment.py
@@ -14,6 +14,7 @@

import logging
import os
from typing import Optional

from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment
from pytorch_lightning.utilities import rank_zero_warn
@@ -26,27 +27,31 @@ class TorchElasticEnvironment(ClusterEnvironment):
def __init__(self):
super().__init__()

def master_address(self):
def creates_children(self) -> bool:
return True

def master_address(self) -> str:
if "MASTER_ADDR" not in os.environ:
rank_zero_warn("MASTER_ADDR environment variable is not defined. Set as localhost")
os.environ["MASTER_ADDR"] = "127.0.0.1"
log.debug(f"MASTER_ADDR: {os.environ['MASTER_ADDR']}")
master_address = os.environ.get('MASTER_ADDR')
return master_address

def master_port(self):
def master_port(self) -> int:
if "MASTER_PORT" not in os.environ:
rank_zero_warn("MASTER_PORT environment variable is not defined. Set as 12910")
os.environ["MASTER_PORT"] = "12910"
log.debug(f"MASTER_PORT: {os.environ['MASTER_PORT']}")

port = os.environ.get('MASTER_PORT')
port = int(os.environ.get('MASTER_PORT'))
return port

def world_size(self):
return os.environ.get('WORLD_SIZE')
def world_size(self) -> Optional[int]:
world_size = os.environ.get('WORLD_SIZE')
return int(world_size) if world_size is not None else world_size

def local_rank(self):
def local_rank(self) -> int:
return int(os.environ['LOCAL_RANK'])

def node_rank(self) -> int:
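Similarly, a sketch of how the TorchElastic-provided variables map onto the typed return values introduced above (values are made up; note that `world_size()` now returns an `int` or `None`):

```python
import os

from pytorch_lightning.plugins.environments import TorchElasticEnvironment

# Variables of the kind torchelastic exports (values made up for illustration).
os.environ.update({
    "MASTER_ADDR": "10.0.0.2",
    "MASTER_PORT": "29400",
    "WORLD_SIZE": "8",
    "LOCAL_RANK": "3",
})

env = TorchElasticEnvironment()
print(env.master_address())  # "10.0.0.2"
print(env.master_port())     # 29400 (now an int)
print(env.world_size())      # 8 (int), or None when WORLD_SIZE is unset
print(env.local_rank())      # 3
```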
23 changes: 6 additions & 17 deletions pytorch_lightning/plugins/training_type/ddp.py
Expand Up @@ -30,20 +30,14 @@
from pytorch_lightning.plugins.environments.cluster_environment import ClusterEnvironment
from pytorch_lightning.plugins.training_type.parallel import ParallelPlugin
from pytorch_lightning.utilities import _HYDRA_AVAILABLE, _TORCH_GREATER_EQUAL_1_7, rank_zero_warn
from pytorch_lightning.utilities.distributed import (
find_free_network_port,
rank_zero_only,
ReduceOp,
sync_ddp_if_available,
)
from pytorch_lightning.utilities.distributed import rank_zero_only, ReduceOp, sync_ddp_if_available
from pytorch_lightning.utilities.exceptions import MisconfigurationException
from pytorch_lightning.utilities.seed import seed_everything

if _HYDRA_AVAILABLE:
from hydra.core.hydra_config import HydraConfig
from hydra.utils import get_original_cwd, to_absolute_path


log = logging.getLogger(__name__)


@@ -90,8 +84,7 @@ def setup(self, model):
self._model = model

# start the other scripts
# TODO: refactor and let generic cluster env hold the information about who spawns the processes
if os.environ.get("PL_IN_DDP_SUBPROCESS", "0") != "1":
if not self.cluster_environment.creates_children() and os.environ.get("PL_IN_DDP_SUBPROCESS", "0") != "1":
self._call_children_scripts()

# set the task idx
@@ -105,15 +98,12 @@ def _call_children_scripts(self):
self._has_spawned_children = True

# DDP Environment variables
os.environ["MASTER_ADDR"] = os.environ.get("MASTER_ADDR", "127.0.0.1")
os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", str(find_free_network_port()))
os.environ["MASTER_ADDR"] = self.cluster_environment.master_address()
os.environ["MASTER_PORT"] = str(self.cluster_environment.master_port())

# allow the user to pass the node rank
node_rank = "0"
node_rank = os.environ.get("NODE_RANK", node_rank)
node_rank = os.environ.get("GROUP_RANK", node_rank)
os.environ["NODE_RANK"] = node_rank
os.environ["LOCAL_RANK"] = "0"
os.environ["NODE_RANK"] = str(self.cluster_environment.node_rank())
os.environ["LOCAL_RANK"] = str(self.cluster_environment.local_rank())

# when user is using hydra find the absolute path
path_lib = os.path.abspath if not _HYDRA_AVAILABLE else to_absolute_path
@@ -209,7 +199,6 @@ def determine_ddp_device_ids(self):
return [self.root_device.index]

def init_ddp_connection(self, global_rank: int, world_size: int) -> None:
# TODO: From where to get cluster environment?
os.environ["MASTER_ADDR"] = str(self.cluster_environment.master_address())
os.environ["MASTER_PORT"] = str(self.cluster_environment.master_port())
os.environ["WORLD_SIZE"] = str(self.cluster_environment.world_size())
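The key behavioural change in `DDPPlugin.setup` above is the spawning guard. A condensed sketch of that condition follows; the helper name is hypothetical, as the real check lives inline in `setup`.

```python
import os


def should_spawn_children(cluster_environment) -> bool:
    # Spawn worker scripts only from the launcher process, and only when the
    # cluster environment does not create the worker processes itself
    # (with LightningEnvironment, Lightning spawns them; under SLURM or
    # TorchElastic the scheduler already did).
    already_in_subprocess = os.environ.get("PL_IN_DDP_SUBPROCESS", "0") == "1"
    return not cluster_environment.creates_children() and not already_in_subprocess
```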
12 changes: 3 additions & 9 deletions pytorch_lightning/plugins/training_type/ddp_spawn.py
@@ -30,13 +30,7 @@
from pytorch_lightning.utilities import _TORCH_GREATER_EQUAL_1_7
from pytorch_lightning.utilities.cloud_io import atomic_save
from pytorch_lightning.utilities.cloud_io import load as pl_load
from pytorch_lightning.utilities.distributed import (
find_free_network_port,
rank_zero_only,
rank_zero_warn,
ReduceOp,
sync_ddp_if_available,
)
from pytorch_lightning.utilities.distributed import rank_zero_only, rank_zero_warn, ReduceOp, sync_ddp_if_available
from pytorch_lightning.utilities.seed import seed_everything

log = logging.getLogger(__name__)
@@ -84,7 +78,7 @@ def distributed_sampler_kwargs(self):
def setup(self, model):
self._model = model

os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", str(find_free_network_port()))
os.environ["MASTER_PORT"] = str(self.cluster_environment.master_port())

# pass in a state q
smp = mp.get_context("spawn")
@@ -93,7 +87,7 @@ def set_world_ranks(self, process_idx):
def set_world_ranks(self, process_idx):
self.local_rank = process_idx
self.node_rank = self.cluster_environment.node_rank()
self.task_idx = self.cluster_local_rank
self.task_idx = self.cluster_environment.local_rank()
self.global_rank = self.node_rank * self.num_processes + self.local_rank
self.world_size = self.num_nodes * self.num_processes

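Finally, the rank arithmetic in `set_world_ranks` above is easy to sanity-check with a worked example (node and process counts are illustrative):

```python
# 2 nodes x 4 processes per node, as in set_world_ranks above.
num_nodes, num_processes = 2, 4
world_size = num_nodes * num_processes  # 8

for node_rank in range(num_nodes):
    for local_rank in range(num_processes):
        global_rank = node_rank * num_processes + local_rank
        print(f"node {node_rank}, local {local_rank} -> global {global_rank}")
# node 0 -> global ranks 0..3, node 1 -> global ranks 4..7
```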