
Commit

Merge branch 'master' into tensorboard-logger-ddp-fix--1375
williamFalcon committed Apr 5, 2020
2 parents 7a7eba3 + b18accc commit 2b49800
Showing 9 changed files with 163 additions and 83 deletions.
19 changes: 16 additions & 3 deletions CHANGELOG.md
@@ -26,6 +26,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added model configuration checking ([#1199](https://github.com/PyTorchLightning/pytorch-lightning/pull/1199))
- Added support for optimizer frequencies through `LightningModule.configure_optimizers()` ([#1269](https://github.com/PyTorchLightning/pytorch-lightning/pull/1269))
- Added option to run without an optimizer by returning `None` from `configure_optimizers`. ([#1279](https://github.com/PyTorchLightning/pytorch-lightning/pull/1279))
- Added a warning when the number of data loader workers is small. ([#1378](https://github.com/PyTorchLightning/pytorch-lightning/pull/1378))

### Changed

@@ -42,6 +43,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Enhanced load_from_checkpoint to also forward params to the model ([#1307](https://github.com/PyTorchLightning/pytorch-lightning/pull/1307))
- Made `evaluate` method private >> `Trainer._evaluate(...)`. ([#1260](https://github.com/PyTorchLightning/pytorch-lightning/pull/1260))
- Simplify the PL examples structure (shallower and more readable) ([#1247](https://github.com/PyTorchLightning/pytorch-lightning/pull/1247))
- Changed min max gpu memory to be on their own plots ([#1358](https://github.com/PyTorchLightning/pytorch-lightning/pull/1358))
- Remove `.item` which causes sync issues ([#1254](https://github.com/PyTorchLightning/pytorch-lightning/pull/1254))
- Changed smoothing in TQDM to decrease variability of time remaining between training / eval ([#1194](https://github.com/PyTorchLightning/pytorch-lightning/pull/1194))
- Change default logger to dedicated one ([#1064](https://github.com/PyTorchLightning/pytorch-lightning/pull/1064))

### Deprecated

@@ -56,18 +61,26 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

### Fixed

- `Trainer.add_argparse_args` classmethod fixed. Now it adds a type for the arguments ([#1147](https://github.com/PyTorchLightning/pytorch-lightning/pull/1147)).
- Fixed `model_checkpoint` when saving all models ([#1359](https://github.com/PyTorchLightning/pytorch-lightning/pull/1359))
- `Trainer.add_argparse_args` classmethod fixed. Now it adds a type for the arguments ([#1147](https://github.com/PyTorchLightning/pytorch-lightning/pull/1147))
- Fixed bug related to type checking of `ReduceLROnPlateau` lr schedulers ([#1114](https://github.com/PyTorchLightning/pytorch-lightning/issues/1114))
- Fixed a bug to ensure lightning checkpoints to be backward compatible ([#1132](https://github.com/PyTorchLightning/pytorch-lightning/pull/1132))
- Fixed a bug that created an extra dataloader with active `reload_dataloaders_every_epoch` ([#1181](https://github.com/PyTorchLightning/pytorch-lightning/issues/1181))
- Fixed all warnings and errors in the docs build process ([#1191](https://github.com/PyTorchLightning/pytorch-lightning/pull/1191))
- Fixed an issue where `val_percent_check=0` would not disable validation ([#1251](https://github.com/PyTorchLightning/pytorch-lightning/pull/1251))
- Fixed average of incomplete `TensorRunningMean` ([#1309](https://github.com/PyTorchLightning/pytorch-lightning/pull/1309))
- Fixed `WandbLogger.watch` with `wandb.init()` ([#1311](https://github.com/PyTorchLightning/pytorch-lightning/pull/1311))
- Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented ([#1235](https://github.com/PyTorchLightning/pytorch-lightning/pull/1235))
- Fixed TensorBoard logger error: `lightning_logs` directory not existing in multi-node DDP on nodes with rank != 0 ([#1375](https://github.com/PyTorchLightning/pytorch-lightning/issues/1375))
- Fixed a bug that would cause `trainer.test()` to run on the validation set when overloading `validation_epoch_end` and `test_end` ([#1353](https://github.com/PyTorchLightning/pytorch-lightning/pull/1353))
- Fixed `WandbLogger.watch` ([#1311](https://github.com/PyTorchLightning/pytorch-lightning/pull/1311))
- Fixed `WandbLogger.watch` - use of the watch method without importing `wandb` ([#1311](https://github.com/PyTorchLightning/pytorch-lightning/pull/1311))
- Fixed `WandbLogger` to be used with 'ddp' - allow reinits in sub-processes ([#1149](https://github.com/PyTorchLightning/pytorch-lightning/pull/1149), [#1360](https://github.com/PyTorchLightning/pytorch-lightning/pull/1360))
- Made `training_epoch_end` behave like `validation_epoch_end` ([#1357](https://github.com/PyTorchLightning/pytorch-lightning/pull/1357))
- Fixed `fast_dev_run` running validation twice ([#1365](https://github.com/PyTorchLightning/pytorch-lightning/pull/1365))
- Fixed pickle error from quick patch `__code__` ([#1352](https://github.com/PyTorchLightning/pytorch-lightning/pull/1352))
- Fixed memory leak on GPU0 ([#1094](https://github.com/PyTorchLightning/pytorch-lightning/pull/1094), [#1349](https://github.com/PyTorchLightning/pytorch-lightning/pull/1349))
- Fixed checkpointing interval ([#1272](https://github.com/PyTorchLightning/pytorch-lightning/pull/1272))
- Fixed validation and training loops to run the partial dataset ([#1192](https://github.com/PyTorchLightning/pytorch-lightning/pull/1192))
- Fixed running `on_validation_end` only on main process in DDP ([#1125](https://github.com/PyTorchLightning/pytorch-lightning/pull/1125))

## [0.7.1] - 2020-03-07

44 changes: 23 additions & 21 deletions docs/source/callbacks.rst
@@ -7,35 +7,37 @@ Callbacks
=========

Lightning has a callback system to execute arbitrary code. Callbacks should capture NON-ESSENTIAL
logic that is NOT required for your LightningModule to run.
logic that is NOT required for your :class:`~pytorch_lightning.core.LightningModule` to run.

An overall Lightning system should have:

1. Trainer for all engineering
2. LightningModule for all research code.
3. Callbacks for non-essential code.

Example

.. code-block:: python
import pytorch_lightning as pl
class MyPrintingCallback(pl.Callback):
def on_init_start(self, trainer):
print('Starting to init trainer!')
def on_init_end(self, trainer):
print('trainer is init now')
def on_train_end(self, trainer, pl_module):
print('do something when training ends')
# pass to trainer
trainer = pl.Trainer(callbacks=[MyPrintingCallback()])
We successfully extended functionality without polluting our super clean LightningModule research code
Example:

.. doctest::

>>> import pytorch_lightning as pl
>>> class MyPrintingCallback(pl.Callback):
...
... def on_init_start(self, trainer):
... print('Starting to init trainer!')
...
... def on_init_end(self, trainer):
... print('trainer is init now')
...
... def on_train_end(self, trainer, pl_module):
... print('do something when training ends')
...
>>> trainer = pl.Trainer(callbacks=[MyPrintingCallback()])
Starting to init trainer!
trainer is init now

We successfully extended functionality without polluting our super clean
:class:`~pytorch_lightning.core.LightningModule` research code.
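
The same pattern extends to any non-essential logic. Below is a minimal sketch of a second
callback (the ``EpochTimerCallback`` name is illustrative, and it assumes the
``on_epoch_start``/``on_epoch_end`` hooks exposed by ``pytorch_lightning.Callback``):

.. code-block:: python

    import time
    import pytorch_lightning as pl

    class EpochTimerCallback(pl.Callback):
        """Non-essential logic: report how long each epoch took."""

        def on_epoch_start(self, trainer, pl_module):
            self._epoch_start = time.time()

        def on_epoch_end(self, trainer, pl_module):
            print(f'epoch took {time.time() - self._epoch_start:.1f}s')

    # pass the callback to the trainer; the research code stays untouched
    trainer = pl.Trainer(callbacks=[EpochTimerCallback()])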

---------

27 changes: 14 additions & 13 deletions docs/source/early_stopping.rst
@@ -11,24 +11,23 @@ Enable Early Stopping
---------------------
There are two ways to enable early stopping.

.. seealso::
:class:`~pytorch_lightning.trainer.trainer.Trainer`
.. doctest::

.. code-block:: python
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import EarlyStopping

# A) Set early_stop_callback to True. Will look for 'val_loss'
# in validation_epoch_end() return dict. If it is not found an error is raised.
trainer = Trainer(early_stop_callback=True)
>>> trainer = Trainer(early_stop_callback=True)
# B) Or configure your own callback
early_stop_callback = EarlyStopping(
monitor='val_loss',
min_delta=0.00,
patience=3,
verbose=False,
mode='min'
)
trainer = Trainer(early_stop_callback=early_stop_callback)
>>> early_stop_callback = EarlyStopping(
... monitor='val_loss',
... min_delta=0.00,
... patience=3,
... verbose=False,
... mode='min'
... )
>>> trainer = Trainer(early_stop_callback=early_stop_callback)

In any case, the callback will fall back to the training metrics (returned in
:meth:`~pytorch_lightning.core.lightning.LightningModule.training_step`,
@@ -37,6 +36,8 @@ looking for a key to monitor if validation is disabled or
:meth:`~pytorch_lightning.core.lightning.LightningModule.validation_epoch_end`
is not defined.
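
For example, early stopping can then monitor a metric produced by the training loop. A minimal
sketch (it assumes the loss returned by ``training_step`` is exposed to callbacks under the
``'loss'`` key, which is an assumption about how the trainer names that metric):

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import EarlyStopping

    # the model defines no validation_step / validation_epoch_end,
    # so the callback falls back to the training metrics
    early_stop_callback = EarlyStopping(monitor='loss', patience=5, mode='min')
    trainer = Trainer(early_stop_callback=early_stop_callback)
    # trainer.fit(model)  # any LightningModule without a validation loop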

.. seealso::
:class:`~pytorch_lightning.trainer.trainer.Trainer`

Disable Early Stopping
----------------------
4 changes: 3 additions & 1 deletion pytorch_lightning/callbacks/base.py
@@ -1,7 +1,9 @@
r"""
Callback Base
=============
Abstract base class used to build new callbacks.
Abstract base class used to build new callbacks.
"""

import abc
22 changes: 11 additions & 11 deletions pytorch_lightning/callbacks/early_stopping.py
@@ -1,6 +1,7 @@
r"""
Early Stopping
==============
Stop training when a monitored quantity has stopped improving.
"""
@@ -17,31 +18,30 @@ class EarlyStopping(Callback):
r"""
Args:
monitor (str): quantity to be monitored. Default: ``'val_loss'``.
min_delta (float): minimum change in the monitored quantity
monitor: quantity to be monitored. Default: ``'val_loss'``.
min_delta: minimum change in the monitored quantity
to qualify as an improvement, i.e. an absolute
change of less than `min_delta`, will count as no
improvement. Default: ``0``.
patience (int): number of epochs with no improvement
patience: number of epochs with no improvement
after which training will be stopped. Default: ``0``.
verbose (bool): verbosity mode. Default: ``False``.
mode (str): one of {auto, min, max}. In `min` mode,
verbose: verbosity mode. Default: ``False``.
mode: one of {auto, min, max}. In `min` mode,
training will stop when the quantity
monitored has stopped decreasing; in `max`
mode it will stop when the quantity
monitored has stopped increasing; in `auto`
mode, the direction is automatically inferred
from the name of the monitored quantity. Default: ``'auto'``.
strict (bool): whether to crash the training if `monitor` is
strict: whether to crash the training if `monitor` is
not found in the metrics. Default: ``True``.
Example::
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping
early_stopping = EarlyStopping('val_loss')
Trainer(early_stop_callback=early_stopping)
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import EarlyStopping
>>> early_stopping = EarlyStopping('val_loss')
>>> trainer = Trainer(early_stop_callback=early_stopping)
"""

def __init__(self, monitor: str = 'val_loss', min_delta: float = 0.0, patience: int = 0,
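
A usage sketch of the arguments documented above (the ``val_acc`` metric name is illustrative
and assumes the model reports it from its validation loop):

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import EarlyStopping

    # stop when validation accuracy has not improved by at least 0.01
    # for 3 consecutive epochs; mode='max' because higher accuracy is better
    early_stopping = EarlyStopping(
        monitor='val_acc',
        min_delta=0.01,
        patience=3,
        mode='max',
        strict=False,  # do not crash if 'val_acc' is missing from the metrics
    )
    trainer = Trainer(early_stop_callback=early_stopping)
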
13 changes: 9 additions & 4 deletions pytorch_lightning/callbacks/gradient_accumulation_scheduler.py
@@ -1,7 +1,9 @@
r"""
Gradient Accumulator
====================
Change gradient accumulation factor according to scheduling.
"""

import warnings
@@ -22,12 +24,15 @@ class GradientAccumulationScheduler(Callback):
Example::
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import GradientAccumulationScheduler
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import GradientAccumulationScheduler
# at epoch 5 start accumulating every 2 batches
accumulator = GradientAccumulationScheduler(scheduling: {5: 2})
Trainer(accumulate_grad_batches=accumulator)
>>> accumulator = GradientAccumulationScheduler(scheduling={5: 2})
>>> trainer = Trainer(callbacks=[accumulator])
# alternatively, pass the scheduling dict directly to the Trainer
>>> trainer = Trainer(accumulate_grad_batches={5: 2})
"""

def __init__(self, scheduling: dict):
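
The ``scheduling`` dict maps an epoch to the accumulation factor that takes effect from that
epoch onward. A short sketch (the milestone values are illustrative):

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import GradientAccumulationScheduler

    # from epoch 5: accumulate 4 batches; from epoch 10: accumulate 8 batches;
    # earlier epochs default to a factor of 1 (no accumulation)
    accumulator = GradientAccumulationScheduler(scheduling={5: 4, 10: 8})
    trainer = Trainer(callbacks=[accumulator])

    # the effective batch size is the dataloader batch size times the current
    # accumulation factor, e.g. 32 * 8 = 256 once epoch 10 is reached
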
60 changes: 33 additions & 27 deletions pytorch_lightning/callbacks/model_checkpoint.py
@@ -3,6 +3,7 @@
===================
Automatically save model checkpoints during training.
"""

import os
@@ -26,18 +27,19 @@ class ModelCheckpoint(Callback):
Example::
# no path
ModelCheckpoint()
# saves like /my/path/epoch_0.ckpt
# save any arbitrary metrics like and val_loss, etc in name
ModelCheckpoint(filepath='/my/path/{epoch}-{val_loss:.2f}-{other_metric:.2f}')
# saves file like: /my/path/epoch=2-val_loss=0.2_other_metric=0.3.ckpt
# custom path
# saves a file like: my/path/epoch_0.ckpt
>>> checkpoint_callback = ModelCheckpoint('my/path/')
# save any arbitrary metrics like `val_loss`, etc. in name
# saves a file like: my/path/epoch=2-val_loss=0.2_other_metric=0.3.ckpt
>>> checkpoint_callback = ModelCheckpoint(
... filepath='my/path/{epoch}-{val_loss:.2f}-{other_metric:.2f}'
... )
monitor (str): quantity to monitor.
verbose (bool): verbosity mode, False or True.
save_top_k (int): if `save_top_k == k`,
monitor: quantity to monitor.
verbose: verbosity mode. Default: ``False``.
save_top_k: if `save_top_k == k`,
the best k models according to
the quantity monitored will be saved.
if ``save_top_k == 0``, no models are saved.
@@ -46,38 +48,41 @@ class ModelCheckpoint(Callback):
if ``save_top_k >= 2`` and the callback is called multiple
times inside an epoch, the name of the saved file will be
appended with a version count starting with `v0`.
mode (str): one of {auto, min, max}.
mode: one of {auto, min, max}.
If ``save_top_k != 0``, the decision
to overwrite the current save file is made
based on either the maximization or the
minimization of the monitored quantity. For `val_acc`,
this should be `max`, for `val_loss` this should
be `min`, etc. In `auto` mode, the direction is
automatically inferred from the name of the monitored quantity.
save_weights_only (bool): if True, then only the model's weights will be
saved (`model.save_weights(filepath)`), else the full model
is saved (`model.save(filepath)`).
period (int): Interval (number of epochs) between checkpoints.
save_weights_only: if ``True``, then only the model's weights will be
saved (``model.save_weights(filepath)``), else the full model
is saved (``model.save(filepath)``).
period: Interval (number of epochs) between checkpoints.
Example::
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import ModelCheckpoint
# saves checkpoints to my_path whenever 'val_loss' has a new min
checkpoint_callback = ModelCheckpoint(filepath='my_path')
Trainer(checkpoint_callback=checkpoint_callback)
# saves checkpoints to 'my/path/' whenever 'val_loss' has a new min
>>> checkpoint_callback = ModelCheckpoint(filepath='my/path/')
>>> trainer = Trainer(checkpoint_callback=checkpoint_callback)
# save epoch and val_loss in name
ModelCheckpoint(filepath='/my/path/here/sample-mnist_{epoch:02d}-{val_loss:.2f}')
# saves file like: /my/path/here/sample-mnist_epoch=02_val_loss=0.32.ckpt
# saves a file like: my/path/sample-mnist_epoch=02_val_loss=0.32.ckpt
>>> checkpoint_callback = ModelCheckpoint(
... filepath='my/path/sample-mnist_{epoch:02d}-{val_loss:.2f}'
... )
"""

def __init__(self, filepath, monitor: str = 'val_loss', verbose: bool = False,
def __init__(self, filepath: str, monitor: str = 'val_loss', verbose: bool = False,
save_top_k: int = 1, save_weights_only: bool = False,
mode: str = 'auto', period: int = 1, prefix: str = ''):
super().__init__()
if save_top_k and os.path.isdir(filepath) and len(os.listdir(filepath)) > 0:
if save_top_k > 0 and os.path.isdir(filepath) and len(os.listdir(filepath)) > 0:
warnings.warn(
f"Checkpoint directory {filepath} exists and is not empty with save_top_k != 0."
"All files in this directory will be deleted when a checkpoint is saved!"
@@ -137,9 +142,10 @@ def check_monitor_top_k(self, current):
return self.monitor_op(current, self.best_k_models[self.kth_best_model])

def format_checkpoint_name(self, epoch, metrics, ver=None):
"""Generate a filename according define template.
"""Generate a filename according to the defined template.
Example::
Examples:
>>> tmpdir = os.path.dirname(__file__)
>>> ckpt = ModelCheckpoint(os.path.join(tmpdir, '{epoch}'))
>>> os.path.basename(ckpt.format_checkpoint_name(0, {}))
@@ -213,7 +219,7 @@ def on_validation_end(self, trainer, pl_module):

def _do_check_save(self, filepath, current, epoch):
# remove kth
if len(self.best_k_models) == self.save_top_k:
if len(self.best_k_models) == self.save_top_k and self.save_top_k > 0:
delpath = self.kth_best_model
self.best_k_models.pop(self.kth_best_model)
self._del_model(delpath)
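
Putting the arguments above together, a sketch of a typical configuration (the directory and
metric names are illustrative, and ``val_loss`` is assumed to come from the validation loop):

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint

    # keep the 3 checkpoints with the lowest val_loss, checking every 2 epochs;
    # filenames are rendered from the template with the epoch and current val_loss
    checkpoint_callback = ModelCheckpoint(
        filepath='my/path/{epoch:02d}-{val_loss:.2f}',
        monitor='val_loss',
        save_top_k=3,
        mode='min',
        period=2,
    )
    trainer = Trainer(checkpoint_callback=checkpoint_callback)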