
auto_lr_find=True doesn't work with early_stop_callback #1685

Closed
hirune924 opened this issue May 1, 2020 · 1 comment
Labels
bug Something isn't working help wanted Open to be worked on

🐛 Bug

When I use auto_lr_find=True together with early_stop_callback, I get the following error:

Traceback (most recent call last):
  File "gpu_template.py", line 92, in <module>
    main(hyperparams)
  File "gpu_template.py", line 53, in main
    trainer.fit(model)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 734, in fit
    self._run_lr_finder_internally(model)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py", line 31, in _run_lr_finder_internally
    lr_finder = self.lr_find(model)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py", line 164, in lr_find
    self.restore(str(save_path), on_gpu=self.on_gpu)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 289, in restore
    self.restore_training_state(checkpoint)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 372, in restore_training_state
    self.early_stop_callback.wait = checkpoint['early_stop_callback_wait']
KeyError: 'early_stop_callback_wait'
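The failure mode visible in the traceback: lr_find saves a checkpoint before sweeping learning rates and restores it afterwards, but restore_training_state indexes that checkpoint with checkpoint['early_stop_callback_wait'], a key the lr_find checkpoint never wrote. A minimal sketch of the same pattern and a defensive fix with dict.get (restore_wait and the state dict are hypothetical names for illustration, not Lightning's API):

```python
def restore_wait(checkpoint, early_stop_state):
    """Restore the early-stopping wait counter from a checkpoint dict.

    The checkpoint written by the LR finder may lack early-stopping keys,
    so fall back to the current value instead of raising KeyError.
    """
    early_stop_state['wait'] = checkpoint.get(
        'early_stop_callback_wait',      # present in normal checkpoints
        early_stop_state.get('wait', 0)  # fallback for lr_find checkpoints
    )
    return early_stop_state
```

With a plain `checkpoint['early_stop_callback_wait']` lookup, the second call below would raise the KeyError shown above; with `dict.get` it keeps the existing counter.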

To Reproduce

Run the code sample below with:

python gpu_template.py --gpus=0

The code sample is a slightly modified version of:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/gpu_template.py

Code sample

"""
Runs a model on a single node across multiple gpus.
"""
import os
from argparse import ArgumentParser

import numpy as np
import torch

import pytorch_lightning as pl
from pl_examples.models.lightning_template import LightningTemplateModel
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.callbacks import EarlyStopping

SEED = 2334
torch.manual_seed(SEED)
np.random.seed(SEED)


def main(hparams):
    """
    Main training routine specific for this project
    :param hparams:
    """
    early_stop_callback = EarlyStopping(
        monitor='val_loss',
        patience=20,
        min_delta=0.0,
        strict=True,
        verbose=True,
        mode='min'
    )
    # ------------------------
    # 1 INIT LIGHTNING MODEL
    # ------------------------
    model = LightningTemplateModel(hparams)

    # ------------------------
    # 2 INIT TRAINER
    # ------------------------
    trainer = pl.Trainer(
        max_epochs=hparams.epochs,
        gpus=hparams.gpus,
        distributed_backend=hparams.distributed_backend,
        precision=16 if hparams.use_16bit else 32,
        auto_lr_find=True,
        early_stop_callback=early_stop_callback,
    )

    # ------------------------
    # 3 START TRAINING
    # ------------------------
    trainer.fit(model)


if __name__ == '__main__':
    # ------------------------
    # TRAINING ARGUMENTS
    # ------------------------
    # these are project-wide arguments

    root_dir = os.path.dirname(os.path.realpath(__file__))
    parent_parser = ArgumentParser(add_help=False)

    # gpu args
    parent_parser.add_argument(
        '--gpus',
        type=int,
        default=2,
        help='how many gpus'
    )
    parent_parser.add_argument(
        '--distributed_backend',
        type=str,
        default='dp',
        help='supports three options dp, ddp, ddp2'
    )
    parent_parser.add_argument(
        '--use_16bit',
        dest='use_16bit',
        action='store_true',
        help='if true uses 16 bit precision'
    )

    # each LightningModule defines arguments relevant to it
    parser = LightningTemplateModel.add_model_specific_args(parent_parser, root_dir)
    hyperparams = parser.parse_args()

    # ---------------------
    # RUN TRAINING
    # ---------------------
    main(hyperparams)

Expected behavior

trainer.fit(model) should run the learning-rate finder, restore the training state, and then start training with early stopping enabled, without raising a KeyError.

Environment

  • CUDA:
    • GPU:
      • GeForce GTX 1080 Ti
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.5.0
    • pytorch-lightning: 0.7.5
    • tensorboard: 2.2.1
    • tqdm: 4.42.1
  • System:

Additional context

@hirune924 hirune924 added bug Something isn't working help wanted Open to be worked on labels May 1, 2020
@hirune924 (Author)

This problem appears to have been fixed by #1676.
