
auto_lr_find=True doesn't work with early_stop_callback #1685

Closed
hirune924 opened this issue May 1, 2020 · 1 comment
Labels
bug Something isn't working help wanted Open to be worked on

🐛 Bug

When I use auto_lr_find=True together with early_stop_callback, I get the following error:

Traceback (most recent call last):
  File "gpu_template.py", line 92, in <module>
    main(hyperparams)
  File "gpu_template.py", line 53, in main
    trainer.fit(model)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 734, in fit
    self._run_lr_finder_internally(model)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py", line 31, in _run_lr_finder_internally
    lr_finder = self.lr_find(model)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/lr_finder.py", line 164, in lr_find
    self.restore(str(save_path), on_gpu=self.on_gpu)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 289, in restore
    self.restore_training_state(checkpoint)
  File "/home/hirune/anaconda3/envs/PANDA/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 372, in restore_training_state
    self.early_stop_callback.wait = checkpoint['early_stop_callback_wait']
KeyError: 'early_stop_callback_wait'
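The failure mode visible in the traceback: lr_find saves a checkpoint before sweeping learning rates and restores it afterwards, but restore_training_state indexes that checkpoint with checkpoint['early_stop_callback_wait'], a key the lr_find checkpoint never wrote. A minimal sketch of the same pattern and a defensive fix with dict.get (restore_wait and the state dict are hypothetical names for illustration, not Lightning's API):

```python
def restore_wait(checkpoint, early_stop_state):
    """Restore the early-stopping wait counter from a checkpoint dict.

    The checkpoint written by the LR finder may lack early-stopping keys,
    so fall back to the current value instead of raising KeyError.
    """
    early_stop_state['wait'] = checkpoint.get(
        'early_stop_callback_wait',      # present in normal checkpoints
        early_stop_state.get('wait', 0)  # fallback for lr_find checkpoints
    )
    return early_stop_state
```

With a plain `checkpoint['early_stop_callback_wait']` lookup, the second call below would raise the KeyError shown above; with `dict.get` it keeps the existing counter.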

To Reproduce

Run the code sample below with:

python gpu_template.py --gpus=0

The code sample is a slightly modified version of:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/gpu_template.py

Code sample

"""
Runs a model on a single node across multiple gpus.
"""
import os
from argparse import ArgumentParser

import numpy as np
import torch

import pytorch_lightning as pl
from pl_examples.models.lightning_template import LightningTemplateModel
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.callbacks import EarlyStopping

SEED = 2334
torch.manual_seed(SEED)
np.random.seed(SEED)


def main(hparams):
    """
    Main training routine specific for this project
    :param hparams:
    """
    early_stop_callback = EarlyStopping(
        monitor='val_loss',
        patience=20,
        min_delta=0.0,
        strict=True,
        verbose=True,
        mode='min'
    )
    # ------------------------
    # 1 INIT LIGHTNING MODEL
    # ------------------------
    model = LightningTemplateModel(hparams)

    # ------------------------
    # 2 INIT TRAINER
    # ------------------------
    trainer = pl.Trainer(
        max_epochs=hparams.epochs,
        gpus=hparams.gpus,
        distributed_backend=hparams.distributed_backend,
        precision=16 if hparams.use_16bit else 32,
        auto_lr_find=True,
        early_stop_callback=early_stop_callback,
    )

    # ------------------------
    # 3 START TRAINING
    # ------------------------
    trainer.fit(model)


if __name__ == '__main__':
    # ------------------------
    # TRAINING ARGUMENTS
    # ------------------------
    # these are project-wide arguments

    root_dir = os.path.dirname(os.path.realpath(__file__))
    parent_parser = ArgumentParser(add_help=False)

    # gpu args
    parent_parser.add_argument(
        '--gpus',
        type=int,
        default=2,
        help='how many gpus'
    )
    parent_parser.add_argument(
        '--distributed_backend',
        type=str,
        default='dp',
        help='supports three options dp, ddp, ddp2'
    )
    parent_parser.add_argument(
        '--use_16bit',
        dest='use_16bit',
        action='store_true',
        help='if true uses 16 bit precision'
    )

    # each LightningModule defines arguments relevant to it
    parser = LightningTemplateModel.add_model_specific_args(parent_parser, root_dir)
    hyperparams = parser.parse_args()

    # ---------------------
    # RUN TRAINING
    # ---------------------
    main(hyperparams)

Expected behavior

trainer.fit(model) should run the learning-rate finder, restore the training state, and then start training with early stopping enabled, without raising a KeyError.

Environment

  • CUDA:
    • GPU:
      • GeForce GTX 1080 Ti
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.5.0
    • pytorch-lightning: 0.7.5
    • tensorboard: 2.2.1
    • tqdm: 4.42.1
  • System:

Additional context

@hirune924 hirune924 added bug Something isn't working help wanted Open to be worked on labels May 1, 2020
@hirune924 (Author)

This problem appears to have been fixed by #1676.
