
Early stopping conditioned on metric val_loss isn't recognised when setting the val_check_interval #490

Closed
ryanwongsa opened this issue Nov 10, 2019 · 15 comments · Fixed by #492 or #1458
Labels
bug Something isn't working

Comments

@ryanwongsa
Contributor

ryanwongsa commented Nov 10, 2019

Describe the bug
Training stops when setting val_check_interval<1.0 in the Trainer class as it doesn't recognise val_loss. I get the following warning at the end of the 3rd epoch:

Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,train_loss

To Reproduce
Steps to reproduce the behavior:

  1. Run the CoolModel example but change the trainer line to
    trainer = Trainer(val_check_interval=0.5, default_save_path="test")
    (a condensed sketch of this reproduction follows these steps)
  2. Training will stop at the end of the third epoch and the above warning will show.
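
For reference, a condensed sketch of the reproduction (CoolModel stands in for the LightningModule from the README example and is assumed to be defined as there; only the Trainer line differs):

    from pytorch_lightning import Trainer

    # CoolModel is the LightningModule from the README example (not shown here).
    model = CoolModel()

    # Only this line differs from the README: run validation twice per epoch.
    trainer = Trainer(val_check_interval=0.5, default_save_path="test")
    trainer.fit(model)

    # Observed result: training halts at the end of epoch 3 with
    # RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not
    # available. Available metrics are: loss,train_loss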

Expected behavior
Training shouldn't stop and val_loss should be recognised.

Desktop (please complete the following information):

  • VM: Google Colab
  • Version: 0.5.3.2

Additional context
This doesn't happen with 0.5.2.1, although it looks like something has changed with the model saving mechanism, since it only seems to save the best model in 0.5.3.2.

EDIT: Also seems to happen when setting train_percent_check<1.0

ryanwongsa added the bug (Something isn't working) label Nov 10, 2019
@williamFalcon
Contributor

can you post your test_end step?

@ryanwongsa
Contributor Author

I didn't use a test set since it is optional. The default MNIST example in the README will reproduce the behaviour when changing the trainer line to:

trainer = Trainer(val_check_interval=0.5, default_save_path="log_dir")
# or
trainer = Trainer(train_percent_check=0.5, default_save_path="log_dir")

@williamFalcon
Contributor

sorry, meant validation_end

@ryanwongsa
Contributor Author

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

I tried changing 'avg_val_loss' -> 'val_loss' but the same issue occurs.

@williamFalcon
Contributor

it should be val_loss

@ryanwongsa
Contributor Author

I tried it with val_loss too.

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

The issue still occurs.

The issue only goes away when using the default val_check_interval and train_percent_check in the Trainer.

@williamFalcon
Contributor

ok got it. can you share the stacktrace?

@ryanwongsa
Contributor Author

ryanwongsa commented Nov 11, 2019

There is no error, just a warning at the end of epoch 3, and then training stops.

Epoch 3: : 1894batch [00:04, 403.95batch/s, batch_nb=18, loss=1.014, v_nb=0] /usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,train_loss
  RuntimeWarning)

@kuynzereb
Contributor

It looks like the problem is that there is only one self.callback_metrics which is sometimes overwritten by self.run_training_batch and sometimes by self.run_evaluation. At the same time, early stopping callback uses self.callback_metrics at the end of the training epoch. And the problem is that there can be no validation run at the last training batch. In that case self.callback_metrics will contain only the metrics from the last training batch.

If it is true, we can just force validation computation at the end of the training epoch.
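
A self-contained toy illustration of that flow (the method names follow this comment, not the actual Trainer code):

    class ToyTrainer:
        def __init__(self):
            self.callback_metrics = {}

        def run_training_batch(self, loss):
            # replaces, rather than merges into, the shared dict
            self.callback_metrics = {'loss': loss, 'train_loss': loss}

        def run_evaluation(self, val_loss):
            self.callback_metrics = {'val_loss': val_loss}

        def end_of_epoch_early_stop_check(self, monitor='val_loss'):
            if monitor not in self.callback_metrics:
                available = ','.join(self.callback_metrics)
                print(f'Early stopping conditioned on metric `{monitor}` which is'
                      f' not available. Available metrics are: {available}')


    trainer = ToyTrainer()
    trainer.run_evaluation(val_loss=0.9)     # a mid-epoch validation run writes val_loss
    trainer.run_training_batch(loss=1.0)     # the last training batch overwrites it
    trainer.end_of_epoch_early_stop_check()  # prints the warning from this issue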

@williamFalcon
Contributor

@kuynzereb we shouldn't force computation. just partition self.callback_metrics to have

self.callback_metrics['train'] = {}
self.callback_metrics['val'] = {}
self.callback_metrics['test'] = {}

anyone interested in the PR?
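
If the dict were partitioned like that, the end-of-epoch check would look the monitored metric up in the validation bucket only (a hypothetical sketch of how it could be consumed; this is not the approach that was merged, see the next comment):

    # hypothetical partitioned dict as proposed above
    callback_metrics = {
        'train': {'loss': 1.0, 'train_loss': 1.0},
        'val': {'val_loss': 0.9},
        'test': {},
    }

    monitor_val = callback_metrics['val'].get('val_loss')
    if monitor_val is None:
        print('no validation metrics yet, skip the early-stopping check')
    else:
        print(f'check early stopping against val_loss={monitor_val}')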

@ryanwongsa
Contributor Author

I created PR #492, but made a simpler change: updating self.callback_metrics instead, since that doesn't require changes to the EarlyStopping callback. It also seems more consistent with how the other logging metrics are updated.
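
In terms of the toy sketch above, the idea amounts to merging into the shared dict instead of replacing it (an illustration of the approach, not the literal diff in #492):

    class PatchedToyTrainer(ToyTrainer):
        # merge instead of replace, so val_loss from a mid-epoch validation run
        # is still present when the end-of-epoch check looks it up
        def run_training_batch(self, loss):
            self.callback_metrics.update({'loss': loss, 'train_loss': loss})

        def run_evaluation(self, val_loss):
            self.callback_metrics.update({'val_loss': val_loss})


    trainer = PatchedToyTrainer()
    trainer.run_evaluation(val_loss=0.9)
    trainer.run_training_batch(loss=1.0)
    trainer.end_of_epoch_early_stop_check()  # no warning: val_loss is still available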

@S-aiueo32
Contributor

S-aiueo32 commented Dec 2, 2019

@williamFalcon @ryanwongsa
Has this issue been fixed?
I'm facing the same issue even after updating my local package with the changes from #492.
BTW, I don't have any validation loop or early stopping callback.

@jamesjjcondon
Contributor

FYI, I was still having this issue, which I traced to trainer.overfit_pc being too small relative to my batch size and number of GPUs. The validation sanity checks and validation_end seemed to get skipped (if I ran without early stopping), so my loss metrics dict was never returned. Solved purely by increasing overfit_pc.

@baldassarreFe
Contributor

I encountered the same problem when I set check_val_every_n_epoch>1 in the Trainer.
Validation only runs every few training epochs, but the following check is already performed after the first training epoch:

def check_metrics(self, logs):
    monitor_val = logs.get(self.monitor)
    error_msg = (f'Early stopping conditioned on metric `{self.monitor}`'
                 f' which is not available. Available metrics are:'
                 f' `{"`, `".join(list(logs.keys()))}`')

    if monitor_val is None:
        if self.strict:
            raise RuntimeError(error_msg)
        if self.verbose > 0:
            rank_zero_warn(error_msg, RuntimeWarning)

        return False

    return True

And if the strict parameter is set (as it is by default), the trainer terminates with that exception.

I think the problem is that EarlyStopping checks for the presence of the validation metric on_epoch_end rather than on_validation_end. If validation is performed at the end of every epoch this is not a problem, but it becomes one if validation runs less often. One could set strict to False, but I think users should still get a warning if the metric they try to monitor is not present after a validation run.
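
For reference, that workaround would look roughly like this (monitor, strict and verbose are the arguments visible in the snippet above; how the callback is handed to the Trainer has varied across versions):

    from pytorch_lightning.callbacks import EarlyStopping

    # Workaround sketch: don't raise when the monitored metric is absent on epochs
    # where no validation ran; just warn and skip the check.
    early_stop = EarlyStopping(monitor='val_loss', strict=False, verbose=True)

    # e.g. Trainer(early_stop_callback=early_stop) in the releases discussed here,
    # or Trainer(callbacks=[early_stop]) in later ones.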

Instead, a good solution is to make EarlyStopping use on_validation_end instead of on_epoch_end. I believe this was the intention of EarlyStopping from the beginning. I'm opening a PR to discuss this quick fix.
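
A rough sketch of the direction proposed here (the hook signature follows the Callback API; the body is simplified and hypothetical, not the merged implementation):

    from pytorch_lightning.callbacks import Callback


    class ValidationEndEarlyStopping(Callback):
        # Hypothetical sketch: evaluate the stopping condition whenever validation
        # ends, so the check only runs when validation metrics can actually exist.
        def __init__(self, monitor='val_loss', strict=True):
            self.monitor = monitor
            self.strict = strict

        def on_validation_end(self, trainer, pl_module):
            logs = trainer.callback_metrics
            if self.monitor not in logs:
                msg = (f'Early stopping conditioned on metric `{self.monitor}`'
                       f' which is not available.')
                if self.strict:
                    raise RuntimeError(msg)
                return
            # compare logs[self.monitor] against the best value seen so far and
            # set trainer.should_stop once patience is exhausted (omitted here)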

baldassarreFe added a commit to baldassarreFe/pytorch-lightning that referenced this issue Apr 11, 2020
`EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. 
In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`.
mergify bot pushed a commit that referenced this issue May 25, 2020
* Fixes #490

`EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. 
In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`.

* Highlighted that ES callback runs on val epochs in docstring

* Updated EarlyStopping in rst doc

* Update early_stopping.py

* Update early_stopping.rst

* Update early_stopping.rst

* Update early_stopping.rst

* Update early_stopping.rst

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update docs/source/early_stopping.rst

* fix doctest indentation warning

* Train loop calls early_stop.on_validation_end

* chlog

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
@csipapicsa

csipapicsa commented May 12, 2022

I had the same problem: I did a dummy slice for the validation set (e.g. X_test = X_test[10000:11000]) with indices that were too high for the dataset, which resulted in a completely empty set, and of course the network had nothing to validate on.

Maybe a warning message would be good if the test/train sets are empty.
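
A minimal guard along those lines (plain Python; X_test is the array from the example above):

    def check_split(name, data):
        # Catch an accidentally empty split (e.g. a slice past the end of the array)
        # up front, instead of discovering it later as missing validation metrics.
        if len(data) == 0:
            raise ValueError(f'{name} split is empty - check your slicing indices')


    check_split('validation', X_test[10000:11000])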
