
Early stopping conditioned on metric val_loss isn't recognised when setting the val_check_interval #490

Closed
ryanwongsa opened this issue Nov 10, 2019 · 15 comments · Fixed by #492 or #1458
Labels
bug Something isn't working

Comments

@ryanwongsa
Contributor

ryanwongsa commented Nov 10, 2019

Describe the bug
Training stops when setting val_check_interval<1.0 in the Trainer class as it doesn't recognise val_loss. I get the following warning at the end of the 3rd epoch:

Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,train_loss

To Reproduce
Steps to reproduce the behavior:

  1. Run the CoolModel example but change the trainer line to
    trainer = Trainer(val_check_interval=0.5, default_save_path="test")
    (a condensed sketch of this reproduction follows these steps)
  2. Training will stop at the end of the third epoch and the above warning will show.
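
For reference, a condensed sketch of the reproduction (CoolModel stands in for the LightningModule from the README example and is assumed to be defined as there; only the Trainer line differs):

    from pytorch_lightning import Trainer

    # CoolModel is the LightningModule from the README example (not shown here).
    model = CoolModel()

    # Only this line differs from the README: run validation twice per epoch.
    trainer = Trainer(val_check_interval=0.5, default_save_path="test")
    trainer.fit(model)

    # Observed result: training halts at the end of epoch 3 with
    # RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not
    # available. Available metrics are: loss,train_loss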

Expected behavior
Training shouldn't stop and val_loss should be recognised.

Desktop (please complete the following information):

  • VM: Google Colab
  • Version: 0.5.3.2

Additional context
This doesn't happen with 0.5.2.1, although it looks like something has changed with the model saving mechanism, since it only seems to save the best model in 0.5.3.2.

EDIT: Also seems to happen when setting train_percent_check<1.0

ryanwongsa added the bug (Something isn't working) label Nov 10, 2019
@williamFalcon
Contributor

can you post your test_end step?

@ryanwongsa
Contributor Author

I didn't use a test set since it is optional. The default MNIST example in the README will reproduce the behaviour when changing the trainer line to:

trainer = Trainer(val_check_interval=0.5, default_save_path="log_dir")
# or
trainer = Trainer(train_percent_check=0.5, default_save_path="log_dir")

@williamFalcon
Contributor

sorry, meant validation_end

@ryanwongsa
Contributor Author

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

I tried changing 'avg_val_loss' -> 'val_loss' but the same issue occurs.

@williamFalcon
Contributor

it should be val_loss

@ryanwongsa
Contributor Author

I tried it with val_loss too.

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

The issue still occurs.

The issue only goes away when using the default val_check_interval and train_percent_check in the Trainer.

@williamFalcon
Contributor

ok got it. can you share the stacktrace?

@ryanwongsa
Contributor Author

ryanwongsa commented Nov 11, 2019

There is no error, just a warning at the end of epoch 3, and then training stops.

Epoch 3: : 1894batch [00:04, 403.95batch/s, batch_nb=18, loss=1.014, v_nb=0] /usr/local/lib/python3.6/dist-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,train_loss
  RuntimeWarning)

@kuynzereb
Contributor

It looks like the problem is that there is only one self.callback_metrics which is sometimes overwritten by self.run_training_batch and sometimes by self.run_evaluation. At the same time, early stopping callback uses self.callback_metrics at the end of the training epoch. And the problem is that there can be no validation run at the last training batch. In that case self.callback_metrics will contain only the metrics from the last training batch.

If it is true, we can just force validation computation at the end of the training epoch.
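
A self-contained toy illustration of that flow (the method names follow this comment, not the actual Trainer code):

    class ToyTrainer:
        def __init__(self):
            self.callback_metrics = {}

        def run_training_batch(self, loss):
            # replaces, rather than merges into, the shared dict
            self.callback_metrics = {'loss': loss, 'train_loss': loss}

        def run_evaluation(self, val_loss):
            self.callback_metrics = {'val_loss': val_loss}

        def end_of_epoch_early_stop_check(self, monitor='val_loss'):
            if monitor not in self.callback_metrics:
                available = ','.join(self.callback_metrics)
                print(f'Early stopping conditioned on metric `{monitor}` which is'
                      f' not available. Available metrics are: {available}')


    trainer = ToyTrainer()
    trainer.run_evaluation(val_loss=0.9)     # a mid-epoch validation run writes val_loss
    trainer.run_training_batch(loss=1.0)     # the last training batch overwrites it
    trainer.end_of_epoch_early_stop_check()  # prints the warning from this issue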

@williamFalcon
Contributor

@kuynzereb we shouldn't force computation. just partition self.callback_metrics to have

self.callback_metrics['train'] = {}
self.callback_metrics['val'] = {}
self.callback_metrics['test'] = {}

anyone interested in the PR?
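
If the dict were partitioned like that, the end-of-epoch check would look the monitored metric up in the validation bucket only (a hypothetical sketch of how it could be consumed; this is not the approach that was merged, see the next comment):

    # hypothetical partitioned dict as proposed above
    callback_metrics = {
        'train': {'loss': 1.0, 'train_loss': 1.0},
        'val': {'val_loss': 0.9},
        'test': {},
    }

    monitor_val = callback_metrics['val'].get('val_loss')
    if monitor_val is None:
        print('no validation metrics yet, skip the early-stopping check')
    else:
        print(f'check early stopping against val_loss={monitor_val}')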

@ryanwongsa
Contributor Author

I created PR #492, but made a simpler change: updating self.callback_metrics instead, since that doesn't require changes to the EarlyStopping callback. It also seems more consistent with how the other logging metrics are updated.
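
In terms of the toy sketch above, the idea amounts to merging into the shared dict instead of replacing it (an illustration of the approach, not the literal diff in #492):

    class PatchedToyTrainer(ToyTrainer):
        # merge instead of replace, so val_loss from a mid-epoch validation run
        # is still present when the end-of-epoch check looks it up
        def run_training_batch(self, loss):
            self.callback_metrics.update({'loss': loss, 'train_loss': loss})

        def run_evaluation(self, val_loss):
            self.callback_metrics.update({'val_loss': val_loss})


    trainer = PatchedToyTrainer()
    trainer.run_evaluation(val_loss=0.9)
    trainer.run_training_batch(loss=1.0)
    trainer.end_of_epoch_early_stop_check()  # no warning: val_loss is still available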

@S-aiueo32
Contributor

S-aiueo32 commented Dec 2, 2019

@williamFalcon @ryanwongsa
Has this issue been fixed?
I'm facing the same issue even after updating my local package with the changes from #492.
BTW, I don't have any validation loop or early stopping callback.

@jamesjjcondon
Contributor

FYI, I was still having this issue, which I traced to trainer.overfit_pc being too small relative to my batch size and number of GPUs. The validation sanity checks and validation_end seemed to get skipped (if I ran without early stopping), so my loss metrics dict was never returned. Solved purely by increasing overfit_pc.

@baldassarreFe
Contributor

I encountered the same problem when I set check_val_every_n_epoch>1 in the Trainer.
Validation only runs every few training epochs, but the following check is already performed after the first training epoch:

def check_metrics(self, logs):
    monitor_val = logs.get(self.monitor)
    error_msg = (f'Early stopping conditioned on metric `{self.monitor}`'
                 f' which is not available. Available metrics are:'
                 f' `{"`, `".join(list(logs.keys()))}`')

    if monitor_val is None:
        if self.strict:
            raise RuntimeError(error_msg)
        if self.verbose > 0:
            rank_zero_warn(error_msg, RuntimeWarning)

        return False

    return True

And if the strict parameter is set (as it is by default), the trainer terminates with that exception.

I think the problem is that EarlyStopping checks for the presence of the validation metric on_epoch_end rather than on_validation_end. If validation is performed at the end of every epoch this is not a problem, but it becomes one if validation runs less often. One could set strict to False, but I think users should still get a warning if the metric they try to monitor is not present after a validation run.
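
For reference, that workaround would look roughly like this (monitor, strict and verbose are the arguments visible in the snippet above; how the callback is handed to the Trainer has varied across versions):

    from pytorch_lightning.callbacks import EarlyStopping

    # Workaround sketch: don't raise when the monitored metric is absent on epochs
    # where no validation ran; just warn and skip the check.
    early_stop = EarlyStopping(monitor='val_loss', strict=False, verbose=True)

    # e.g. Trainer(early_stop_callback=early_stop) in the releases discussed here,
    # or Trainer(callbacks=[early_stop]) in later ones.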

Instead, a good solution is to make EarlyStopping use on_validation_end instead of on_epoch_end. I believe this was the intention of EarlyStopping from the beginning. I'm opening a PR to discuss this quick fix.
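
A rough sketch of the direction proposed here (the hook signature follows the Callback API; the body is simplified and hypothetical, not the merged implementation):

    from pytorch_lightning.callbacks import Callback


    class ValidationEndEarlyStopping(Callback):
        # Hypothetical sketch: evaluate the stopping condition whenever validation
        # ends, so the check only runs when validation metrics can actually exist.
        def __init__(self, monitor='val_loss', strict=True):
            self.monitor = monitor
            self.strict = strict

        def on_validation_end(self, trainer, pl_module):
            logs = trainer.callback_metrics
            if self.monitor not in logs:
                msg = (f'Early stopping conditioned on metric `{self.monitor}`'
                       f' which is not available.')
                if self.strict:
                    raise RuntimeError(msg)
                return
            # compare logs[self.monitor] against the best value seen so far and
            # set trainer.should_stop once patience is exhausted (omitted here)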

baldassarreFe added a commit to baldassarreFe/pytorch-lightning that referenced this issue Apr 11, 2020
`EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. 
In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`.
mergify bot pushed a commit that referenced this issue May 25, 2020
* Fixes #490

`EarlyStopping` should check the metric of interest `on_validation_end` rather than `on_epoch_end`. 
In a normal scenario, this does not cause a problem, but in combination with `check_val_every_n_epoch>1` in the `Trainer` it results in a warning or in a `RuntimeError` depending on `strict`.

* Highlighted that ES callback runs on val epochs in docstring

* Updated EarlyStopping in rst doc

* Update early_stopping.py

* Update early_stopping.rst

* Update early_stopping.rst

* Update early_stopping.rst

* Update early_stopping.rst

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update docs/source/early_stopping.rst

* fix doctest indentation warning

* Train loop calls early_stop.on_validation_end

* chlog

Co-authored-by: William Falcon <waf2107@columbia.edu>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka@pytorchlightning.ai>
@csipapicsa

csipapicsa commented May 12, 2022

I had the same problem: I did a dummy slice for the validation set (e.g. X_test = X_test[10000:11000]) with indices that were too high for the dataset, which resulted in a completely empty set, and of course the network had nothing to validate on.

Maybe a warning message would be good if the test/train sets are empty.
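
A minimal guard along those lines (plain Python; X_test is the array from the example above):

    def check_split(name, data):
        # Catch an accidentally empty split (e.g. a slice past the end of the array)
        # up front, instead of discovering it later as missing validation metrics.
        if len(data) == 0:
            raise ValueError(f'{name} split is empty - check your slicing indices')


    check_split('validation', X_test[10000:11000])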
