
default EarlyStopping callback should not fail on missing val_loss data #524

Closed
colllin opened this issue Nov 18, 2019 · 20 comments · Fixed by #743
Comments

@colllin

colllin commented Nov 18, 2019

Describe the bug
My training script failed overnight — this is the last thing I see in the logs before the instance shut down:

python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:128: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: avg_val_loss_total,avg_val_jacc10,avg_val_ce
  RuntimeWarning)

It seems like this was intended to be a "warning", but it appears that it interrupted my training script. Do you think that's possible, or could it be something else? I had 2 training scripts running on 2 different instances last night, and both shut down this way, with this RuntimeWarning as the last line in the logs. Is it possible that the default EarlyStopping callback killed my script because I didn't log a val_loss tensor somewhere it could find it?

To be clear, it is not my intention to use EarlyStopping at all, so I was quite surprised to wake up today to find my instance shut down, training interrupted, and no clear sign of a bug on my end. Did you intend this to interrupt the trainer? If so, how do we feel about changing that so the default EarlyStopping callback has no effect when it can't find a val_loss metric?
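For what it's worth, switching the default callback off explicitly should avoid this entirely. The Trainer argument below is my reading of the 2019-era API, so treat it as an assumption rather than a confirmed signature:

from pytorch_lightning import Trainer

# Assumption: passing False for `early_stop_callback` stops the Trainer from
# creating the default EarlyStopping at all (argument name unverified here).
trainer = Trainer(early_stop_callback=False)
trainer.fit(model)  # `model` is your LightningModule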

@colllin colllin added the bug Something isn't working label Nov 18, 2019
@kuynzereb
Contributor

I agree that it is quite unpleasant when the default early stopping callback unexpectedly stops training because it can't find val_loss. It is also unpleasant that you only find out the required metric is missing at the end of the first full training epoch (and, on top of that, training stops). So I would separate this into two different problems:

  1. Default early stopping should not stop the training. We should either disable it when no val_loss is found, or simply disable it by default altogether.
  2. We should check at the very beginning of training that the metric required by early stopping is available after the validation loop. Right now it is checked only at the end of the first training epoch, and if it is not present the training stops.

@AS-researcher6

AS-researcher6 commented Nov 20, 2019

I'm guessing you were using check_val_every_n_epoch > 1.
This error happens because early stopping reads from callback_metrics, which is cleared and re-filled at every training-step logging call. A hacky solution I have found is to save the last val_loss as a model attribute self.val_loss and return it from every training step, e.g.:

output = {
    'loss': loss,
    'log': log_dict,
    'progress_bar': prog_dict,
    'val_loss': self.val_loss,  # cached from the last validation loop
}
return output
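And for completeness, a sketch of where self.val_loss could get set, assuming the 2019-era validation_end aggregation hook inside the LightningModule (attribute and key names follow the example above):

import torch

# Inside your LightningModule: cache the aggregated validation loss so that
# training_step can keep re-exposing it between validation runs.
def validation_end(self, outputs):
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    self.val_loss = avg_loss  # read back in the dict returned by training_step
    return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}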

@kuynzereb
Contributor

Wow, indeed, there is a third problem:
3) It is not clear how early stopping should work when check_val_every_n_epoch > 1.

However, please note that callback metrics are now no longer replaced by new ones but updated; that was fixed in #492.

@kuynzereb
Contributor

So I would suggest the following:

  1. By default, the early stop callback is turned on, but if there is no val_loss we just warn the user that early stopping will not work, and training proceeds as though there were no early stop callback (roughly as in the sketch below).
  2. If the early stop callback is explicitly specified by the user, then we force the validation sanity check and examine the metrics obtained from it. If the metric required by the early stop callback is not present, we raise an error.
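Roughly, the distinction could look like this (an illustrative sketch, not the actual Lightning internals; the function and argument names are made up):

import warnings

def check_early_stopping_metric(callback_metrics, monitor, user_supplied_callback):
    # Sketch of the proposal above: error for an explicit callback,
    # warn-and-skip for the default one.
    if monitor in callback_metrics:
        return True  # metric available, early stopping can run normally
    if user_supplied_callback:
        # The user explicitly asked to monitor this metric: fail fast.
        raise RuntimeError(
            f'Early stopping is conditioned on `{monitor}`, which was not found '
            f'in the metrics returned by the validation loop.'
        )
    # Default callback: warn once and carry on as if early stopping were off.
    warnings.warn(
        f'`{monitor}` not found; the default early stopping callback is disabled.',
        RuntimeWarning,
    )
    return False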

@williamFalcon, what do you think?

@awaelchli
Member

awaelchli commented Nov 25, 2019

Isn't it possible that the user returns val_loss only in some epochs, e.g., only every other epoch (intentionally or not)?

@kuynzereb
Contributor

Yeah, it is a problem if no val_loss is returned in some epochs. In that case early stopping behaves quite strangely. This happens, for example, when check_val_every_n_epoch > 1.

@williamFalcon
Contributor

williamFalcon commented Nov 25, 2019

@awaelchli ... maybe modify the early stopping to skip the check when that key is missing?
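Something along these lines, perhaps (a rough sketch against the Keras-style on_epoch_end interface used by pt_callbacks.py at the time; the signature and the omitted bookkeeping are assumptions):

import warnings

class EarlyStopping:
    def __init__(self, monitor='val_loss'):
        self.monitor = monitor

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        current = logs.get(self.monitor)
        if current is None:
            # Metric missing this epoch: warn and skip instead of stopping training.
            warnings.warn(
                f'Early stopping conditioned on metric `{self.monitor}` '
                f'which is not available; skipping this check.',
                RuntimeWarning,
            )
            return False  # do not request that training stop
        # ... the usual patience / best-score bookkeeping would go here ...
        return False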

@awaelchli
Member

very reasonable imo. @kuynzereb do you see any problem with this?

@kuynzereb
Contributor

Nope, it sounds good to me too. But we will need to explicitly remove this key from the callback metrics at the start of each epoch, otherwise it will always be available (right now it always stores the metric from the last validation loop).
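That is, something like this at the start of every training epoch (a trainer-side sketch; callback_metrics is the dict the callbacks read from and val_loss is the monitored key):

# Drop the stale value so a genuinely missing metric is detected instead of
# silently reusing the number from the last validation loop.
self.callback_metrics.pop('val_loss', None)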

fellnerse added a commit to fellnerse/forgerydetection that referenced this issue Nov 27, 2019
@williamFalcon
Contributor

@awaelchli or @kuynzereb mind submitting a PR?

@awaelchli
Member

I can look into it.

@williamFalcon
Contributor

@awaelchli any updates?

@awaelchli
Member

lost track of this after I ran into some unexpected behaviors. will try to get back to it, but it seems @kuynzereb has a better overview of early stopping than I do.

@kuynzereb
Contributor

It seems that we can just add a condition that early_stop_callback.on_epoch_end() should be called only if current_epoch % check_val_every_n_epoch == 0
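For example, roughly (the trainer attribute names here are assumptions standing in for the real internals):

# Only consult early stopping on epochs where the validation loop actually ran.
met_val_check_interval = (epoch + 1) % self.check_val_every_n_epoch == 0
if self.enable_early_stop and met_val_check_interval:
    stop_requested = self.early_stop_callback.on_epoch_end(epoch, self.callback_metrics)
    if stop_requested:
        self.should_stop = True  # end training after this epoch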

@veritas9872

Hello. I am still getting a similar problem. Has this been confirmed as solved?

@TuBui

TuBui commented Sep 14, 2021

I confirm the problem re-appears in pytorch-lightning versions 1.4.0 and 1.4.1. The early stop callback is always checked at the end of the first epoch, so if check_val_every_n_epoch > 1 the job will fail.
It runs fine on version 1.3.1, though.

@tchaton
Contributor

tchaton commented Sep 14, 2021

Dear @TuBui, @veritas9872,

Would you mind trying out master?

pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git

If the error persists, please re-open this issue.

Best,
T.C

@TuBui

TuBui commented Sep 14, 2021

thanks for the quick update. pytorch-lightning-1.5.0.dev0 (current master branch) works.

@yinrong

yinrong commented Oct 4, 2021

@TuBui I cannot pip install 1.5.0; how should I use it?

@TuBui

TuBui commented Oct 4, 2021

@yinrong check tchaton's answer above.
