
Confusion over Early Stopping behavior and how it is intended to work in the future #2083

Closed
Dunrar opened this issue Jun 5, 2020 · 2 comments

Dunrar commented Jun 5, 2020

❓ Questions and Help

What is your question?

So, I intend to use early stopping on train_step and training metrics. There were some problems with this (early stopping being called twice in the training loop, not stopping at all in 'min' mode, not stopping when there is no validation, a missing return in the callback class). Those were fixed quickly, but I still have some problems with current master, and in #1458 early stopping on training metrics seems to have been disabled, if I understand it correctly; the 0.8.0-dev documentation says the same. Changing where the callback is called is still possible, though.

My question is: will early stopping on training metrics be possible going forward? Will calling an EarlyStopping subclass in on_train_end catch training metrics and stop training based on them?

Also, I don't know whether I should open another bug report for my current problem with early stopping before #1504 is merged, which might fix it. I have not changed my code to use a subclass of EarlyStopping; instead I edited the EarlyStopping class so that def on_validation_end(self, trainer, pl_module): returns self._run_early_stopping_check(trainer, pl_module) (which is going to be in #1504 anyway, if I understand correctly). Early stopping seems to work now (despite there being no validation step...), but it stops too early again: not before patience has been reached, but still clearly earlier than it should.
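
For context, the equivalent change expressed as a subclass would look roughly like this (just a sketch; the class name is mine, and I actually patched EarlyStopping in place rather than subclassing it):

from pytorch_lightning.callbacks import EarlyStopping

class TrainMetricEarlyStopping(EarlyStopping):
    # hypothetical subclass mirroring my in-place edit
    def on_validation_end(self, trainer, pl_module):
        # return the result of the check instead of discarding it
        return self._run_early_stopping_check(trainer, pl_module)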

Code

from pytorch_lightning.callbacks import EarlyStopping

# monitor the per-batch training metric; stop when it has not improved
# by at least min_delta for `patience` checks
early_stopping = EarlyStopping(
    monitor='batch/mean_absolute_loss',
    min_delta=hparams.min_delta,
    patience=hparams.patience,
    mode='min',
)

with hparams.patience=150 and hparams.min_delta=0.01, but this happens (epoch/mean_absolute_loss is the mean of all batch/mean_absolute_loss of an epoch, logged in on_epoch_end):

[screenshot: training curve of epoch/mean_absolute_loss for run "BAC-4623 rand_RNN_1 Layers_10 Cells_Prediction Column 15_Run 2", captured 2020-06-05 13:00]

Way too early (provided I understand the expected behavior right), is it not?
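
For reference, the metrics are produced roughly like this (a simplified sketch of my LightningModule; the real model, data, and hparams are omitted, and only the metric names match my actual code):

import torch
import pytorch_lightning as pl

class SketchModule(pl.LightningModule):
    # minimal stand-in for my actual RNN model
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(10, 1)
        self._batch_losses = []

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.l1_loss(self(x), y)
        self._batch_losses.append(loss.detach())
        # per-batch value that EarlyStopping monitors
        return {'loss': loss, 'log': {'batch/mean_absolute_loss': loss}}

    def on_epoch_end(self):
        # epoch/mean_absolute_loss is the mean of all batch values of the epoch
        epoch_loss = torch.stack(self._batch_losses).mean()
        self.logger.log_metrics({'epoch/mean_absolute_loss': epoch_loss.item()},
                                step=self.current_epoch)
        self._batch_losses = []

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)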

Dunrar added the question label Jun 5, 2020

williamFalcon commented Jun 5, 2020

Yes, it will. This is currently being worked on in #1989.

Once this lands, you'll add:

# training_step OR validation_step
return TrainResult(loss, early_stop_on=something_else, checkpoint_on=something_else)
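
In a LightningModule that would look roughly like this (sketch only; the import path and final signature may still shift while #1989 is in review, and compute_loss stands in for your own loss computation):

from pytorch_lightning import TrainResult  # final import path may differ

def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # placeholder for your own loss
    # the same tensor (or any other tensor you track) drives early stopping
    # and checkpointing directly, no validation loop required
    return TrainResult(loss, early_stop_on=loss, checkpoint_on=loss)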

Dunrar commented Jun 5, 2020

Thank you! I like it a lot!

About the current behavior on master: I seem to be able to stop training early on training metrics despite #1458, so that functionality is still there right now, correct? And do you have any idea why my training stops this early?

Dunrar closed this as completed Jun 6, 2020