
Avoid running on_training_end after Keyboard Interrupt #1099

Closed
lezwon opened this issue Mar 9, 2020 · 20 comments · Fixed by #1368
Labels
feature Is an improvement or enhancement help wanted Open to be worked on let's do it! approved to implement
Comments

@lezwon
Contributor

lezwon commented Mar 9, 2020

🚀 Feature

Right now, due to #795, on_training_end runs after a KeyboardInterrupt. This ends up running code that's meant to run only after successful training completion. That change should either be reverted, or an alternative should be provided so that some code runs only after successful training.

Motivation

I am training a model on SageMaker and have added notebook shutdown code within the on_training_end method. There were times when I had to manually cancel my model training because some parameters were incorrect. However, if I do that, the notebook shuts down immediately, because the on_training_end method runs even after a KeyboardInterrupt. I don't want my notebook shutting down after a keyboard interrupt, only after successful training completion.

Pitch

Maybe add an on_training_completed method for code that's meant to be run after successful training.

@lezwon lezwon added feature Is an improvement or enhancement help wanted Open to be worked on labels Mar 9, 2020
@github-actions
Contributor

github-actions bot commented Mar 9, 2020

Hi! Thanks for your contribution, great first issue!

@Borda
Member

Borda commented Mar 11, 2020

Good point, it also makes sense to skip evaluation of an unfinished training run when a user interrupts it because he doesn't want to wait another hour, lol. Mind sending a PR? 🤖

@Borda Borda added the let's do it! approved to implement label Mar 11, 2020
@Borda Borda added this to the 0.7.2 milestone Mar 11, 2020
@jeremyjordan
Contributor

Thanks for bringing this use case to our attention @lezwon, definitely something we'd want to consider.

What if the Trainer object had a status property? Similar to how compute jobs have a status {PENDING, RUNNING, COMPLETE, FAILED}, we could apply something similar here for the training job. In this case we could differentiate the two cases with a status of INTERRUPTED vs COMPLETED, and your callback logic can check for the proper status before closing (e.g. shut down the notebook on FAILED/COMPLETED but don't shut down for INTERRUPTED).

What do you think?
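[Editor's sketch] A minimal illustration of this status-property idea. All names and states here are hypothetical, not Lightning's actual API; the point is only that on_train_end still fires, but callbacks can branch on a terminal status:

```python
from enum import Enum


class TrainerStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    INTERRUPTED = "interrupted"
    FAILED = "failed"
    COMPLETED = "completed"


class Trainer:
    """Toy stand-in for a Trainer with the proposed status tracking."""

    def __init__(self):
        self.status = TrainerStatus.PENDING

    def fit(self, train_step):
        self.status = TrainerStatus.RUNNING
        try:
            train_step()
        except KeyboardInterrupt:
            # Manual cancellation gets its own terminal state.
            self.status = TrainerStatus.INTERRUPTED
        except Exception:
            self.status = TrainerStatus.FAILED
        else:
            self.status = TrainerStatus.COMPLETED
        # The hook still runs in every case; callbacks branch on status.
        self.on_train_end()

    def on_train_end(self):
        # e.g. shut the notebook down on FAILED/COMPLETED,
        # but skip the shutdown after an interrupt.
        if self.status is TrainerStatus.INTERRUPTED:
            print("interrupted: skipping shutdown")
        else:
            print("shutting down notebook")
```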

@lezwon
Contributor Author

lezwon commented Mar 12, 2020

@jeremyjordan that sounds great. We could definitely do that.

@Borda
Member

Borda commented Mar 12, 2020

Not sure if we really need to add complexity to the existing callbacks...
Also, on_training_completed would just be the standard behavior, so maybe rather a cleanup hook for the interrupted case?
@jeremyjordan status sounds cool; in my opinion it is a much better signal than the 0/1 returned from e.g. .fit() =)

@jeremyjordan
Contributor

@Borda are you suggesting that we add a new callback (on_training_completed that @lezwon mentioned) which runs conditionally according to the value returned from fit()? My only worry is that it might be confusing to know the difference between on_training_end (existing) and on_training_complete (proposed).

@Borda
Member

Borda commented Mar 13, 2020

at the moment I was just thinking about adding a Trainer status...
not sure if we want to add complexity with a new callback...

@jeremyjordan
Contributor

Ok yes I agree, I think the Trainer status would be a simple solution that may also be useful in other situations :)

@Borda
Member

Borda commented Mar 14, 2020

Cool, would anyone like to send a PR with the Trainer status? I'd suggest implementing it as an enum/numbering similar to logging.INFO/DEBUG/...

@lezwon
Contributor Author

lezwon commented Mar 15, 2020

Will this status also account for validation/test completed/interrupted?

@jeremyjordan
Contributor

yes we could do something like:

>>> from enum import Enum
>>> class TrainerStatus(Enum):
...     PENDING = "Initializing Trainer object"
...     TRAINING = "Optimizing model via train loop"
...     VALIDATING = "Running model on validation set"
...     TESTING = "Running model on test set"
...     FAILED = "Trainer failed to complete a successful run"
...     INTERRUPTED = "Training was interrupted by the user"
...     COMPLETED = "Training completed successfully"

I can work on getting this into a PR
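[Editor's sketch] To connect this enum back to the original use case: a user callback could consume such a status to decide whether to shut the notebook down. The helper name below is hypothetical; the enum is the one proposed in the comment above:

```python
from enum import Enum


class TrainerStatus(Enum):
    PENDING = "Initializing Trainer object"
    TRAINING = "Optimizing model via train loop"
    VALIDATING = "Running model on validation set"
    TESTING = "Running model on test set"
    FAILED = "Trainer failed to complete a successful run"
    INTERRUPTED = "Training was interrupted by the user"
    COMPLETED = "Training completed successfully"


def should_shutdown_notebook(status):
    # Shut down only on a terminal state that was not a manual
    # interrupt: FAILED or COMPLETED, but never INTERRUPTED.
    return status in (TrainerStatus.FAILED, TrainerStatus.COMPLETED)
```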

@lezwon
Contributor Author

lezwon commented Mar 15, 2020

Hey @jeremyjordan, what if we would like to execute different actions if the trainer failed during a validation task? Can we find out which task it failed at from the status? Just wondering if this would be flexible in such a scenario.

@jeremyjordan
Contributor

@lezwon do you have an example in mind of where you'd need that?

i'd prefer to keep the implementation simple and generic enough such that we don't keep adding new status types.

@lezwon
Contributor Author

lezwon commented Mar 16, 2020

Don't really have an example yet. Was just considering a scenario like that though.

I guess we could go ahead with the current proposal you mentioned. That should solve my issue for sure 😊

@jeremyjordan
Contributor

probably want to consider building upon the Trainer state defined in #770

fyi @xingzhaolee - what do you think? (we don't need to include it in #770 just thinking about building off of that)

@ghost

ghost commented Mar 22, 2020

@jeremyjordan sounds like a good idea! also I was thinking that since we have TrainerStatus, could it be the main class while TrainerMode serves as a nested enum, instead of putting them all under a single enum? something like:

class TrainerMode(enum.Enum):
    TRAINING = enum.auto()
    VALIDATING = enum.auto()
    TESTING = enum.auto()

class TrainerStatus(enum.Enum):
    PENDING = enum.auto()
    FAILED = enum.auto()
    INTERRUPTED = enum.auto()
    ...
    MODE = TrainerMode

then the user can access the current status and mode through:

# Interrupted when validation is running
if ... is TrainerStatus.INTERRUPTED and ... is TrainerStatus.MODE.VALIDATING:
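[Editor's sketch] Whether a class assigned inside an Enum body becomes an enum member or a plain class attribute has varied across Python versions, so the MODE = TrainerMode nesting above can be fragile. A more portable variant of the same idea (all names hypothetical) keeps the two enums separate and tracks both on the trainer:

```python
import enum


class TrainerMode(enum.Enum):
    TRAINING = enum.auto()
    VALIDATING = enum.auto()
    TESTING = enum.auto()


class TrainerStatus(enum.Enum):
    PENDING = enum.auto()
    RUNNING = enum.auto()
    FAILED = enum.auto()
    INTERRUPTED = enum.auto()
    COMPLETED = enum.auto()


class Trainer:
    """Toy trainer tracking status and mode as separate attributes."""

    def __init__(self):
        self.status = TrainerStatus.PENDING
        self.mode = TrainerMode.TRAINING


def interrupted_during_validation(trainer):
    # "Interrupted while validation was running": combine the two
    # independent dimensions instead of nesting one enum in the other.
    return (trainer.status is TrainerStatus.INTERRUPTED
            and trainer.mode is TrainerMode.VALIDATING)
```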

@jeremyjordan
Contributor

sure! that would be flexible enough to address @lezwon's earlier comment. i had started a branch to work on TrainerStatus; once #770 is merged i can pick up that work with your new suggestion here :)

@Borda
Member

Borda commented Mar 22, 2020

if I understand it correctly, .auto() means the enum value could differ between runs, right? Even though it seems to start from 1 and increase incrementally...
if so, I would recommend using exact numbers in case of exporting/importing :]
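[Editor's sketch] For the record: enum.auto() numbers members 1, 2, 3, ... in definition order, so the values are deterministic within one definition, but they silently change if a member is inserted or reordered, which is exactly the export/import hazard raised here. Class names below are illustrative:

```python
import enum


class AutoStatus(enum.Enum):
    # auto() assigns 1, 2, ... in definition order; inserting a new
    # member in the middle would renumber everything after it.
    PENDING = enum.auto()
    FAILED = enum.auto()


class StableStatus(enum.IntEnum):
    # Explicit values survive reordering and insertion, which matters
    # whenever the numbers are serialized (cf. logging.INFO/DEBUG).
    PENDING = 0
    FAILED = 10
    INTERRUPTED = 20
```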

@ghost

ghost commented Mar 23, 2020

not too sure about that, but the order probably matters too when a new status or mode is added. so I guess we should use exact numbers like you suggested, in case any import or export is involved! 🙂

@jeremyjordan
Contributor

i went back and forth on whether the trainer status would be a valuable addition. ultimately, i decided to opt for a simple attribute denoting when a KeyboardInterrupt has been caught.
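[Editor's sketch] A rough illustration of that simpler approach, recording the caught interrupt as a flag that callbacks can consult. The attribute and method names are illustrative, not necessarily those used by the merged fix (#1368):

```python
class Trainer:
    """Toy trainer recording whether a KeyboardInterrupt was caught."""

    def __init__(self):
        self.interrupted = False  # illustrative attribute name

    def fit(self, train_step):
        try:
            train_step()
        except KeyboardInterrupt:
            self.interrupted = True
        # The hook still runs, but it can now tell the cases apart.
        self.on_train_end()

    def on_train_end(self):
        if self.interrupted:
            print("interrupted: skipping teardown meant for finished runs")
        else:
            print("training finished: running teardown")
```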

@Borda Borda modified the milestones: v0.7., v0.7.x Apr 18, 2021