
Avoid running on_training_end after Keyboard Interrupt #1099

Closed
lezwon opened this issue Mar 9, 2020 · 20 comments · Fixed by #1368
Labels
feature Is an improvement or enhancement help wanted Open to be worked on let's do it! approved to implement
Comments

@lezwon
Contributor

lezwon commented Mar 9, 2020

🚀 Feature

Right now, due to #795, on_training_end runs after a KeyboardInterrupt. This ends up running code that's meant to run only after successful training completion. That change should either be reverted, or an alternative should be provided so that some code runs only after successful training.

Motivation

I am training a model on SageMaker and have added notebook shutdown code within the on_training_end method. There were times when I had to manually cancel my model training because some parameters were incorrect. However, if I do that, the notebook shuts down immediately, because the on_training_end method runs even after a KeyboardInterrupt. I don't want my notebook shutting down after a keyboard interrupt, only after successful training completion.

Pitch

Maybe add an on_training_completed method for code that's meant to be run after successful training.

@lezwon lezwon added feature Is an improvement or enhancement help wanted Open to be worked on labels Mar 9, 2020
@github-actions
Contributor

github-actions bot commented Mar 9, 2020

Hi! Thanks for your contribution, great first issue!

@Borda
Member

Borda commented Mar 11, 2020

Good point, it also makes sense to skip evaluation of an unfinished training run when a user interrupts it because he doesn't want to wait another hour, lol. Mind sending a PR? 🤖

@Borda Borda added the let's do it! approved to implement label Mar 11, 2020
@Borda Borda added this to the 0.7.2 milestone Mar 11, 2020
@jeremyjordan
Contributor

Thanks for bringing this use case to our attention @lezwon, definitely something we'd want to consider.

What if the Trainer object had a status property? Similar to how compute jobs have a status {PENDING, RUNNING, COMPLETE, FAILED}, we could apply something similar here for the training job. In this case we could differentiate the two cases with a status of INTERRUPTED vs COMPLETED, and your callback logic can check for the proper status before closing (e.g. shut down the notebook on FAILED/COMPLETED but don't shut down for INTERRUPTED).

What do you think?
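[Editor's sketch] A minimal illustration of this status-property idea. All names and states here are hypothetical, not Lightning's actual API; the point is only that on_train_end still fires, but callbacks can branch on a terminal status:

```python
from enum import Enum


class TrainerStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    INTERRUPTED = "interrupted"
    FAILED = "failed"
    COMPLETED = "completed"


class Trainer:
    """Toy stand-in for a Trainer with the proposed status tracking."""

    def __init__(self):
        self.status = TrainerStatus.PENDING

    def fit(self, train_step):
        self.status = TrainerStatus.RUNNING
        try:
            train_step()
        except KeyboardInterrupt:
            # Manual cancellation gets its own terminal state.
            self.status = TrainerStatus.INTERRUPTED
        except Exception:
            self.status = TrainerStatus.FAILED
        else:
            self.status = TrainerStatus.COMPLETED
        # The hook still runs in every case; callbacks branch on status.
        self.on_train_end()

    def on_train_end(self):
        # e.g. shut the notebook down on FAILED/COMPLETED,
        # but skip the shutdown after an interrupt.
        if self.status is TrainerStatus.INTERRUPTED:
            print("interrupted: skipping shutdown")
        else:
            print("shutting down notebook")
```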

@lezwon
Contributor Author

lezwon commented Mar 12, 2020

@jeremyjordan that sounds great. We could definitely do that.

@Borda
Member

Borda commented Mar 12, 2020

Not sure if we really need to add complexity to the existing callbacks...
Also, on_training_completed would just be the standard behavior, so maybe rather a cleanup hook for the interrupted case?
@jeremyjordan status sounds cool; in my opinion it is a much better signal than the 0/1 returned from e.g. .fit() =)

@jeremyjordan
Contributor

@Borda are you suggesting that we add a new callback (on_training_completed that @lezwon mentioned) which runs conditionally according to the value returned from fit()? My only worry is that it might be confusing to know the difference between on_training_end (existing) and on_training_complete (proposed).

@Borda
Member

Borda commented Mar 13, 2020

at the moment I was just thinking about adding a Trainer status...
not sure if we want to add complexity with a new callback...

@jeremyjordan
Contributor

Ok yes I agree, I think the Trainer status would be a simple solution that may also be useful in other situations :)

@Borda
Member

Borda commented Mar 14, 2020

Cool, would anyone like to send a PR with the Trainer status? I'd suggest implementing it as an enum/numbering similar to logging.INFO/DEBUG/...

@lezwon
Contributor Author

lezwon commented Mar 15, 2020

Will this status also account for validation/test completed/interrupted?

@jeremyjordan
Contributor

yes we could do something like:

>>> from enum import Enum
>>> class TrainerStatus(Enum):
...     PENDING = "Initializing Trainer object"
...     TRAINING = "Optimizing model via train loop"
...     VALIDATING = "Running model on validation set"
...     TESTING = "Running model on test set"
...     FAILED = "Trainer failed to complete a successful run"
...     INTERRUPTED = "Training was interrupted by the user"
...     COMPLETED = "Training completed successfully"

I can work on getting this into a PR
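[Editor's sketch] To connect this enum back to the original use case: a user callback could consume such a status to decide whether to shut the notebook down. The helper name below is hypothetical; the enum is the one proposed in the comment above:

```python
from enum import Enum


class TrainerStatus(Enum):
    PENDING = "Initializing Trainer object"
    TRAINING = "Optimizing model via train loop"
    VALIDATING = "Running model on validation set"
    TESTING = "Running model on test set"
    FAILED = "Trainer failed to complete a successful run"
    INTERRUPTED = "Training was interrupted by the user"
    COMPLETED = "Training completed successfully"


def should_shutdown_notebook(status):
    # Shut down only on a terminal state that was not a manual
    # interrupt: FAILED or COMPLETED, but never INTERRUPTED.
    return status in (TrainerStatus.FAILED, TrainerStatus.COMPLETED)
```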

@lezwon
Contributor Author

lezwon commented Mar 15, 2020

Hey @jeremyjordan, what if we would like to execute different actions if the trainer failed during a validation task? Can we find out which task it failed at from the status? Just wondering if this would be flexible in such a scenario.

@jeremyjordan
Contributor

@lezwon do you have an example in mind of where you'd need that?

i'd prefer to keep the implementation simple and generic enough such that we don't keep adding new status types.

@lezwon
Contributor Author

lezwon commented Mar 16, 2020

Don't really have an example yet. Was just considering a scenario like that though.

I guess we could go ahead with the current proposal you mentioned. That should solve my issue for sure 😊

@jeremyjordan
Contributor

probably want to consider building upon the Trainer state defined in #770

fyi @xingzhaolee - what do you think? (we don't need to include it in #770 just thinking about building off of that)

@ghost

ghost commented Mar 22, 2020

@jeremyjordan sounds like a good idea! also I was thinking that since we have TrainerStatus, could it be the main class while TrainerMode serves as a nested enum, instead of putting them all under a single enum? something like:

class TrainerMode(enum.Enum):
    TRAINING = enum.auto()
    VALIDATING = enum.auto()
    TESTING = enum.auto()

class TrainerStatus(enum.Enum):
    PENDING = enum.auto()
    FAILED = enum.auto()
    INTERRUPTED = enum.auto()
    ...
    MODE = TrainerMode

then the user can access the current status and mode through:

# Interrupted when validation is running
if ... is TrainerStatus.INTERRUPTED and ... is TrainerStatus.MODE.VALIDATING:
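[Editor's sketch] Whether a class assigned inside an Enum body becomes an enum member or a plain class attribute has varied across Python versions, so the MODE = TrainerMode nesting above can be fragile. A more portable variant of the same idea (all names hypothetical) keeps the two enums separate and tracks both on the trainer:

```python
import enum


class TrainerMode(enum.Enum):
    TRAINING = enum.auto()
    VALIDATING = enum.auto()
    TESTING = enum.auto()


class TrainerStatus(enum.Enum):
    PENDING = enum.auto()
    RUNNING = enum.auto()
    FAILED = enum.auto()
    INTERRUPTED = enum.auto()
    COMPLETED = enum.auto()


class Trainer:
    """Toy trainer tracking status and mode as separate attributes."""

    def __init__(self):
        self.status = TrainerStatus.PENDING
        self.mode = TrainerMode.TRAINING


def interrupted_during_validation(trainer):
    # "Interrupted while validation was running": combine the two
    # independent dimensions instead of nesting one enum in the other.
    return (trainer.status is TrainerStatus.INTERRUPTED
            and trainer.mode is TrainerMode.VALIDATING)
```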

@jeremyjordan
Contributor

sure! that would be flexible enough to address @lezwon's earlier comment. i had started a branch to work on TrainerStatus; once #770 is merged i can pick up that work with your new suggestion here :)

@Borda
Member

Borda commented Mar 22, 2020

if I understand it correctly, .auto() means the enum value could differ between runs, right? Even though it seems to start from 1 and increase incrementally...
if so, I would recommend using exact numbers in case of exporting/importing :]
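[Editor's sketch] For the record: enum.auto() numbers members 1, 2, 3, ... in definition order, so the values are deterministic within one definition, but they silently change if a member is inserted or reordered, which is exactly the export/import hazard raised here. Class names below are illustrative:

```python
import enum


class AutoStatus(enum.Enum):
    # auto() assigns 1, 2, ... in definition order; inserting a new
    # member in the middle would renumber everything after it.
    PENDING = enum.auto()
    FAILED = enum.auto()


class StableStatus(enum.IntEnum):
    # Explicit values survive reordering and insertion, which matters
    # whenever the numbers are serialized (cf. logging.INFO/DEBUG).
    PENDING = 0
    FAILED = 10
    INTERRUPTED = 20
```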

@ghost

ghost commented Mar 23, 2020

not too sure about that, but the order probably matters too when a new status or mode is added. so I guess we should use exact numbers like you suggested, in case any import or export is involved! 🙂

@jeremyjordan
Contributor

i went back and forth on whether the trainer status would be a valuable addition. ultimately, i decided to opt for a simple attribute denoting when a KeyboardInterrupt has been caught.
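[Editor's sketch] A rough illustration of that simpler approach, recording the caught interrupt as a flag that callbacks can consult. The attribute and method names are illustrative, not necessarily those used by the merged fix (#1368):

```python
class Trainer:
    """Toy trainer recording whether a KeyboardInterrupt was caught."""

    def __init__(self):
        self.interrupted = False  # illustrative attribute name

    def fit(self, train_step):
        try:
            train_step()
        except KeyboardInterrupt:
            self.interrupted = True
        # The hook still runs, but it can now tell the cases apart.
        self.on_train_end()

    def on_train_end(self):
        if self.interrupted:
            print("interrupted: skipping teardown meant for finished runs")
        else:
            print("training finished: running teardown")
```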

@Borda Borda modified the milestones: v0.7., v0.7.x Apr 18, 2021