
Graceful keyboard doesn't work as expected #965

Closed
williamFalcon opened this issue Feb 27, 2020 · 8 comments · Fixed by #2029

@williamFalcon (Contributor)

@jeremyjordan
I'm not sure the fix is what you were looking for. Make sure to also log the message only from proc_0.

[screenshot]
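For reference, a rough sketch of gating that message on the process rank; log_shutdown_message and proc_rank are illustrative names, not the exact Lightning API:

import logging

log = logging.getLogger(__name__)

def log_shutdown_message(proc_rank):
    # hypothetical helper: only the rank-0 process prints the shutdown notice,
    # so the message does not repeat once per DDP worker
    if proc_rank == 0:
        log.info("Detected KeyboardInterrupt, attempting graceful shutdown...")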

williamFalcon added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Feb 27, 2020
@jeremyjordan (Contributor) commented Feb 27, 2020

Thanks, I’ll look into this. To reproduce, is this running in a Colab notebook with a TPU enabled?

@jeremyjordan (Contributor)

I spent some time looking into this but haven't yet figured out how to gracefully exit when using DDP. We know from the log output that the first KeyboardInterrupt was caught.

Reading the docs for torch.multiprocessing I see it says:

In the case an exception was caught in the child process, it is forwarded and its traceback is included in the exception raised in the parent process.

I'm wondering if the KeyboardInterrupt is being sent to all of the processes and the child processes are re-raising to the main process which can only handle the exception once.
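For context, the forwarding behaviour the docs describe can be seen with a minimal sketch (using a plain RuntimeError rather than a KeyboardInterrupt):

import torch.multiprocessing as mp

def worker(rank):
    # any uncaught exception in a child is reported back to the parent by spawn
    raise RuntimeError(f"failure in rank {rank}")

if __name__ == "__main__":
    try:
        mp.spawn(worker, nprocs=2)
    except Exception as err:
        # spawn re-raises in the parent, with the child's traceback included
        print(f"parent caught: {err}")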

@Borda (Member) commented Apr 9, 2020

@jeremyjordan how is it going here?

@jeremyjordan (Contributor)

@Borda I could use some help figuring out how to fail gracefully in the case of DDP.

When the user signals a KeyboardInterrupt for the first time, it should signal all of the nodes to finish their work, collect back onto the main process, and then run the teardown routine. If the user signals a second KeyboardInterrupt, it should stop immediately.
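A rough sketch of that two-stage behaviour, just to make the intent concrete (names here are hypothetical, not the eventual implementation):

import signal
import sys

interrupt_count = 0

def handle_sigint(sig, frame):
    global interrupt_count
    interrupt_count += 1
    if interrupt_count == 1:
        # first Ctrl-C: let the workers finish, gather on the main process, run teardown
        print("Interrupt received, shutting down gracefully; press Ctrl-C again to force quit")
    else:
        # second Ctrl-C: stop immediately
        sys.exit(1)

signal.signal(signal.SIGINT, handle_sigint)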

@Borda (Member) commented Apr 13, 2020

Do you mean that the first KeyboardInterrupt ends training and runs the teardown, and the second KeyboardInterrupt ends even the teardown? Maybe we should really implement the trainer status enum...
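Something like the following, perhaps (names purely illustrative):

from enum import Enum

class TrainerStatus(Enum):
    RUNNING = "running"
    INTERRUPTED = "interrupted"  # first Ctrl-C: finish up, then tear down
    STOPPING = "stopping"        # second Ctrl-C: abort immediately
    FINISHED = "finished"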

@jeremyjordan (Contributor)

Yes, I believe an exception is being thrown for each worker and raised back to the main process. The main process catches the first exception and begins running the training teardown, but then the subsequent exceptions cause it to halt immediately. Does that make sense? Any ideas how to handle that?
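If that reading is right, the failure mode would look roughly like this (run_training and run_teardown are placeholders, not the actual trainer methods):

def run_training():
    raise KeyboardInterrupt   # placeholder: stands in for the interrupt raised by a worker

def run_teardown():
    pass                      # placeholder: checkpointing, logger finalization, etc.

try:
    run_training()
except KeyboardInterrupt:
    try:
        run_teardown()        # a second interrupt landing here ...
    except KeyboardInterrupt:
        pass                  # ... cuts the teardown short unless it is caught again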

@sneiman (Contributor) commented Apr 25, 2020

I have not tried to solve this particular problem, but I have wrestled with some similar problems; sharing what I learned in case it helps.

You can probably intercept the KeyboardInterrupt by setting a signal handler like this in each ddp process:

import signal

class ExitTraining(Exception):
    """Raised in place of KeyboardInterrupt so the interrupt can be handled once, upstream."""

def signal_handler(sig, frame):
    # re-raise from None to avoid sending all the traceback history;
    # translating it to our own exception (ExitTraining) keeps it easy to catch
    raise ExitTraining from None

signal.signal(signal.SIGINT, signal_handler)

To make these active in each ddp process, they have to be defined and signal.signal() called in ddp_train().

Catch ExitTraining in the main, CPU-based process. At this point, you 'should' have control of the interrupt.

You can create a shared Value (see python/pytorch multiprocessing) in Trainer.__init__(). You then use this as a flag to signal time to teardown. So ... your handler for ExitTraining in the main process would set the flag to True. And somewhere AFTER a barrier call in the training loop you check it, and initiate teardown.
NOTE that for ddp, you MUST create the shared value with

mp.get_context('spawn').Value(...)

This should work on a single node. To work across nodes, I believe you will have to create a Manager() - see mp.
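A compressed sketch of that shared-flag idea on a single node; the names are illustrative and not tied to Lightning's internals:

import multiprocessing as mp

# create the flag with the spawn context, as noted above, so the DDP
# child processes started via spawn can all see it
stop_requested = mp.get_context('spawn').Value('b', False)

def check_for_teardown(stop_requested):
    # run in each ddp process sometime AFTER a barrier call in the training loop
    if stop_requested.value:
        pass  # initiate teardown here

# the main-process handler for ExitTraining would simply flip the flag:
# stop_requested.value = True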

Hope this at least gives you a useful direction ...
I do something similar.

@cmpute (Contributor) commented Apr 29, 2020

A short note: I also encountered this problem in DDP. I found that the KeyboardInterrupt is sometimes not caught at all, and even when it is caught, self.interrupted is only set in the child process, so the value in the main process is never set. A workaround for me is to wrap the mp.spawn() call in a try/except and set the flags in the main process.
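A sketch of that workaround, as I understand it (train_fn and the interrupted flag are placeholders, not the actual trainer attributes):

import torch.multiprocessing as mp

def train_fn(rank):
    pass  # placeholder for the per-process training entry point

if __name__ == "__main__":
    interrupted = False
    try:
        mp.spawn(train_fn, nprocs=2)
    except KeyboardInterrupt:
        # the flag set inside a child process is not visible to the parent,
        # so set it again here in the main process
        interrupted = True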
