
Evaluation container does not exit on error, error is not reported #47

Closed
amickan opened this issue Jul 26, 2024 · 10 comments
amickan commented Jul 26, 2024

Yesterday and today I spent a couple of hours debugging why an evaluation container stalled. I checked the logs on SageMaker, but there was nothing there. The container stopped after mounting its inputs and then sat idle, apparently doing nothing, until SageMaker killed it when the one-hour time limit was reached.

I assumed that error messages would be raised and appear in the container logs, as they do for algorithm jobs. Since that was not the case here, I assumed something else was going on and was led down the wrong rabbit hole while debugging. This container was using the recently added multiprocessing improvements implemented in #41. We ultimately found the error by adding a simple print statement to the handle_error() function that prints the error.

I'm unsure whether we should add this print statement permanently. Users should obviously be informed of errors in their scripts, but I assume the proper solution is more involved, since the container should exit when it encounters an error, and that clearly did not happen in this case. Something in the termination of the child processes didn't work. @pkcakeout can hopefully take a look when he is back.
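To illustrate the failure mode, here is a minimal sketch assuming a `multiprocessing.Pool`-based executor. The function names (`evaluate`, `run_pool`) are hypothetical, not the actual container code: worker exceptions are collected from the `AsyncResult` objects and printed, so the parent can exit non-zero instead of idling until the time limit.

```python
import multiprocessing
import sys


def evaluate(case):
    # Stand-in for a per-case evaluation; None simulates e.g. an
    # incorrect ground-truth path raising inside the worker.
    if case is None:
        raise FileNotFoundError("ground truth path does not exist")
    return case * 2


def run_pool(cases):
    """Run all cases in a Pool and collect worker exceptions instead of
    letting them disappear silently. Returns (results, errors)."""
    with multiprocessing.Pool(processes=2) as pool:
        async_results = [pool.apply_async(evaluate, (c,)) for c in cases]
        pool.close()
        pool.join()
    results, errors = [], []
    for r in async_results:
        try:
            # get() re-raises the worker's exception in the parent.
            results.append(r.get())
        except Exception as exc:
            # The print the issue mentions: without something like this,
            # the failure is invisible in the container logs.
            print("Child process failed:", exc, file=sys.stderr)
            errors.append(exc)
    return results, errors


if __name__ == "__main__":
    results, errors = run_pool([1, 2, None])
    # In the real container you would exit non-zero here so the
    # orchestrator (e.g. SageMaker) sees the failure instead of
    # waiting an hour for the time limit:
    # sys.exit(1 if errors else 0)
```

The key design point is that `AsyncResult.get()` must be called somewhere: if the parent only joins the pool without ever calling `get()`, worker exceptions are stored but never surfaced, which matches the silent stall described above.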

Feel free to rename the issue once you know what the problem is. I might not be hitting the nail on the head here.

@chrisvanrun
Collaborator

The multiprocessing setup did take a direct child process dying into account, either via a 'regular' non-zero exit or a hard kill by the OOM killer.

However, if that initial child then spawns grandchildren, it is the responsibility of that initial child to keep track of any errors or kills.

I made a quick fix for the sub-child failing, but I am not sure it actually addresses the problem.
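The grandchild responsibility described above could be sketched like this (a hypothetical helper, not the actual fix in the PR): the direct child runs the grandchild via `subprocess` and turns a non-zero exit into an exception, so the Pool's existing child-failure handling can see it.

```python
import subprocess
import sys


def run_grandchild(cmd):
    """Hypothetical helper run inside a direct child: the Pool only sees
    the direct child's exit status, so the child itself must propagate a
    grandchild failure instead of returning normally and masking it."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface the grandchild's stderr and fail the child loudly.
        print(proc.stderr, file=sys.stderr)
        raise RuntimeError(
            f"grandchild exited with code {proc.returncode}"
        )
    return proc.stdout
```

With this shape, a dead or failing grandchild becomes an ordinary exception in the child, which the existing non-zero-exit handling already accounts for.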

Anne just commented that it was actually an incorrect path for the ground truth. That seems very low level and should definitely have been picked up by the regular error catching.

Currently have a query out to the user to get some details.

@chrisvanrun
Collaborator

The PR has a provisional fix, based on the assumption that a grandchild was stalling and never being picked up.

@chrisvanrun
Collaborator

After hearing the user out in more detail, I am worried that the multiprocessing error reporting might work locally but not on GC. Running a few tests now.

@jmsmkn
Member

jmsmkn commented Aug 1, 2024

You could integrate https://github.com/DIAGNijmegen/rse-sagemaker-shim/tree/main into the container tests. There are examples in the tests directory of that repo, e.g. https://github.com/DIAGNijmegen/rse-sagemaker-shim/blob/fb080049859fa4d43e55362535bd9510df9dbb9a/tests/test_container_image.py#L149

It would be a very slow test, but maybe good for integration?

@chrisvanrun
Collaborator

> After hearing out the user in more detail I am worried that the multi-processing error reporting might work locally but not on GC. Running a few tests now.

In this case, the tests all came back as I expected. Most of this is post-hoc guessing. Plus, the actual problem was the user uploading to ground truths rather than evaluation methods. As such, I am still not 100% convinced that the old Pool method was not used.

@jmsmkn
Member

jmsmkn commented Aug 1, 2024

Maybe include the forge generation version in stdout by default?
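Printing the generator version at startup could look something like this, assuming the generating package is installed as a distribution (the package name here is a hypothetical placeholder, and `importlib.metadata` only works if the version is recorded in installed metadata):

```python
import importlib.metadata
import sys


def log_generator_version(package="grand-challenge-forge"):
    # Hypothetical startup banner: print which template/version generated
    # this container, so logs identify manually patched packs.
    try:
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        version = "unknown (not installed as a package)"
    print(f"Generated with {package} {version}", file=sys.stdout)
    return version
```

A one-line banner like this costs nothing and would immediately tell a debugger which generation of the template a stalling container came from.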

@chrisvanrun
Collaborator

> Maybe include the forge generation version in stdout by default?

It would not have helped, since the user manually patches the pack to incorporate the new executor.

@jmsmkn
Member

jmsmkn commented Aug 1, 2024

Yes, but it might help debugging for future users and save us from hitting the same problem again.

@jmsmkn
Member

jmsmkn commented Aug 1, 2024

Or maybe all the things that are irrelevant to a user's actual execution should be pushed to a separate package on PyPI that can be easily version-controlled?
