
Evaluation container does not exit on error, error is not reported #47

Closed
amickan opened this issue Jul 26, 2024 · 10 comments
amickan commented Jul 26, 2024

Yesterday and today I spent a couple of hours debugging why an evaluation container stalled. I checked the logs on SageMaker, but there was nothing there. The container stopped after mounting its inputs and then sat idle, apparently doing nothing, until SageMaker killed it when the one-hour time limit was reached.

I assumed that error messages would be raised and appear in the container logs, as they do for algorithm jobs. Since that was not the case here, I assumed something else was going on and was led down the wrong rabbit hole while debugging. This container was using the recently added multiprocessing improvements implemented in #41. We ultimately found the error by adding a simple print statement to the handle_error() function that prints the error.

I'm unsure whether we should add this print statement permanently. Users should obviously be informed of errors in their scripts, but I assume the proper solution is more involved, since the container should exit when it encounters an error, and that clearly did not happen in this case. Something in the termination of the child processes didn't work. @pkcakeout can hopefully take a look when he is back.
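To illustrate the failure mode, here is a minimal sketch assuming a `multiprocessing.Pool`-based executor. The function names (`evaluate`, `run_pool`) are hypothetical, not the actual container code: worker exceptions are collected from the `AsyncResult` objects and printed, so the parent can exit non-zero instead of idling until the time limit.

```python
import multiprocessing
import sys


def evaluate(case):
    # Stand-in for a per-case evaluation; None simulates e.g. an
    # incorrect ground-truth path raising inside the worker.
    if case is None:
        raise FileNotFoundError("ground truth path does not exist")
    return case * 2


def run_pool(cases):
    """Run all cases in a Pool and collect worker exceptions instead of
    letting them disappear silently. Returns (results, errors)."""
    with multiprocessing.Pool(processes=2) as pool:
        async_results = [pool.apply_async(evaluate, (c,)) for c in cases]
        pool.close()
        pool.join()
    results, errors = [], []
    for r in async_results:
        try:
            # get() re-raises the worker's exception in the parent.
            results.append(r.get())
        except Exception as exc:
            # The print the issue mentions: without something like this,
            # the failure is invisible in the container logs.
            print("Child process failed:", exc, file=sys.stderr)
            errors.append(exc)
    return results, errors


if __name__ == "__main__":
    results, errors = run_pool([1, 2, None])
    # In the real container you would exit non-zero here so the
    # orchestrator (e.g. SageMaker) sees the failure instead of
    # waiting an hour for the time limit:
    # sys.exit(1 if errors else 0)
```

The key design point is that `AsyncResult.get()` must be called somewhere: if the parent only joins the pool without ever calling `get()`, worker exceptions are stored but never surfaced, which matches the silent stall described above.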

Feel free to rename the issue once you know what the problem is. I might not be hitting the nail on the head here.

@chrisvanrun
Collaborator

The multiprocessing setup did take a direct child process dying into account, either via a 'regular' non-zero exit or a hard kill by the OOM killer.

However, if that initial child then spawns grandchildren, it is the responsibility of that initial child to keep track of any errors or kills.

I made a quick fix for the sub-child failing, but I am not sure it actually addresses the problem.
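The grandchild responsibility described above could be sketched like this (a hypothetical helper, not the actual fix in the PR): the direct child runs the grandchild via `subprocess` and turns a non-zero exit into an exception, so the Pool's existing child-failure handling can see it.

```python
import subprocess
import sys


def run_grandchild(cmd):
    """Hypothetical helper run inside a direct child: the Pool only sees
    the direct child's exit status, so the child itself must propagate a
    grandchild failure instead of returning normally and masking it."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface the grandchild's stderr and fail the child loudly.
        print(proc.stderr, file=sys.stderr)
        raise RuntimeError(
            f"grandchild exited with code {proc.returncode}"
        )
    return proc.stdout
```

With this shape, a dead or failing grandchild becomes an ordinary exception in the child, which the existing non-zero-exit handling already accounts for.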

Anne just commented that it was actually an incorrect path for the ground truth. That seems very low level and should definitely have been picked up by the regular error catching.

Currently have a query out to the user to get some details.

@chrisvanrun
Collaborator

The PR has a provisional fix, based on the assumption that a grandchild was stalling and never being picked up.

@chrisvanrun
Collaborator

After hearing the user out in more detail, I am worried that the multiprocessing error reporting might work locally but not on GC. Running a few tests now.

@jmsmkn
Member

jmsmkn commented Aug 1, 2024

You could integrate https://github.com/DIAGNijmegen/rse-sagemaker-shim/tree/main into the container tests. There are examples in the tests directory of that repo, e.g. https://github.com/DIAGNijmegen/rse-sagemaker-shim/blob/fb080049859fa4d43e55362535bd9510df9dbb9a/tests/test_container_image.py#L149

It would be a very slow test, but maybe good for integration?

@chrisvanrun
Collaborator

> After hearing out the user in more detail I am worried that the multi-processing error reporting might work locally but not on GC. Running a few tests now.

In this case, the tests all came back as I expected. Most of this is post-hoc guessing. Plus, the actual problem was the user uploading to ground truths rather than evaluation methods. As such, I am still not 100% convinced that the old Pool method was not used.

@jmsmkn
Member

jmsmkn commented Aug 1, 2024

Maybe include the forge generation version in stdout by default?
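Printing the generator version at startup could look something like this, assuming the generating package is installed as a distribution (the package name here is a hypothetical placeholder, and `importlib.metadata` only works if the version is recorded in installed metadata):

```python
import importlib.metadata
import sys


def log_generator_version(package="grand-challenge-forge"):
    # Hypothetical startup banner: print which template/version generated
    # this container, so logs identify manually patched packs.
    try:
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        version = "unknown (not installed as a package)"
    print(f"Generated with {package} {version}", file=sys.stdout)
    return version
```

A one-line banner like this costs nothing and would immediately tell a debugger which generation of the template a stalling container came from.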

@chrisvanrun
Collaborator

> Maybe include the forge generation version in stdout by default?

It would not have helped, since the user manually patches the pack to incorporate the new executor.

@jmsmkn
Member

jmsmkn commented Aug 1, 2024

Yes, but it might help debugging for future users and save us from hitting the same problem again.

@jmsmkn
Member

jmsmkn commented Aug 1, 2024

Or maybe all the things that are irrelevant to a user's actual execution should be pushed to a separate package on PyPI that can be easily version-controlled?
