OCTRL-920 Fixes for stuck calibration runs #609

Merged
knopers8 merged 3 commits into AliceO2Group:master from octrl-920 on Sep 18, 2024

knopers8
Collaborator

The three commits fix the problems observed while investigating the case of env 2pedNJwSjL8. The updated core was confirmed to work correctly in this test scenario.

- "environment" is renamed to "partition", so that IL recognizes it correctly
- we log the name of the attempted transition when it is delayed because another transition is already in progress
- we log when we resume a delayed transition or teardown
- we log which state update was attempted when the task is not in the roster
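
The logging changes listed above could look roughly like the following sketch. It is not the code from these commits: `transitionGate`, `logFunc`, and the field names `attempted` and `inProgress` are hypothetical, and the transition names used with it are only illustrative. The one detail taken from the list is that the environment ID is logged under the key `partition`.

```go
package main

import "sync"

// logFunc is a hypothetical structured-logging hook: a set of fields plus a message.
type logFunc func(fields map[string]string, msg string)

// transitionGate is a hypothetical guard that lets one transition run at a time.
type transitionGate struct {
	mu      sync.Mutex
	envID   string
	current string // name of the transition in progress, "" if none
	logf    logFunc
}

func newTransitionGate(envID string, logf logFunc) *transitionGate {
	return &transitionGate{envID: envID, logf: logf}
}

// TryBegin reports whether the named transition may start now.
// If another transition is in progress, it logs both names and returns false.
func (g *transitionGate) TryBegin(name string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.current != "" {
		g.logf(map[string]string{
			"partition":  g.envID, // "partition" rather than "environment"
			"attempted":  name,
			"inProgress": g.current,
		}, "transition delayed, another transition is in progress")
		return false
	}
	g.current = name
	return true
}

// End marks the current transition as finished.
func (g *transitionGate) End() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.current = ""
}
```

A caller whose `TryBegin` returned false would retry after `End` and could log the resumption at that point.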
The goroutine performing the auto-stop would never exit if the auto-stop was invalidated. Consequently, it would stay blocked indefinitely, outliving the environment.
I don't think this caused any trouble apart from leaking goroutines and obscuring the state of the application.
We also invalidate the auto-stop when going to ERROR, so that we do not trigger a spurious transition attempt when it is pointless to do so.
Parallel attempts to kill tasks were found to be the primary cause of stuck auto-environments.
In particular, the channels in ackKilledTasks (and the code handling them) did not expect multiple listeners, so either one of the two kill acknowledgments would block waiting to be received, or the other side, waiting for the acknowledgment, would never get it.
This would cause KillTasks to block indefinitely, which blocks the main auto-environment code path.
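
One way to make kill acknowledgments safe for several waiters is to close a shared channel instead of sending a single value on it, since a close is seen by every receiver. The sketch below illustrates that idea only. It is not necessarily how this PR resolves the problem, and `killAcks`, `Expect`, and `Ack` are hypothetical names.

```go
package main

import "sync"

// killAcks is a hypothetical registry of pending kill acknowledgments.
type killAcks struct {
	mu      sync.Mutex
	pending map[string]chan struct{}
}

func newKillAcks() *killAcks {
	return &killAcks{pending: make(map[string]chan struct{})}
}

// Expect registers interest in the kill acknowledgment for taskID.
// The returned channel is closed once the task is acknowledged as killed.
// The boolean is true only for the first caller, which is the one that
// should actually send the kill request.
func (k *killAcks) Expect(taskID string) (<-chan struct{}, bool) {
	k.mu.Lock()
	defer k.mu.Unlock()
	if ch, ok := k.pending[taskID]; ok {
		return ch, false // a kill is already in flight, share its channel
	}
	ch := make(chan struct{})
	k.pending[taskID] = ch
	return ch, true
}

// Ack records that taskID was killed and releases every waiter.
// It never blocks, and it returns false if nobody was waiting.
func (k *killAcks) Ack(taskID string) bool {
	k.mu.Lock()
	defer k.mu.Unlock()
	ch, ok := k.pending[taskID]
	if !ok {
		return false
	}
	delete(k.pending, taskID)
	close(ch) // a close is observed by all receivers, unlike a single send
	return true
}
```

With this shape, a second concurrent kill attempt neither waits for an acknowledgment that was already consumed nor leaves the acknowledging side blocked on a send.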
  log.WithField("taskIds", strings.Join(unkillable.GetTaskIds(), ", ")).
-     Debugf("some tasks cannot be physically killed (already dead?), will instead only be removed from roster")
+     Debugf("some tasks cannot be physically killed (already dead or being killed in another goroutine?), will instead only be removed from roster")
Collaborator

Does this happen a lot? Shouldn't this be a Warning?

Collaborator Author

I think this can be a somewhat valid situation if two kill requests arrive one right after the other, so I'm not sure a warning is needed. But I don't know how often this happens.

@knopers8 knopers8 merged commit a5b0ccf into AliceO2Group:master Sep 18, 2024
2 checks passed
@knopers8 knopers8 deleted the octrl-920 branch September 18, 2024 15:03