OCTRL-920 Fixes for stuck calibration runs #609

Merged 3 commits on Sep 18, 2024

Commits on Sep 13, 2024

  1. [core] improve a few logs

    - "environment" is changed to "partition", so that IL recognizes it correctly
    - we log the name of the attempted transition if it is delayed because another transition is already taking place
    - we log when we resume a delayed transition or teardown
    - we log which state update was attempted when the task is not in the roster
    knopers8 committed Sep 13, 2024
    4fdeae0
  2. OCTRL-920 [core] invalidating auto-stop lets the goroutine exit

    The goroutine performing the auto-stop would never exit if the auto-stop was invalidated. Consequently, it would stay blocked indefinitely, outliving its environment.
    I don't think this caused any trouble apart from leaking goroutines and obscuring the state of the application.
    We also invalidate the auto-stop when going to ERROR, so that we do not trigger a spurious, pointless transition attempt.
    knopers8 committed Sep 13, 2024
    bda8717
  3. [core] OCTRL-920 safer concurrency in KillTasks

    Parallel attempts to kill tasks were found to be the primary cause of stuck auto-environments.
    In particular, the channels in ackKilledTasks (and the code handling them) did not expect multiple listeners,
    so either one of the two kill acknowledgments would be stuck waiting to be received, or the other side, waiting for the acknowledgment, would never get it.
    This would cause KillTasks to be stuck indefinitely, blocking the main auto-environment code path.
    knopers8 committed Sep 13, 2024
    2c0940b