
[ML] Anomaly detection job reset can get stuck with no way to unblock #105928

Closed · droberts195 opened this issue on Mar 4, 2024 · 2 comments
Labels: >bug, :ml (Machine learning), Team:ML (Meta label for the ML team)

@droberts195 (Contributor)

When an anomaly detection job is reset, we start a local (non-persistent) task to perform the reset operations.

It is possible for this task to disappear before it completes. One well-known way this can happen is if the node it is running on dies mid-reset.

Looking at the code, you'd think that a way to unblock the reset would be to call the reset endpoint again for the same job. It appears that resetIfJobIsStillBlockedOnReset will be called from here:

ActionListener.wrap(r -> resetIfJobIsStillBlockedOnReset(task, request, listener), listener::onFailure)

However, this doesn't happen if the original task disappeared without a trace (for example because its node died) rather than failing with an error. What actually happens in this case is that the second reset call returns an error like this:

{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
  },
  "status": 404
}

This error comes from the failure handler of the get task call, cascading up through the chain of listeners.

Another problem is that if a reset ends up stalled because a node died, it's not very friendly to wait for a user to notice and retry it. Job deletions can have the same problem, and we retry those automatically as part of our nightly maintenance task.

Therefore, we should make two changes:

  1. Change the reset code so that resetIfJobIsStillBlockedOnReset is called not only from the success handler but also from the failure handler of a second or subsequent reset call, when a ResourceNotFoundException is the cause of the failure (a minimal sketch follows this list).
  2. Extend the ML daily maintenance task to check for resets that have no corresponding task and retry them, in the same way that deletes without corresponding tasks are retried.
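
For change 1, here is a minimal, self-contained sketch of the idea. The real change would live in the reset transport action and use Elasticsearch's ActionListener and ResourceNotFoundException; the small Listener interface and the waitForExistingResetTask helper below are simplified stand-ins for illustration only, while resetIfJobIsStillBlockedOnReset mirrors the method named above.

```java
// Sketch only: not the real transport action code.
import java.util.function.Consumer;

public final class ResetRetrySketch {

    /** Simplified stand-in for org.elasticsearch.action.ActionListener. */
    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);

        static <T> Listener<T> wrap(Consumer<T> responseHandler, Consumer<Exception> failureHandler) {
            return new Listener<T>() {
                @Override
                public void onResponse(T response) {
                    responseHandler.accept(response);
                }

                @Override
                public void onFailure(Exception e) {
                    failureHandler.accept(e);
                }
            };
        }
    }

    /** Simplified stand-in for Elasticsearch's ResourceNotFoundException. */
    static class ResourceNotFoundException extends RuntimeException {
        ResourceNotFoundException(String message) {
            super(message);
        }
    }

    /** Hypothetical: waits on the task id stored on the blocked job (a get-task call in reality). */
    void waitForExistingResetTask(String jobId, Listener<Void> listener) {
        // Details elided for the sketch.
    }

    /** Mirrors the real resetIfJobIsStillBlockedOnReset: re-runs the reset if the job is still blocked. */
    void resetIfJobIsStillBlockedOnReset(String jobId, Listener<Void> listener) {
        // Details elided for the sketch.
    }

    /**
     * Today the retry happens only from the success handler. The proposal is to
     * also retry from the failure handler when the stored task id points at a
     * task that no longer exists, which is what happens when the node running
     * the original reset task died.
     */
    void waitThenRetry(String jobId, Listener<Void> finalListener) {
        waitForExistingResetTask(jobId, Listener.wrap(
            r -> resetIfJobIsStillBlockedOnReset(jobId, finalListener),
            e -> {
                if (e instanceof ResourceNotFoundException) {
                    // The original task vanished without storing a result:
                    // retry the reset instead of returning a confusing 404.
                    resetIfJobIsStillBlockedOnReset(jobId, finalListener);
                } else {
                    finalListener.onFailure(e);
                }
            }
        ));
    }
}
```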
droberts195 added the >bug and :ml (Machine learning) labels on Mar 4, 2024
elasticsearchmachine added the Team:ML (Meta label for the ML team) label on Mar 4, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@droberts195 (Contributor, Author)

For reference, the PR that added the daily retries for deletes was #60121.
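
Change 2 could follow the same shape as those delete retries. The sketch below is illustrative only: the real logic would live in the ML daily maintenance task, and findJobsBlockedOnReset, taskExists and retryReset are hypothetical helper names, not existing methods.

```java
// Sketch only: hypothetical helpers, not the real maintenance-task API.
import java.util.List;

public final class MaintenanceRetrySketch {

    /** A job that is blocked on reset, together with the task id it recorded. */
    record BlockedJob(String jobId, String blockedTaskId) {}

    /** Would query the ML config for jobs whose blocked reason is "reset". */
    List<BlockedJob> findJobsBlockedOnReset() {
        return List.of();
    }

    /** Would check the task management API for the recorded task id. */
    boolean taskExists(String taskId) {
        return false;
    }

    /** Would re-run the reset, as the nightly task already re-runs stalled deletes. */
    void retryReset(String jobId) {
    }

    /**
     * Proposed extra step for the nightly maintenance run: any job that is
     * blocked on reset but whose reset task has disappeared (for example
     * because its node died) gets its reset retried automatically, so users
     * do not have to notice the stuck job and retry it by hand.
     */
    void retryStalledResets() {
        for (BlockedJob job : findJobsBlockedOnReset()) {
            if (taskExists(job.blockedTaskId()) == false) {
                retryReset(job.jobId());
            }
        }
    }
}
```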

@jan-elastic jan-elastic self-assigned this Mar 5, 2024
jgowdyelastic added a commit to elastic/kibana that referenced this issue Mar 12, 2024
Fixes two issues:
- When a job is in a blocked state (resetting, deleting, reverting) but the underlying task [cannot be found](elastic/elasticsearch#105928), the task polling fails to start correctly and instead enters a loop where the tasks are checked as fast as possible.
- Some tasks can legitimately take a long time to run, but we still poll at the same 2 second rate.

This PR fixes the feedback loop and adds a check: once a poll has been running for over a minute, the poll interval is increased to 2 minutes.

Related to #171626
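
The backoff described in that commit is simple interval selection. As a language-neutral illustration (written in Java to match the other sketches in this issue, whereas the actual Kibana change is TypeScript), it could look like the sketch below; the constants and the nextPollInterval helper are assumptions, not the real Kibana code.

```java
// Sketch of the polling backoff idea: poll at the normal rate, and once
// polling has been going on for more than a minute, slow down to two minutes.
import java.time.Duration;
import java.time.Instant;

public final class PollBackoffSketch {

    static final Duration NORMAL_INTERVAL = Duration.ofSeconds(2);
    static final Duration SLOW_INTERVAL = Duration.ofMinutes(2);
    static final Duration SLOW_DOWN_AFTER = Duration.ofMinutes(1);

    /** Interval to wait before the next poll, given when polling started. */
    static Duration nextPollInterval(Instant pollingStartedAt, Instant now) {
        Duration elapsed = Duration.between(pollingStartedAt, now);
        return elapsed.compareTo(SLOW_DOWN_AFTER) > 0 ? SLOW_INTERVAL : NORMAL_INTERVAL;
    }
}
```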