
[ML] Anomaly detection job reset can get stuck with no way to unblock #105928

Closed · droberts195 opened this issue on Mar 4, 2024 · 2 comments
Labels: >bug, :ml (Machine learning), Team:ML (Meta label for the ML team)

@droberts195 (Contributor)

When an anomaly detection job is reset, we start a local (non-persistent) task to perform the reset operations.

It is possible for this task to disappear before it completes. One well-known way this can happen is if the node it is running on dies mid-reset.

Looking at the code, you'd think that a way to unblock the reset would be to call the reset endpoint again for the same job. It appears that resetIfJobIsStillBlockedOnReset will be called from here:

ActionListener.wrap(r -> resetIfJobIsStillBlockedOnReset(task, request, listener), listener::onFailure)

However, this doesn't happen if the original task disappeared without a trace (for example because its node died) rather than failing with an error. What actually happens in this case is that the second reset call returns an error like this:

{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
  },
  "status": 404
}

This error comes from the failure handler of the get task call, cascading up through the chain of listeners.

Another problem is that if a reset ends up stalled because a node died, it's not very friendly to wait for a user to notice and retry it. Job deletions can have the same problem, and we retry those automatically as part of our nightly maintenance task.

Therefore, we should make two changes:

  1. Change the reset code so that resetIfJobIsStillBlockedOnReset is called not only from the success handler but also from the failure handler of a second or subsequent reset call, when a ResourceNotFoundException is the cause of the failure (a minimal sketch follows this list).
  2. Extend the ML daily maintenance task to check for resets that have no corresponding task and retry them, in the same way that deletes without corresponding tasks are retried.
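
For change 1, here is a minimal, self-contained sketch of the idea. The real change would live in the reset transport action and use Elasticsearch's ActionListener and ResourceNotFoundException; the small Listener interface and the waitForExistingResetTask helper below are simplified stand-ins for illustration only, while resetIfJobIsStillBlockedOnReset mirrors the method named above.

```java
// Sketch only: not the real transport action code.
import java.util.function.Consumer;

public final class ResetRetrySketch {

    /** Simplified stand-in for org.elasticsearch.action.ActionListener. */
    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);

        static <T> Listener<T> wrap(Consumer<T> responseHandler, Consumer<Exception> failureHandler) {
            return new Listener<T>() {
                @Override
                public void onResponse(T response) {
                    responseHandler.accept(response);
                }

                @Override
                public void onFailure(Exception e) {
                    failureHandler.accept(e);
                }
            };
        }
    }

    /** Simplified stand-in for Elasticsearch's ResourceNotFoundException. */
    static class ResourceNotFoundException extends RuntimeException {
        ResourceNotFoundException(String message) {
            super(message);
        }
    }

    /** Hypothetical: waits on the task id stored on the blocked job (a get-task call in reality). */
    void waitForExistingResetTask(String jobId, Listener<Void> listener) {
        // Details elided for the sketch.
    }

    /** Mirrors the real resetIfJobIsStillBlockedOnReset: re-runs the reset if the job is still blocked. */
    void resetIfJobIsStillBlockedOnReset(String jobId, Listener<Void> listener) {
        // Details elided for the sketch.
    }

    /**
     * Today the retry happens only from the success handler. The proposal is to
     * also retry from the failure handler when the stored task id points at a
     * task that no longer exists, which is what happens when the node running
     * the original reset task died.
     */
    void waitThenRetry(String jobId, Listener<Void> finalListener) {
        waitForExistingResetTask(jobId, Listener.wrap(
            r -> resetIfJobIsStillBlockedOnReset(jobId, finalListener),
            e -> {
                if (e instanceof ResourceNotFoundException) {
                    // The original task vanished without storing a result:
                    // retry the reset instead of returning a confusing 404.
                    resetIfJobIsStillBlockedOnReset(jobId, finalListener);
                } else {
                    finalListener.onFailure(e);
                }
            }
        ));
    }
}
```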
droberts195 added the >bug and :ml (Machine learning) labels on Mar 4, 2024
elasticsearchmachine added the Team:ML (Meta label for the ML team) label on Mar 4, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@droberts195 (Contributor, Author)

For reference, the PR that added the daily retries for deletes was #60121.
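
Change 2 could follow the same shape as those delete retries. The sketch below is illustrative only: the real logic would live in the ML daily maintenance task, and findJobsBlockedOnReset, taskExists and retryReset are hypothetical helper names, not existing methods.

```java
// Sketch only: hypothetical helpers, not the real maintenance-task API.
import java.util.List;

public final class MaintenanceRetrySketch {

    /** A job that is blocked on reset, together with the task id it recorded. */
    record BlockedJob(String jobId, String blockedTaskId) {}

    /** Would query the ML config for jobs whose blocked reason is "reset". */
    List<BlockedJob> findJobsBlockedOnReset() {
        return List.of();
    }

    /** Would check the task management API for the recorded task id. */
    boolean taskExists(String taskId) {
        return false;
    }

    /** Would re-run the reset, as the nightly task already re-runs stalled deletes. */
    void retryReset(String jobId) {
    }

    /**
     * Proposed extra step for the nightly maintenance run: any job that is
     * blocked on reset but whose reset task has disappeared (for example
     * because its node died) gets its reset retried automatically, so users
     * do not have to notice the stuck job and retry it by hand.
     */
    void retryStalledResets() {
        for (BlockedJob job : findJobsBlockedOnReset()) {
            if (taskExists(job.blockedTaskId()) == false) {
                retryReset(job.jobId());
            }
        }
    }
}
```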

@jan-elastic jan-elastic self-assigned this Mar 5, 2024
jgowdyelastic added a commit to elastic/kibana that referenced this issue Mar 12, 2024
Fixes two issues:
- When a job is in a blocked state (resetting, deleting, reverting) but the underlying task [cannot be found](elastic/elasticsearch#105928), the task polling fails to start correctly and instead enters a loop where the tasks are checked as fast as possible.
- Some tasks can legitimately take a long time to run, but we still poll at the same 2 second rate.

This PR fixes the feedback loop and adds a check: once a poll has been running for over a minute, the poll interval is increased to 2 minutes.

Related to #171626
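
The backoff described in that commit is simple interval selection. As a language-neutral illustration (written in Java to match the other sketches in this issue, whereas the actual Kibana change is TypeScript), it could look like the sketch below; the constants and the nextPollInterval helper are assumptions, not the real Kibana code.

```java
// Sketch of the polling backoff idea: poll at the normal rate, and once
// polling has been going on for more than a minute, slow down to two minutes.
import java.time.Duration;
import java.time.Instant;

public final class PollBackoffSketch {

    static final Duration NORMAL_INTERVAL = Duration.ofSeconds(2);
    static final Duration SLOW_INTERVAL = Duration.ofMinutes(2);
    static final Duration SLOW_DOWN_AFTER = Duration.ofMinutes(1);

    /** Interval to wait before the next poll, given when polling started. */
    static Duration nextPollInterval(Instant pollingStartedAt, Instant now) {
        Duration elapsed = Duration.between(pollingStartedAt, now);
        return elapsed.compareTo(SLOW_DOWN_AFTER) > 0 ? SLOW_INTERVAL : NORMAL_INTERVAL;
    }
}
```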