When an anomaly detection job is reset, we start a local (non-persistent) task to perform the reset operations.
It is possible for this task to disappear before it completes. One well-known way this can happen is if the node it is running on dies mid-reset.
Looking at the code, you'd think that a way to unblock the reset would be to call the reset endpoint again for the same job. It appears that `resetIfJobIsStillBlockedOnReset` will be called from here: `x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportResetJobAction.java`, line 127 (as of commit f77c16b).
However, this doesn't happen if the original task disappeared without a trace (for example, because its node died) rather than failing with an error. What actually happens in this case is that the second reset call returns an error like this:
```json
{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
  },
  "status": 404
}
```
This is coming from the error handler of the get task call cascading through the listeners from here: `x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportResetJobAction.java`, line 164 (as of commit f77c16b).
Another problem is that if a reset ends up stalled because of a node dying, it's not very user-friendly to wait until a user retries it manually. Job deletions can have the same problem, and we retry those automatically as part of our nightly maintenance task.
Therefore, we should make two changes:
1. Change the reset code so that we call `resetIfJobIsStillBlockedOnReset` not just in the success handler but also in the failure handler of a second or subsequent reset call, if a `ResourceNotFoundException` is the cause of the failure (see the sketch after this list).
2. Extend the code in the ML daily maintenance task to check for resets that have no corresponding task and retry them, in the same way that deletes without corresponding tasks are retried.
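For the first change, here is a minimal sketch of what the failure handler could look like, assuming the `ActionListener`-based structure of `TransportResetJobAction` (the names `task`, `request`, and `listener` are placeholders from context, not the actual code):

```java
// Hypothetical sketch only: mirrors the ActionListener pattern used in
// TransportResetJobAction, but the names and structure are illustrative.
ActionListener<GetTaskResponse> getTaskListener = ActionListener.wrap(
    getTaskResponse -> resetIfJobIsStillBlockedOnReset(task, request, listener),
    e -> {
        if (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) {
            // The original reset task vanished without storing a result
            // (e.g. its node died), so retry the reset instead of
            // propagating the 404 back to the caller.
            resetIfJobIsStillBlockedOnReset(task, request, listener);
        } else {
            listener.onFailure(e);
        }
    }
);
```

In the real code this would only apply on a second or subsequent reset call, where the job's blocked state shows a reset was already attempted.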
Fixes two issues:
- When a job is in a blocked state (resetting, deleting reverting) but
the underlying task [cannot be
found](elastic/elasticsearch#105928), the task
polling fails to start correctly and instead enters a loop where the
tasks are checked as fast as possible.
- Some tasks can legitimately take a long time to run, but we still poll
at the same 2 second rate.
This PR fixes the feedback loop and adds a check for when a poll has
been running for over a minute, the poll interval is increased to 2
minutes.
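A minimal sketch of that backoff idea, with hypothetical class and method names rather than the actual change (which lives in Kibana; Java is used here to match the rest of this issue):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical poller that computes its next delay from how long it has
// been polling: 2s at first, backing off to 2m after a minute.
final class BlockedJobPoller {
    private static final Duration FAST_POLL = Duration.ofSeconds(2);
    private static final Duration SLOW_POLL = Duration.ofMinutes(2);
    private static final Duration BACKOFF_AFTER = Duration.ofMinutes(1);

    private final Instant started = Instant.now();

    Duration nextPollDelay() {
        Duration elapsed = Duration.between(started, Instant.now());
        return elapsed.compareTo(BACKOFF_AFTER) > 0 ? SLOW_POLL : FAST_POLL;
    }
}
```

Growing the delay keeps feedback fast for quick operations while avoiding needless polling load for long-running ones.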
Related to #171626