[ML] _all requests can suffer "job not found" errors #37959

Closed
droberts195 opened this issue Jan 29, 2019 · 1 comment · Fixed by #38113
Labels: >bug :ml Machine learning

droberts195 commented Jan 29, 2019

(Migrated from #37545 (comment) to improve visibility.)

The failure of https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/166/ showed that a request to perform an ML operation on _all can return an error saying that it could not find an entity it expected to find.

For example, closing _all jobs might return an error that job foo does not exist. Or stopping _all datafeeds might return an error that datafeed bar does not exist.

This seems completely crazy, as it's obvious that _all should only include entities that exist.

This can happen because our actions chain together multiple base-level Elasticsearch actions, and entities can be deleted between these base-level steps. For example:

  1. Alice requests force delete of job foo
  2. Bob requests close _all jobs
  3. Bob's request to close _all jobs expands _all to foo and bar
  4. Alice's request to force delete foo removes the config associated with job foo
  5. Bob's request to close _all jobs attempts to find the config for job foo
  6. Bob's request to close _all fails because the config for job foo does not exist
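
The interleaving above can be replayed deterministically in a small simulation. This is only an illustrative sketch, not Elasticsearch code: the `configs` dict stands in for the job config store, and `expand_all`, `force_delete`, `close_job` and `ResourceNotFound` are hypothetical names invented for this example.

```python
class ResourceNotFound(Exception):
    """Hypothetical stand-in for a "job not found" error."""


# Hypothetical in-memory store standing in for the job configs.
configs = {"foo": {"state": "opened"}, "bar": {"state": "opened"}}


def expand_all():
    # Step 3: expand _all to the job IDs that exist at this moment.
    return sorted(configs)


def force_delete(job_id):
    # Step 4: remove the config associated with the job.
    configs.pop(job_id, None)


def close_job(job_id):
    # Step 5: look the config up again; it may have vanished since expansion.
    if job_id not in configs:
        raise ResourceNotFound(f"job [{job_id}] does not exist")
    configs[job_id]["state"] = "closed"


expanded = expand_all()   # Bob: _all expands to bar and foo
force_delete("foo")       # Alice: force delete lands between Bob's steps

error = None
try:
    for job_id in expanded:  # Bob: close every job from the stale expansion
        close_job(job_id)
except ResourceNotFound as e:
    error = str(e)           # Step 6: the whole _all request fails

print(error)
```

The key point is that `expanded` is a snapshot: nothing ties it to the state of `configs` at the time each job is actually closed.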

Although the test failure that highlighted this problem was a 6.5 test run, I suspect the problem is worse in 6.6 and above because expanding _all requires a search for configs in an index rather than just looking in the (in-memory on all nodes) cluster state.

ML actions that operate on _all should silently ignore failures to find entities from the original expansion of _all, on the assumption that these entities have been deleted by a concurrent request.
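
A minimal sketch of that behaviour, under stated assumptions: the names `close_expanded`, `from_wildcard` and `ResourceNotFound` are hypothetical and do not correspond to the actual Elasticsearch implementation. A missing entity is skipped only when its ID came from expanding _all; an explicitly named entity that is missing still produces an error.

```python
class ResourceNotFound(Exception):
    """Hypothetical stand-in for a "job not found" error."""


def close_expanded(configs, job_ids, from_wildcard):
    """Close each job in job_ids and return the IDs actually closed.

    configs is a hypothetical dict standing in for the job config store.
    from_wildcard is True when job_ids came from expanding _all.
    """
    closed = []
    for job_id in job_ids:
        config = configs.get(job_id)
        if config is None:
            if from_wildcard:
                # Assume a concurrent request deleted the job after _all
                # was expanded, and carry on with the remaining jobs.
                continue
            # The caller named this job explicitly, so it is a real error.
            raise ResourceNotFound(f"job [{job_id}] does not exist")
        config["state"] = "closed"
        closed.append(job_id)
    return closed


configs = {"foo": {"state": "opened"}, "bar": {"state": "opened"}}
expanded = sorted(configs)   # _all expands to bar and foo
del configs["foo"]           # concurrent force delete of foo

closed = close_expanded(configs, expanded, from_wildcard=True)
print(closed)
```

Keeping the strict behaviour for explicitly named IDs means a typo in a job ID is still reported, while a stale _all expansion no longer fails the whole request.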

@droberts195 droberts195 added >bug :ml Machine learning labels Jan 29, 2019
@elasticmachine

Pinging @elastic/ml-core
