[ML] _all requests can suffer "job not found" errors #37959

droberts195 · 2019-01-29T09:47:09Z

(Migrated from #37545 (comment) to improve visibility.)

The failure of https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/166/ showed that a request to perform an ML operation on _all can return an error saying that it could not find an entity it expected to find.

For example, closing _all jobs might return an error that job foo does not exist. Or stopping _all datafeeds might return an error that datafeed bar does not exist.

This seems completely crazy, as it's obvious that _all should only include entities that exist.

The reason this can happen is that our actions involve multiple base level Elasticsearch actions chained together, and entities could be deleted in between these base level steps. For example:

1. Alice requests force delete of job foo
2. Bob requests close _all jobs
3. Bob's request to close _all jobs expands _all to foo and bar
4. Alice's request to force delete foo removes the config associated with job foo
5. Bob's request to close _all jobs attempts to find the config for job foo
6. Bob's request to close _all fails because the config for job foo does not exist
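
To make the interleaving concrete, here is a minimal single-threaded replay of the steps above. It is only a sketch: `JobStore`, `ResourceNotFound` and `replay_race` are hypothetical names for illustration, not Elasticsearch classes, and the error message is made up.

```python
class ResourceNotFound(Exception):
    """Stand-in for the "job not found" error (not the real ES exception)."""


class JobStore:
    """Toy in-memory config store; hypothetical, not an Elasticsearch API."""

    def __init__(self, job_ids):
        self._configs = {job_id: {"state": "opened"} for job_id in job_ids}

    def expand_all(self):
        # First base-level step: snapshot the IDs that exist right now.
        return sorted(self._configs)

    def force_delete(self, job_id):
        self._configs.pop(job_id, None)

    def close(self, job_id):
        # Later base-level step: a separate lookup, so it can miss a job
        # that was deleted after the expansion.
        if job_id not in self._configs:
            raise ResourceNotFound(f"No known job with id '{job_id}'")
        self._configs[job_id]["state"] = "closed"


def replay_race():
    store = JobStore(["foo", "bar"])
    expanded = store.expand_all()   # Bob: _all expands to bar and foo
    store.force_delete("foo")       # Alice: force delete lands in between
    try:
        for job_id in expanded:     # Bob: acts on the now-stale expansion
            store.close(job_id)
    except ResourceNotFound as e:
        return expanded, str(e)
    return expanded, None
```

In this replay Bob's request fails on foo even though he never named foo himself.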

Although the test failure that highlighted this problem was a 6.5 test run, I suspect the problem is worse in 6.6 and above because expanding _all requires a search for configs in an index rather than just looking in the (in-memory on all nodes) cluster state.

ML actions that operate on _all should silently ignore failures to find entities from the original expansion of _all, on the assumption that these entities have been deleted by a concurrent request.
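
A rough sketch of that behaviour follows, assuming a plain dict of configs. The function `close_jobs` and its `after_expand` hook are hypothetical and exist only to illustrate the idea; this is not the actual fix.

```python
class ResourceNotFound(Exception):
    """Stand-in for the "job not found" error (not the real ES exception)."""


def close_jobs(configs, requested, after_expand=None):
    """Close jobs in `configs` (dict of job_id -> state dict).

    `requested` is either "_all" or an explicit list of job IDs.
    `after_expand` is a hypothetical hook used here only to simulate a
    concurrent delete between the expansion and the config lookups.
    """
    from_all = requested == "_all"
    job_ids = sorted(configs) if from_all else list(requested)
    if after_expand is not None:
        after_expand()

    closed = []
    for job_id in job_ids:
        config = configs.get(job_id)
        if config is None:
            if from_all:
                # The ID came from expanding _all, so assume a concurrent
                # request deleted the job and skip it silently.
                continue
            # The caller named this job explicitly, so still report it.
            raise ResourceNotFound(f"No known job with id '{job_id}'")
        config["state"] = "closed"
        closed.append(job_id)
    return closed
```

The distinction that matters is where the ID came from: a missing job is ignored only when it originated from the _all expansion, while an explicitly requested job that does not exist is still an error.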


elasticmachine · 2019-01-29T09:47:12Z

Pinging @elastic/ml-core

droberts195 added >bug :ml Machine learning labels Jan 29, 2019

droberts195 mentioned this issue Jan 29, 2019

Unexpected job state [failed] while waiting for job to be opened #37545

Closed

benwtrent self-assigned this Jan 31, 2019

benwtrent mentioned this issue Jan 31, 2019

ML: Fix error race condition on stop _all datafeeds and close _all jobs #38113

Merged

benwtrent closed this as completed in #38113 Feb 1, 2019
