
[CI] [ML] RestoreModelSnapshotIT failures #36849

Closed
davidkyle opened this issue Dec 19, 2018 · 4 comments
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@davidkyle
Member

There were two recent failures of this test:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=virtual&&linux/128/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+periodic/382/console

Both failed with the same error:

Expected: <29624L>
     but: was <29658L>
	at __randomizedtesting.SeedInfo.seed([1A6B5FAED67A2663:923F607478864B9B]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.ml.integration.RestoreModelSnapshotIT.lambda$test$2(RestoreModelSnapshotIT.java:92)
./gradlew :x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner \
  -Dtests.seed=1A6B5FAED67A2663 \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.RestoreModelSnapshotIT \
  -Dtests.method="test" \
  -Dtests.security.manager=true \
  -Dtests.locale=en-AU \
  -Dtests.timezone=Africa/Malabo \
  -Dcompiler.java=11 \
  -Druntime.java=11

./gradlew :x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner \
  -Dtests.seed=ED6307FA54D821E \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.RestoreModelSnapshotIT \
  -Dtests.method="test" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-JO \
  -Dtests.timezone=Europe/San_Marino \
  -Dcompiler.java=11 \
  -Druntime.java=8
davidkyle added the >test-failure and :ml labels on Dec 19, 2018
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@davidkyle
Member Author

Muted on the 6.x branch in e31f80c

The problem is a version conflict exception when updating the job:

[2018-12-19T15:03:24,126][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-2] [restore-model-snapshot-job] job closed
[2018-12-19T15:03:24,159][ERROR][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-2] [restore-model-snapshot-job] Failed to update job with new established model memory [29606]
org.elasticsearch.transport.RemoteTransportException: [node-0][127.0.0.1:37085][cluster:admin/xpack/ml/job/update]
Caused by: org.elasticsearch.index.engine.VersionConflictEngineException: [doc][anomaly_detector-restore-model-snapshot-job]: version conflict, current version [7] is different than the one provided [6]
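
For context, the exception comes from Elasticsearch's optimistic concurrency control: an update carrying a stale document version is rejected. A minimal sketch of the idea, not the actual engine code, using hypothetical names:

```java
// Hypothetical, minimal sketch of the optimistic-concurrency check behind
// VersionConflictEngineException: an update carrying a stale version is rejected.
import java.util.concurrent.atomic.AtomicLong;

class VersionedDoc {
    private final AtomicLong version = new AtomicLong(6); // current version in the index

    // Mirrors a versioned update: succeeds only if the caller read the latest version.
    boolean update(long expectedVersion) {
        return version.compareAndSet(expectedVersion, expectedVersion + 1);
    }
}

class ConflictDemo {
    public static void main(String[] args) {
        VersionedDoc doc = new VersionedDoc();
        doc.update(6);              // a concurrent writer bumps the doc to version 7
        boolean ok = doc.update(6); // our update still carries version 6 ...
        System.out.println(ok);     // ... so it is rejected: "current version [7]
                                    // is different than the one provided [6]"
    }
}
```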

From the order of the log messages, AutodetectCommunicator logs [restore-model-snapshot-job] job closed before the update fails, so there is a problem in AutoDetectResultProcessor.awaitCompletion(): it has not waited properly for completion, probably because jobUpdateSemaphore can be acquired before the update starts. This class is a bit racy and needs a rethink.
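
A simplified sketch of the suspected race, with hypothetical names standing in for the real asynchronous machinery:

```java
// Hypothetical, simplified sketch of the suspected race in
// AutoDetectResultProcessor: the semaphore is acquired only after an
// asynchronous lookup, leaving a window in which awaitCompletion() can
// acquire and release it and return before the update has even started.
import java.util.concurrent.Semaphore;

class RacyProcessor {
    private final Semaphore jobUpdateSemaphore = new Semaphore(1);

    void updateEstablishedModelMemoryOnJob() {
        getEstablishedMemoryUsageAsync(memory -> {   // async; may take a while
            try {
                jobUpdateSemaphore.acquire();        // acquired too late ...
                updateJob(memory);                   // ... so the update can run
            } catch (InterruptedException e) {       //     after awaitCompletion()
                Thread.currentThread().interrupt();  //     has already returned
            } finally {
                jobUpdateSemaphore.release();
            }
        });
    }

    void awaitCompletion() throws InterruptedException {
        jobUpdateSemaphore.acquire();   // nothing stops this from succeeding in
        jobUpdateSemaphore.release();   // the window before the callback runs
    }

    // Hypothetical stand-ins for the real asynchronous machinery.
    interface MemoryListener { void onResponse(long memory); }
    void getEstablishedMemoryUsageAsync(MemoryListener l) { /* ... */ }
    void updateJob(long memory) { /* persists the job update */ }
}
```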

@droberts195
Contributor

I had a PR build fail with a similar version conflict exception: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/3034/console

11:39:25 [2018-12-20T11:38:44,673][ERROR][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-1] [overall-buckets-test] Failed to update job with new established model memory [30196]
11:39:25 org.elasticsearch.index.engine.VersionConflictEngineException: [doc][anomaly_detector-overall-buckets-test]: version conflict, current version [5] is different than the one provided [4]

The job this happened to was:

11:39:25 [2018-12-20T11:33:05,769][WARN ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-1] [lookback-job-stopped-then-killed] some results not processed due to the process being killed

So it's not surprising that results were still being processed at the time the job completion code ran.

It's further evidence of a race though. Now that these updates no longer go via the cluster state thread, we need to be more careful to ensure updates to the job come from only one thread at any phase in the job lifecycle.
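
One way to guarantee that, sketched here with hypothetical names rather than the actual fix, is to funnel every update to a job through a single-threaded executor so writes are serialized by construction:

```java
// Hypothetical sketch: serializing all updates to a job document through one
// single-threaded executor, so no two writers can race on the document version.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SerializedJobUpdater {
    // One thread per updater: submitted tasks run strictly one after another.
    private final ExecutorService updateExecutor = Executors.newSingleThreadExecutor();

    void submitUpdate(Runnable update) {
        updateExecutor.submit(update);  // queued; never concurrent with another update
    }

    void shutdown() {
        updateExecutor.shutdown();
    }
}
```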

droberts195 pushed a commit that referenced this issue Dec 20, 2018
(#36856)

There was a race where the job update in `AutoDetectResultProcessor.updateEstablishedModelMemoryOnJob`
could execute after `AutoDetectResultProcessor.awaitCompletion`
returned. This was because `jobUpdateSemaphore` was acquired
after the call to `jobResultsProvider.getEstablishedMemoryUsage`,
and during that call `awaitCompletion` is free to acquire and
release the semaphore, after which the method returns.

This commit fixes the problem.

Closes #36849
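
In terms of the hypothetical sketch above, the fixed ordering acquires the semaphore before starting the asynchronous lookup, so awaitCompletion() can no longer slip through the window:

```java
// Hypothetical sketch of the fixed ordering, reusing the names from the
// RacyProcessor sketch above: the semaphore is held across the whole
// asynchronous lookup-and-update sequence, so awaitCompletion() blocks
// until the update has finished (error handling omitted for brevity).
import java.util.concurrent.Semaphore;

class FixedProcessor {
    private final Semaphore jobUpdateSemaphore = new Semaphore(1);

    void updateEstablishedModelMemoryOnJob() throws InterruptedException {
        jobUpdateSemaphore.acquire();                // acquired *before* the async call
        getEstablishedMemoryUsageAsync(memory -> {
            try {
                updateJob(memory);                   // still covered by the semaphore
            } finally {
                jobUpdateSemaphore.release();        // released once the update is done
            }
        });
    }

    void awaitCompletion() throws InterruptedException {
        jobUpdateSemaphore.acquire();                // now blocks until the update finishes
        jobUpdateSemaphore.release();
    }

    // Hypothetical stand-ins for the real asynchronous machinery.
    interface MemoryListener { void onResponse(long memory); }
    void getEstablishedMemoryUsageAsync(MemoryListener l) { /* ... */ }
    void updateJob(long memory) { /* persists the job update */ }
}
```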
davidkyle added a commit that referenced this issue Jan 2, 2019
Enable the previously muted test.
This reverts commit e31f80c.
@davidkyle
Member Author

Closed by #36856, and the test was unmuted in 8d40f4b
