
[CI] [ML] RestoreModelSnapshotIT failures #36849

Closed
davidkyle opened this issue Dec 19, 2018 · 4 comments
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@davidkyle
Member

There were two recent failures of this test:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=virtual&&linux/128/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+periodic/382/console

Both failed with the same error:

Expected: <29624L>
     but: was <29658L>
	at __randomizedtesting.SeedInfo.seed([1A6B5FAED67A2663:923F607478864B9B]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.ml.integration.RestoreModelSnapshotIT.lambda$test$2(RestoreModelSnapshotIT.java:92)
./gradlew :x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner \
  -Dtests.seed=1A6B5FAED67A2663 \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.RestoreModelSnapshotIT \
  -Dtests.method="test" \
  -Dtests.security.manager=true \
  -Dtests.locale=en-AU \
  -Dtests.timezone=Africa/Malabo \
  -Dcompiler.java=11 \
  -Druntime.java=11

./gradlew :x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner \
  -Dtests.seed=ED6307FA54D821E \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.RestoreModelSnapshotIT \
  -Dtests.method="test" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-JO \
  -Dtests.timezone=Europe/San_Marino \
  -Dcompiler.java=11 \
  -Druntime.java=8
davidkyle added the >test-failure and :ml labels on Dec 19, 2018
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@davidkyle
Member Author

Muted on the 6.x branch in e31f80c

The problem is a version conflict exception when updating the job:

[2018-12-19T15:03:24,126][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [node-2] [restore-model-snapshot-job] job closed
[2018-12-19T15:03:24,159][ERROR][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-2] [restore-model-snapshot-job] Failed to update job with new established model memory [29606]
org.elasticsearch.transport.RemoteTransportException: [node-0][127.0.0.1:37085][cluster:admin/xpack/ml/job/update]
Caused by: org.elasticsearch.index.engine.VersionConflictEngineException: [doc][anomaly_detector-restore-model-snapshot-job]: version conflict, current version [7] is different than the one provided [6]
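
For context, the exception comes from Elasticsearch's optimistic concurrency control: an update carrying a stale document version is rejected. A minimal sketch of the idea, not the actual engine code, using hypothetical names:

```java
// Hypothetical, minimal sketch of the optimistic-concurrency check behind
// VersionConflictEngineException: an update carrying a stale version is rejected.
import java.util.concurrent.atomic.AtomicLong;

class VersionedDoc {
    private final AtomicLong version = new AtomicLong(6); // current version in the index

    // Mirrors a versioned update: succeeds only if the caller read the latest version.
    boolean update(long expectedVersion) {
        return version.compareAndSet(expectedVersion, expectedVersion + 1);
    }
}

class ConflictDemo {
    public static void main(String[] args) {
        VersionedDoc doc = new VersionedDoc();
        doc.update(6);              // a concurrent writer bumps the doc to version 7
        boolean ok = doc.update(6); // our update still carries version 6 ...
        System.out.println(ok);     // ... so it is rejected: "current version [7]
                                    // is different than the one provided [6]"
    }
}
```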

From the order of the log messages, AutodetectCommunicator logs [restore-model-snapshot-job] job closed before the update fails, so there is a problem in AutoDetectResultProcessor.awaitCompletion(): it has not waited properly for completion, probably because jobUpdateSemaphore can be acquired before the update starts. This class is a bit racy and needs a rethink.
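
A simplified sketch of the suspected race, with hypothetical names standing in for the real asynchronous machinery:

```java
// Hypothetical, simplified sketch of the suspected race in
// AutoDetectResultProcessor: the semaphore is acquired only after an
// asynchronous lookup, leaving a window in which awaitCompletion() can
// acquire and release it and return before the update has even started.
import java.util.concurrent.Semaphore;

class RacyProcessor {
    private final Semaphore jobUpdateSemaphore = new Semaphore(1);

    void updateEstablishedModelMemoryOnJob() {
        getEstablishedMemoryUsageAsync(memory -> {   // async; may take a while
            try {
                jobUpdateSemaphore.acquire();        // acquired too late ...
                updateJob(memory);                   // ... so the update can run
            } catch (InterruptedException e) {       //     after awaitCompletion()
                Thread.currentThread().interrupt();  //     has already returned
            } finally {
                jobUpdateSemaphore.release();
            }
        });
    }

    void awaitCompletion() throws InterruptedException {
        jobUpdateSemaphore.acquire();   // nothing stops this from succeeding in
        jobUpdateSemaphore.release();   // the window before the callback runs
    }

    // Hypothetical stand-ins for the real asynchronous machinery.
    interface MemoryListener { void onResponse(long memory); }
    void getEstablishedMemoryUsageAsync(MemoryListener l) { /* ... */ }
    void updateJob(long memory) { /* persists the job update */ }
}
```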

@droberts195
Contributor

I had a PR build fail with a similar version conflict exception: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/3034/console

11:39:25 [2018-12-20T11:38:44,673][ERROR][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-1] [overall-buckets-test] Failed to update job with new established model memory [30196]
11:39:25 org.elasticsearch.index.engine.VersionConflictEngineException: [doc][anomaly_detector-overall-buckets-test]: version conflict, current version [5] is different than the one provided [4]

The job this happened to was:

11:39:25 [2018-12-20T11:33:05,769][WARN ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-1] [lookback-job-stopped-then-killed] some results not processed due to the process being killed

So it's not surprising that results were still being processed at the time the job completion code ran.

It's further evidence of a race though. Now that these updates no longer go via the cluster state thread, we need to be more careful to ensure updates to the job come from only one thread at any phase in the job lifecycle.
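
One way to guarantee that, sketched here with hypothetical names rather than the actual fix, is to funnel every update to a job through a single-threaded executor so writes are serialized by construction:

```java
// Hypothetical sketch: serializing all updates to a job document through one
// single-threaded executor, so no two writers can race on the document version.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SerializedJobUpdater {
    // One thread per updater: submitted tasks run strictly one after another.
    private final ExecutorService updateExecutor = Executors.newSingleThreadExecutor();

    void submitUpdate(Runnable update) {
        updateExecutor.submit(update);  // queued; never concurrent with another update
    }

    void shutdown() {
        updateExecutor.shutdown();
    }
}
```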

droberts195 pushed a commit that referenced this issue Dec 20, 2018
(#36856)

There was a race where the job update in `AutoDetectResultProcessor.updateEstablishedModelMemoryOnJob`
could execute after `AutoDetectResultProcessor.awaitCompletion`
returned. This was because `jobUpdateSemaphore` was acquired
after the call to `jobResultsProvider.getEstablishedMemoryUsage`,
and during that call `awaitCompletion` is free to acquire and
release the semaphore, after which the method returns.

This commit fixes the problem.

Closes #36849
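
In terms of the hypothetical sketch above, the fixed ordering acquires the semaphore before starting the asynchronous lookup, so awaitCompletion() can no longer slip through the window:

```java
// Hypothetical sketch of the fixed ordering, reusing the names from the
// RacyProcessor sketch above: the semaphore is held across the whole
// asynchronous lookup-and-update sequence, so awaitCompletion() blocks
// until the update has finished (error handling omitted for brevity).
import java.util.concurrent.Semaphore;

class FixedProcessor {
    private final Semaphore jobUpdateSemaphore = new Semaphore(1);

    void updateEstablishedModelMemoryOnJob() throws InterruptedException {
        jobUpdateSemaphore.acquire();                // acquired *before* the async call
        getEstablishedMemoryUsageAsync(memory -> {
            try {
                updateJob(memory);                   // still covered by the semaphore
            } finally {
                jobUpdateSemaphore.release();        // released once the update is done
            }
        });
    }

    void awaitCompletion() throws InterruptedException {
        jobUpdateSemaphore.acquire();                // now blocks until the update finishes
        jobUpdateSemaphore.release();
    }

    // Hypothetical stand-ins for the real asynchronous machinery.
    interface MemoryListener { void onResponse(long memory); }
    void getEstablishedMemoryUsageAsync(MemoryListener l) { /* ... */ }
    void updateJob(long memory) { /* persists the job update */ }
}
```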
davidkyle added a commit that referenced this issue Jan 2, 2019
Enable the previously muted test.
This reverts commit e31f80c.
@davidkyle
Member Author

Closed by #36856, and the test was unmuted in 8d40f4b
