Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Failures in IndexShardTests#testRollbackReplicaEngineOnPromotion #35850

Closed
cbuescher opened this issue Nov 23, 2018 · 7 comments · Fixed by #36116
Closed

[CI] Failures in IndexShardTests#testRollbackReplicaEngineOnPromotion #35850

cbuescher opened this issue Nov 23, 2018 · 7 comments · Fixed by #36116
Assignees
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI

Comments

@cbuescher
Copy link
Member

Build failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/263/console

java.lang.AssertionError: 
Expected: <70L>
     but: was <79L>
	at __randomizedtesting.SeedInfo.seed([751069218676973A:B78A6A21D330A536]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.index.shard.IndexShardTests.testRollbackReplicaEngineOnPromotion(IndexShardTests.java:1129)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)

Reproduce line (doesn't reproduce locally for me so far):

 ./gradlew :server:test -Dtests.seed=751069218676973A -Dtests.class=org.elasticsearch.index.shard.IndexShardTests -Dtests.method="testRollbackReplicaEngineOnPromotion" -Dtests.security.manager=true -Dtests.locale=en-CA -Dtests.timezone=Australia/Perth -Dcompiler.java=11 -Druntime.java=8
@cbuescher cbuescher added >test-failure Triaged test failures from CI :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Nov 23, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@cbuescher
Copy link
Member Author

Also we ran into this in a PR test over at #34112, there the error looks slightly different though:

Build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/220/console

11:25:37 FAILURE 0.68s J6 | IndexShardTests.testRestoreLocalHistoryFromTranslogOnPromotion <<< FAILURES!
11:25:37    > Throwable #1: java.lang.AssertionError: 
11:25:37    > Expected: <673L>
11:25:37    >      but: was <30L>
11:25:37    > 	at __randomizedtesting.SeedInfo.seed([A945C21B27945AD0:198FBCBAD6556E29]:0)
11:25:37    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
11:25:37    > 	at org.elasticsearch.index.shard.IndexShardTests.testRestoreLocalHistoryFromTranslogOnPromotion(IndexShardTests.java:1081)
11:25:37    > 	at java.lang.Thread.run(Thread.java:748)
./gradlew :server:test -Dtests.seed=A945C21B27945AD0 -Dtests.class=org.elasticsearch.index.shard.IndexShardTests -Dtests.method="testRestoreLocalHistoryFromTranslogOnPromotion" -Dtests.security.manager=true -Dtests.locale=ar-BH -Dtests.timezone=Asia/Aqtobe -Dcompiler.java=11 -Druntime.java=8

Note this is a different method ("testRestoreLocalHistoryFromTranslogOnPromotion") and on master, but same test class and at least correlated by time. Maybe related, just filing here in case this gives further hints.

@cbuescher
Copy link
Member Author

Another one that looks related, this time on master and again in "testRestoreLocalHistoryFromTranslogOnPromotion":

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos/76/console

./gradlew :server:test -Dtests.seed=A2BDA22B8BC2A5C6 -Dtests.class=org.elasticsearch.index.shard.IndexShardTests -Dtests.method="testRestoreLocalHistoryFromTranslogOnPromotion" -Dtests.security.manager=true -Dtests.locale=zh-HK -Dtests.timezone=America/Argentina/Mendoza -Dcompiler.java=11 -Druntime.java=8
12:31:00    > Throwable #1: java.lang.AssertionError: 
[...]
12:31:00    > Expected: <491L>
12:31:00    >      but: was <110L>
12:31:00    > 	at __randomizedtesting.SeedInfo.seed([A2BDA22B8BC2A5C6:1277DC8A7A03913F]:0)
12:31:00    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
12:31:00    > 	at org.elasticsearch.index.shard.IndexShardTests.testRestoreLocalHistoryFromTranslogOnPromotion(IndexShardTests.java:1081)
12:31:00    > 	at java.lang.Thread.run(Thread.java:748)

cbuescher pushed a commit that referenced this issue Nov 23, 2018
@cbuescher
Copy link
Member Author

I muted both IndexShardTests#testRestoreLocalHistoryFromTranslogOnPromotion and IndexShardTests#testRollbackReplicaEngineOnPromotion on master and 6.x

cbuescher pushed a commit that referenced this issue Nov 23, 2018
@tlrx
Copy link
Member

tlrx commented Nov 23, 2018

Thanks @cbuescher !

@iverase
Copy link
Contributor

iverase commented Nov 26, 2018

The were some errors that seems related with this one. They reproduce locally for master and 6.x:

./gradlew :server:test   -Dtests.seed=91B36D5827A7C41   -Dtests.class=org.elasticsearch.index.shd.IndexShardTests   -Dtests.method="testOperationPermitOnReplicaShards"   -Dtests.security.manager=true   -Dtests.locale=ro   -Dtests.timezone=America/Scoresbysund   -Dcompiler.java=11   -Druntime.java=8

Output:

 1> [2018-11-26T06:04:01,946][INFO ][o.e.i.s.IndexShardTests  ] [testOperationPermitOnReplicaShards] before test
  1> [2018-11-26T06:04:03,582][INFO ][o.e.i.s.IndexShardTests  ] [testOperationPermitOnReplicaShards] shard routing to [index][0], node[tDRGSLjrNj], [R], s[STARTED], a[id=-RB7P6EUTeKztb4anytFxg]
  2> Nov 26, 2018 6:04:03 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
  2> WARNING: Uncaught exception in thread: Thread[elasticsearch[org.elasticsearch.index.shard.IndexShardTests][generic][T#2],5,TGRP-IndexShardTests]
  2> java.lang.AssertionError: 
  2> Expected: <78L>
  2>      but: was <74L>
  2>    at __randomizedtesting.SeedInfo.seed([91B36D5827A7C41]:0)
  2>    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
  2>    at org.junit.Assert.assertThat(Assert.java:956)
  2>    at org.junit.Assert.assertThat(Assert.java:923)
  2>    at org.elasticsearch.index.shard.IndexShardTests$8.onResponse(IndexShardTests.java:841)
  2>    at org.elasticsearch.index.shard.IndexShardTests$8.onResponse(IndexShardTests.java:837)
  2>    at org.elasticsearch.index.shard.IndexShard$4.onResponse(IndexShard.java:2506)
  2>    at org.elasticsearch.index.shard.IndexShard$4.onResponse(IndexShard.java:2484)
  2>    at org.elasticsearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:146)
  2>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:726)
  2>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  2>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  2>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  2>    at java.lang.Thread.run(Thread.java:748)
HEARTBEAT J0 PID(48961@ignacios-MBP): 2018-11-26T08:04:12, stalled for 10.3s at: IndexShardTests.testOperationPermitOnReplicaShards
HEARTBEAT J0 PID(48961@ignacios-MBP): 2018-11-26T08:04:22, stalled for 20.3s at: IndexShardTests.testOperationPermitOnReplicaShards
HEARTBEAT J0 PID(48961@ignacios-MBP): 2018-11-26T08:04:32, stalled for 30.3s at: IndexShardTests.testOperationPermitOnReplicaShards
HEARTBEAT J0 PID(48961@ignacios-MBP): 2018-11-26T08:04:42, stalled for 40.3s at: IndexShardTests.testOperationPermitOnReplicaShards
<============-> 97% EXECUTING [58s]
> :server:test > Suites: 0/1
> :server:test > Tests: completed: 0, failed: 0, ignored: 0
> :server:test > J0: IndexShardTests.testOperationPermitOnReplicaShards stalled for 40.3s

It actually leads to a timeout. I am not muting the test because fix seems to be applied soon.

@iverase
Copy link
Contributor

iverase commented Nov 26, 2018

Test IndexShardTests.testOperationPermitOnReplicaShards has been muted in master and 6.x

tlrx added a commit that referenced this issue Dec 4, 2018
…term update (#36116)

This commit changes how an operation which requires all index shard 
operations permits is executed when a primary term update is required: 
the operation and the update are combined so that the operation is 
executed after the primary term update under the same blocking 
operation.

Closes #35850

Co-authored-by: Yannick Welsch <yannick@welsch.lu>
tlrx added a commit that referenced this issue Dec 4, 2018
…term update (#36116)

This commit changes how an operation which requires all index shard
operations permits is executed when a primary term update is required:
the operation and the update are combined so that the operation is
executed after the primary term update under the same blocking
operation.

Closes #35850

Co-authored-by: Yannick Welsch <yannick@welsch.lu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants