Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Incorrect watch count in watcher stats api in tests #52453

Closed
martijnvg opened this issue Feb 18, 2020 · 9 comments
Closed

[CI] Incorrect watch count in watcher stats api in tests #52453

martijnvg opened this issue Feb 18, 2020 · 9 comments
Assignees
Labels
:Data Management/Watcher Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI

Comments

@martijnvg
Copy link
Member

martijnvg commented Feb 18, 2020

SmokeTestWatcherTestSuiteIT Failure:

java.lang.AssertionError: Watch count (from _watcher/stats) |  
-- | --
  | Expected: is <0> |  
  | but: was <1> |  
  | at __randomizedtesting.SeedInfo.seed([F316E92770C6654C:CE704783A70AC6]:0) |  
  | at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) |  
  | at org.junit.Assert.assertThat(Assert.java:956) |  
  | at org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.assertWatchCount(SmokeTestWatcherTestSuiteIT.java:270) |  
  | at org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.lambda$testMonitorClusterHealth$3(SmokeTestWatcherTestSuiteIT.java:170) |  
  | at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:880) |  
  | at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:853) |  
  | at org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth(SmokeTestWatcherTestSuiteIT.java:170)

The failure matches with recent failures reported in #32299. The #51466 fix didn't make this test stop from failing.

The failure has failed a few times now and needs to be re-investigated.

Build failures:

WatchAckTests.testAckAllActions failure:

Build log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+part2/3568/console

Build scan: https://gradle-enterprise.elastic.co/s/ua3yon2njbyja

Failure:

java.lang.AssertionError: 
12:01:55     Expected: is <1L>
12:01:55          but: was <0L>
12:01:55         at __randomizedtesting.SeedInfo.seed([6DEA8487F6CD3068:BAC8F0266D2192C2]:0)
12:01:55         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
12:01:55         at org.junit.Assert.assertThat(Assert.java:956)
12:01:55         at org.junit.Assert.assertThat(Assert.java:923)
12:01:55         at org.elasticsearch.xpack.watcher.test.integration.WatchAckTests.testAckAllActions(WatchAckTests.java:136)
12:01:55 REPRODUCE WITH: ./gradlew ':x-pack:plugin:watcher:test' --tests "org.elasticsearch.xpack.watcher.test.integration.WatchAckTests.testAckAllActions" -Dtests.seed=6DEA8487F6CD3068 -Dtests.security.manager=true -Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH -Dtests.timezone=Africa/Bamako -Dcompiler.java=13
12:01:55 
12:01:55 Suite: Test class org.elasticsearch.xpack.watcher.test.integration.WatchAckTests

Reproduce with:

./gradlew ':x-pack:plugin:watcher:test' --tests "org.elasticsearch.xpack.watcher.test.integration.WatchAckTests.testAckAllActions" -Dtests.seed=6DEA8487F6CD3068 -Dtests.security.manager=true -Dtests.locale=th-TH-u-nu-thai-x-lvariant-TH -Dtests.timezone=Africa/Bamako -Dcompiler.java=13

Can't reproduce locally.

@martijnvg martijnvg added >test-failure Triaged test failures from CI :Data Management/Watcher labels Feb 18, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Watcher)

@martijnvg martijnvg self-assigned this Feb 18, 2020
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 21, 2020
Add watcher to trigger server after index operation has succeeded,
instead of adding a watch to trigger service before
the actual index operation has performed on the shard level.

This logic is simpler to reason about in the case that a failure
does occur during the execution of an index operation on
the shard level.

Relates to elastic#52453, but I think doesn't fix it, but makes it easier
to debug.
@martijnvg martijnvg changed the title SmokeTestWatcherTestSuiteIT#testMonitorClusterHealth test failure Incorrect watch count reported from watcher stats api in tests Feb 27, 2020
@martijnvg martijnvg changed the title Incorrect watch count reported from watcher stats api in tests [C] Incorrect watch count in watcher stats api in tests Feb 27, 2020
@martijnvg martijnvg changed the title [C] Incorrect watch count in watcher stats api in tests [CI] Incorrect watch count in watcher stats api in tests Feb 27, 2020
@martijnvg
Copy link
Member Author

Reported by @dakrone in #33326:

09:55:26 
09:55:26 org.elasticsearch.smoketest.SmokeTestWatcherWithSecurityClientYamlTestSuiteIT > test {yaml=watcher/usage/10_basic/Test watcher usage stats output} FAILED
09:55:26     java.lang.AssertionError: Failure at [watcher/usage/10_basic:48]: field [watcher.count.active] is not greater than [$watch_count_active]
09:55:26     Expected: a value greater than <1>
09:55:26          but: <1> was equal to <1>
09:55:26         at __randomizedtesting.SeedInfo.seed([3AA5E1A0040CF825:B2F1DE7AAAF095DD]:0)
09:55:26         at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:405)
09:55:26         at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:382)
09:55:26         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
09:55:26         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

I was unable to reproduce this on the 7.x branch.

Failure:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_RUNTIME_JAVA=zulu11,nodes=general-purpose/538/consoleFull
https://gradle-enterprise.elastic.co/s/xxhruxec3jeji

@martijnvg
Copy link
Member Author

martijnvg commented Feb 28, 2020

This (^) is another test that failed because incorrect stats counts are reported.

I suspect the main cause of these failures is that watcher, is not fully started on all shard instances that it serves watches from. More specifically the WatcherIndexingListener maybe inactive for a specific shard. We change the tests to ensure that watcher is fully started, but on the other hand we can change the put watch api to check whether the WatcherIndexingListener is active prior to indexing. If it not ready wait similar to the timeout on index request (waiting for enough shard copies to be ready prior to indexing)?

martijnvg added a commit that referenced this issue Mar 2, 2020
Add watcher to trigger server after index operation has succeeded,
instead of adding a watch to trigger service before
the actual index operation has performed on the shard level.

This logic is simpler to reason about in the case that a failure
does occur during the execution of an index operation on
the shard level.

Relates to #52453, but I think doesn't fix it, but makes it easier
to debug.
@mark-vieira
Copy link
Contributor

SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed twice again today. I assume due to the nature of the underlying cause that muting isn't practical.

@martijnvg
Copy link
Member Author

I want to see how these tests respond to #52627.

Otherwise I think we should investigate changing the watcher put and delete APIs to wait for the watch to be added to the trigger service before returning a response. Tests assume that this always happens, but that is not the case. In the meantime specific tests can be muted.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 3, 2020
Add watcher to trigger server after index operation has succeeded,
instead of adding a watch to trigger service before
the actual index operation has performed on the shard level.

This logic is simpler to reason about in the case that a failure
does occur during the execution of an index operation on
the shard level.

Relates to elastic#52453, but I think doesn't fix it, but makes it easier
to debug.
martijnvg added a commit that referenced this issue Mar 3, 2020
Backport: #52627

Add watcher to trigger server after index operation has succeeded,
instead of adding a watch to trigger service before
the actual index operation has performed on the shard level.

This logic is simpler to reason about in the case that a failure
does occur during the execution of an index operation on
the shard level.

Relates to #52453, but I think doesn't fix it, but makes it easier
to debug.
@imotov
Copy link
Contributor

imotov commented Mar 3, 2020

@martijnvg
Copy link
Member Author

martijnvg commented Mar 4, 2020

I will mute the SmokeTestWatcherTestSuiteIT#testMonitorClusterHealth for now in master and 7.x

Update: I've used the wrong issue id in the commit message 🤦‍♂
master, 7.x

martijnvg added a commit that referenced this issue Mar 4, 2020
martijnvg added a commit that referenced this issue Mar 4, 2020
@jbaiera
Copy link
Member

jbaiera commented Apr 3, 2020

There was a failure today that was related to #33326, which looks like this issue replaces.

https://gradle-enterprise.elastic.co/s/dhqipxj2iuyfc

@rjernst rjernst added the Team:Data Management Meta label for data/management team label May 4, 2020
@jakelandis
Copy link
Contributor

There are no instances of these failures since May 11 (There are a couple SSL failures in a FIPs container ... but that is not what this issue is about)
This corresponds with #56556 was introduced to help address issues like this.

image

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Watcher Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

7 participants