Fix IndexAuditTrail rolling upgrade on rollover edge - take 2 #38286

albertzaharovits
Contributor

@albertzaharovits albertzaharovits commented Feb 3, 2019

Fixes a race during the rolling upgrade with the index audit output enabled.

The race is that after the upgraded node is restarted, it installs the audit template and updates the mapping of the "current" (from its perspective) audit index. But the template might be installed after a new daily rolled-over index has been created by the other old nodes, using the old templates.
However, the new node, even if it installs the template after the rollover edge, can accumulate audit events before the edge, and will correctly try to update the mapping of the audit index before the edge. But this way, the mapping of the index after the edge remains un-updated, because only the master node does the mapping updates.

The fix keeps the design of only allowing the master to update the mapping, but the master will now also try, on a best-effort basis, to update the mapping of the next rollover audit index.
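
To make the "current index plus next rollover index" idea concrete, here is a minimal, self-contained sketch. It is not the code from this PR: the class name, the `putMapping` callback, and the daily `prefix-yyyy.MM.dd` naming are assumptions made for illustration only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not the IndexAuditTrail implementation.
final class AuditRolloverSketch {
    // Assumed index prefix and daily rollover pattern.
    static final String PREFIX = ".security_audit_log";
    static final DateTimeFormatter DAILY = DateTimeFormatter.ofPattern("yyyy.MM.dd");

    // Name of the audit index that covers the given day.
    static String indexFor(LocalDate day) {
        return PREFIX + "-" + DAILY.format(day);
    }

    // Name of the index on the other side of the next rollover edge.
    static String nextIndexFor(LocalDate day) {
        return indexFor(day.plusDays(1));
    }

    // The master updates the current index (a failure propagates to the caller),
    // then the next index on a best-effort basis (a failure is swallowed).
    // Returns the names of the indices that were actually updated.
    static List<String> updateMappings(LocalDate today, Consumer<String> putMapping) {
        List<String> updated = new ArrayList<>();
        String current = indexFor(today);
        putMapping.accept(current);
        updated.add(current);

        String next = nextIndexFor(today);
        try {
            putMapping.accept(next);
            updated.add(next);
        } catch (RuntimeException e) {
            // Best effort only: the next index may not exist yet.
        }
        return updated;
    }
}
```

In this sketch a failure on the next index never blocks startup, which mirrors the "best effort only" debug log in the diff further down.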

This can be judged as a shot in the dark because I don't have access to the failure data anymore, but I think the breadcrumbs point in this direction. Moreover, turning up debug logging will make future failures easier to diagnose.

Relates #35988

Closes #33867 #37062

@elasticmachine
Collaborator

Pinging @elastic/es-security

@@ -337,7 +349,7 @@ public void onResponse(ClusterStateResponse clusterStateResponse) {
updateCurrentIndexMappingsIfNecessary(clusterStateResponse.getState());
} else if (TemplateUtils.checkTemplateExistsAndVersionMatches(INDEX_TEMPLATE_NAME,
SECURITY_VERSION_STRING, clusterStateResponse.getState(), logger,
- Version.CURRENT::onOrAfter) == false) {
+ Version.CURRENT::onOrBefore) == false) {
Contributor Author

Follow-up to #35988, but does not affect the failures in #33867 and #37062.
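
The difference between the two method references is easy to misread, so here is a small sketch of how each predicate treats a stale template. The `V` class is a hypothetical stand-in for a version type whose `onOrAfter` and `onOrBefore` compare as `>=` and `<=`; the numeric ids are made up.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for a version type; not the Elasticsearch Version class.
final class VersionPredicateSketch {
    static final class V {
        final int id;
        V(int id) { this.id = id; }
        boolean onOrAfter(V other) { return id >= other.id; }
        boolean onOrBefore(V other) { return id <= other.id; }
    }

    // Assumed version of the upgraded node.
    static final V CURRENT = new V(660);

    // CURRENT::onOrAfter accepts a template when CURRENT >= templateVersion,
    // so a template written by an older node also passes the check.
    static final Predicate<V> ON_OR_AFTER = CURRENT::onOrAfter;

    // CURRENT::onOrBefore accepts a template only when CURRENT <= templateVersion,
    // so a template written by an older node fails the check.
    static final Predicate<V> ON_OR_BEFORE = CURRENT::onOrBefore;
}
```

Under these assumptions, only the `onOrBefore` form rejects a template left behind by an older node, which is what lets the `== false` branch in the hunk above fire for the upgraded node.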

transitionStartingToInitialized();
}
} else {
@SuppressWarnings("unchecked")
Map<String, Object> meta = (Map<String, Object>) docMapping.sourceAsMap().get("_meta");
if (meta == null) {
logger.info("Missing _meta field in mapping [{}] of index [{}]", docMapping.type(), index);
throw new IllegalStateException("Cannot read security-version string in index " + index);
}
Contributor Author

Fixes #37062 (comment).

A non-master node detects an un-updated audit index and bails. Instead, it should hold off and retry. The index is un-updated because the master updated the mapping only for the index before the rollover edge ("the race": the template upgrade happened after the rollover edge, but the audit events on the master came before it).
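
A minimal sketch of "hold off and retry" as opposed to "bail", assuming a hypothetical `mappingUpToDate` check that a non-master node can poll. The class, method, and parameter names are illustrative and do not come from the PR.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a bounded hold-off-and-retry loop.
final class HoldOffAndRetry {
    // Polls the check up to maxAttempts times, sleeping delayMillis between
    // attempts. Returns true as soon as the check passes, false if every
    // attempt fails or the thread is interrupted while waiting.
    static boolean awaitMappingUpToDate(BooleanSupplier mappingUpToDate, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (mappingUpToDate.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false;
    }
}
```

The point of the bounded loop is that a non-master node gives the master time to catch up on the mapping update without failing on the first stale read.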

innerStart();
}, e2 -> {
// best effort only
logger.debug("Failed to update mappings on next audit index [{}]", nextIndex, e2);
Contributor Author

The master tries to update the mapping for the next rollover index, just in case.

@@ -217,6 +218,7 @@ subprojects {
setting 'xpack.security.enabled', 'true'
setting 'xpack.security.transport.ssl.enabled', 'true'
setting 'xpack.security.transport.ssl.keystore.path', 'testnode.jks'
setting 'logger.org.elasticsearch.xpack.security.audit.index', 'DEBUG'
Contributor Author

This should help diagnose possible future failures!

public void testAuditLogs() throws Exception {
assertBusy(() -> {
assertAuditDocsExist();
assertNumUniqueNodeNameBuckets(expectedNumUniqueNodeNameBuckets());
- });
+ }, 30, TimeUnit.SECONDS);
Contributor Author

Allows some slack for the old nodes to create and allocate a new audit index while the master is down for upgrade.
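
For readers unfamiliar with the pattern, the following is a rough stand-in for an `assertBusy`-style helper with an explicit timeout. It is a sketch under stated assumptions, not the test framework's implementation: the `BusyAssert` class and its backoff policy are invented here.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an assertBusy-style helper.
final class BusyAssert {
    // Re-runs the check until it stops throwing AssertionError or the deadline
    // passes. Sleeps between attempts with a doubling backoff capped at one
    // second. The last AssertionError is rethrown once the deadline is reached.
    static void assertBusy(Runnable check, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return;
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e;
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 1000);
        }
    }
}
```

A longer timeout only changes how long the helper keeps retrying; a check that passes early still returns immediately.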

@albertzaharovits
Contributor Author

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/7058/console

23:25:34 2> REPRODUCE WITH: ./gradlew :x-pack:plugin:ml:internalClusterTest -Dtests.seed=9EEB9D5D3C6970C7 -Dtests.class=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT -Dtests.security.manager=true -Dtests.locale=en-US -Dtests.timezone=UTC -Dcompiler.java=11 -Druntime.java=8
23:25:34 > at java.lang.Thread.run(Thread.java:748)
23:25:34 > 64) Thread[id=742, name=elasticsearch[node_t0][search][T#3], state=WAITING, group=TGRP-MlDistributedFailureIT]
23:25:34 > at sun.misc.Unsafe.park(Native Method)
23:25:34 2> REPRODUCE WITH: ./gradlew :x-pack:plugin:ml:internalClusterTest -Dtests.seed=9EEB9D5D3C6970C7 -Dtests.class=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT -Dtests.security.manager=true -Dtests.locale=en-US -Dtests.timezone=UTC -Dcompiler.java=11 -Druntime.java=8

@elasticmachine run elasticsearch-ci/2

Member

@jaymode jaymode left a comment

LGTM

@albertzaharovits albertzaharovits merged commit 6f553ab into elastic:6.x Feb 4, 2019
@albertzaharovits albertzaharovits deleted the fix-index-audit-trail-upgrade-take-2 branch February 4, 2019 23:05
albertzaharovits added a commit to albertzaharovits/elasticsearch that referenced this pull request Feb 4, 2019
albertzaharovits added a commit that referenced this pull request Feb 5, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 8, 2019
* 6.6: (121 commits)
  [DOCS] Add warning about bypassing ML PUT APIs (elastic#38608)
  fix dissect doc "ip" --> "clientip" (elastic#38512)
  bad formatted JSON object (elastic#38515)
  SQL: Fix issue with IN not resolving to underlying keyword field (elastic#38440)
  Update ilm-api.asciidoc, point to REMOVE policy (elastic#38235)
  Backport changes to the release notes script. (elastic#38347)
  Change the milliseconds precision to 3 digits for intervals. (elastic#38297)
  SecuritySettingsSource license.self_generated: trial (elastic#38233) (elastic#38398)
  Fix IndexAuditTrail rolling upgrade on rollover edge 2 (elastic#38286) (elastic#38381)
  Cleanup construction of interceptors (elastic#38388)
  Skip unsupported languages for tests (elastic#38328) (elastic#38385)
  [ILM][TEST] increase assertBusy timeout (elastic#36864) (elastic#38354)
  Docs: Drop inline callout from scroll example (elastic#38340) (elastic#38365)
  Preserve ILM operation mode when creating new lifecycles (elastic#38134) (elastic#38230)
  [ML] Add explanation so far to file structure finder exceptions (elastic#38337)
  ML: Fix error race condition on stop _all datafeeds and close _all jobs (elastic#38113) (elastic#38211) (elastic#38222)
  SQL: Generate relevant error message when grouping functions are not used in GROUP BY (elastic#38017)
  Fix NPE in Logfile Audit Filter (elastic#38120) (elastic#38273)
  Enable trace log in FollowerFailOverIT (elastic#38148)
  Replace awaitBusy with assertBusy in atLeastDocsIndexed (elastic#38190)
  ...