
Reindex search resiliency prototype #43187

Conversation

@henningandersen (Contributor) commented Jun 13, 2019

This PR contains a prototype for ensuring that reindex searching can survive data node restarts and/or related network issues.

I primarily created this PR to collaborate internally.

Most tests were only modified to compile.

ReindexResilientSearchIT checks that resiliency works.

A large part of the change moves the retry handling into ScrollableHitSource instead of the two subclasses. This ensures retries are handled the same way in both cases and makes it easier to restart the query from scratch. The subclasses' main responsibility is now to execute one search or scroll request and convert the result to a common format.
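
To make that division of responsibility concrete, here is a schematic sketch with hypothetical names (not the actual Elasticsearch classes): the base class owns the whole retry path, while subclasses execute exactly one request and convert its result.

import java.util.List;
import java.util.function.Consumer;

// Schematic sketch only, with hypothetical names: after this change the base
// class owns retrying, and subclasses only run a single search/scroll request.
abstract class HitSourceSketch {
    record Batch(List<String> hits, long maxSeqNo) {}

    // Base class: a single, shared retry path for local and remote sources.
    final void startNextBatch() {
        executeRequest(
            this::onBatch,
            failure -> restartQueryFromLastSeqNo()); // restart the search from scratch on failure
    }

    // Subclass responsibility: one search/scroll round-trip, converted to a common format.
    protected abstract void executeRequest(Consumer<Batch> onResult, Consumer<Exception> onFailure);

    abstract void onBatch(Batch batch);
    abstract void restartQueryFromLastSeqNo();
}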

Notable questions:

  1. Throttling: it seems desirable to base it on batch size rather than on the number of bulk requests after scripting. This would keep the extra amount added to the scroll timeout the same (though it would change on rethrottle), and it also seems like the right choice for users.
  2. How to clone a SearchRequest? (One possible approach is sketched after this list.)
  3. Is there a good way to add the _seq_no >= retryFromValue condition when there is an existing condition?
  4. The same applies for remote reindex, where we currently just have the bytes and let the remote handle the parsing. And likewise for sorting here.
  5. ThreadContext handling may not be complete, though I would expect transport and threadPool to handle this automatically?
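
As one possible answer to question 2 (an assumption, not the PR's chosen approach): a minimal sketch that clones a SearchRequest by round-tripping it through its transport serialization, assuming this version has the Writeable StreamInput constructor.

import java.io.IOException;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.common.io.stream.StreamInput;

// Sketch only: clone a SearchRequest by serializing and deserializing it.
static SearchRequest cloneSearchRequest(SearchRequest original) throws IOException {
    try (BytesStreamOutput out = new BytesStreamOutput()) {
        original.writeTo(out);
        try (StreamInput in = out.bytes().streamInput()) {
            return new SearchRequest(in); // assumption: this Writeable constructor exists here
        }
    }
}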

PRs to split out:

  1. The test refactorings make sense regardless and save some lines of code.
  2. Changing ScrollableHitSource to pump out data, rather than the more intimate relationship to AbstractAsyncBulkByScrollAction
  3. The change to handle retry in ScrollableHitSource

This commit is just the (first) integTest for validating that reindex
can survive data node restarts.
@henningandersen changed the title from "Reindex search resilient test case" to "Reindex search resiliency prototype" on Jun 14, 2019

ensureGreen("test");
String reindexNode = internalCluster().startCoordinatingOnlyNode(Settings.EMPTY);
NodeClient reindexNodeClient = internalCluster().getInstance(NodeClient.class, reindexNode);
Contributor:

internalCluster().client(reindexNode) should be simpler

Contributor Author:

This is done this way to get the NodeClient, since it has executeLocally, which returns a Task object.
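
A minimal sketch of the point being made (the request setup below is hypothetical, not from this PR's test): executeLocally hands back the running Task, which the plain Client interface does not expose.

NodeClient reindexNodeClient = internalCluster().getInstance(NodeClient.class, reindexNode);
ReindexRequest request = new ReindexRequest();           // hypothetical request setup
request.setSourceIndices("test");
request.setDestIndex("dest");
// executeLocally returns the Task, so the test can wait on or inspect the running reindex
Task task = reindexNodeClient.executeLocally(ReindexAction.INSTANCE, request,
    ActionListener.wrap(response -> {}, exception -> {}));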


private void assertSameDocs(int numberOfDocuments, String... indices) {
refresh(indices);
SearchSourceBuilder sourceBuilder = SearchSourceBuilder.searchSource().size(0)
Contributor:

perhaps just do a count or use a cardinality aggregation on the "data" field? No need for scripting, I think.
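
A sketch of that suggestion, assuming the test documents carry a unique "data" field (an assumption about this test, not confirmed by the diff shown):

// Sketch: count distinct "data" values with a cardinality aggregation,
// avoiding the script the original assertion used.
SearchSourceBuilder sourceBuilder = SearchSourceBuilder.searchSource()
    .size(0)
    .aggregation(AggregationBuilders.cardinality("unique_data").field("data"));
SearchResponse response = client().search(new SearchRequest(indices).source(sourceBuilder)).actionGet();
Cardinality cardinality = response.getAggregations().get("unique_data");
assertEquals(numberOfDocuments, cardinality.getValue());

Note that cardinality is approximate above its precision threshold, so for very large document counts a plain count may be the safer assertion.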

} else if (retryFromRequest.source().query() == null) {
retryFromRequest.source().query(rangeQueryBuilder);
} else {
throw new UnsupportedOperationException("not yet implemented");
Contributor:

perhaps wrap the original query in a BoolQueryBuilder, where the rangeQueryBuilder is added as a filter and the original query as a must clause.
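
A sketch of that suggestion, continuing the else branch from the snippet above (names taken from the diff):

} else {
    // Sketch: wrap the original query instead of throwing. The filter clause
    // (_seq_no >= retryFromValue) restricts matches without affecting scoring.
    QueryBuilder original = retryFromRequest.source().query();
    retryFromRequest.source().query(QueryBuilders.boolQuery()
        .must(original)
        .filter(rangeQueryBuilder));
}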

henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Jul 2, 2019
Refactor ScrollableHitSource to pump data out and have a simplified
interface (callers should no longer call startNextScroll, instead they
simply mark that they are done with the previous result, triggering a
new batch of data). This eases making reindex resilient, since we will
sometimes need to rerun search during retries.

Relates elastic#43187 and elastic#42612
henningandersen added a commit that referenced this pull request Jul 9, 2019
Refactor ScrollableHitSource to pump data out and have a simplified
interface (callers should no longer call startNextScroll, instead they
simply mark that they are done with the previous result, triggering a
new batch of data). This eases making reindex resilient, since we will
sometimes need to rerun search during retries.

Relates #43187 and #42612
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Aug 13, 2019
Local reindex can now survive losing data nodes that contain source
data. The original query will be restarted with a filter for
`_seq_no >= last_seq_no` when a failure is detected.

Part of elastic#42612 and split out from elastic#43187
henningandersen added a commit that referenced this pull request Aug 19, 2019
Local reindex can now survive losing data nodes that contain source
data. The original query will be restarted with a filter for
`_seq_no >= last_seq_no` when a failure is detected.

The original/first search request is not retried/restarted, since we did not
do so previously, and retrying it leads to long waits before learning that a
search request is bad.

Part of #42612 and split out from #43187
@henningandersen (Contributor Author) commented:

This is all covered in split-out PRs, mainly #45497, so closing this.
