
Reindex search resiliency prototype #43187

Conversation

@henningandersen (Contributor) commented Jun 13, 2019

This PR contains a prototype for ensuring that reindex searching can survive data node restarts and/or related network issues.

I primarily created this PR to collaborate internally.

Most tests were only modified to compile.

ReindexResilientSearchIT checks that resiliency works.

A large part of the change moves the retry handling into ScrollableHitSource instead of the two subclasses. This ensures retries are handled the same way in both cases and makes it easier to restart the query from scratch. The subclasses' main responsibility is now to execute one search or scroll request and convert the result to a common format.
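
To make that division of responsibility concrete, here is a schematic sketch with hypothetical names (not the actual Elasticsearch classes): the base class owns the whole retry path, while subclasses execute exactly one request and convert its result.

import java.util.List;
import java.util.function.Consumer;

// Schematic sketch only, with hypothetical names: after this change the base
// class owns retrying, and subclasses only run a single search/scroll request.
abstract class HitSourceSketch {
    record Batch(List<String> hits, long maxSeqNo) {}

    // Base class: a single, shared retry path for local and remote sources.
    final void startNextBatch() {
        executeRequest(
            this::onBatch,
            failure -> restartQueryFromLastSeqNo()); // restart the search from scratch on failure
    }

    // Subclass responsibility: one search/scroll round-trip, converted to a common format.
    protected abstract void executeRequest(Consumer<Batch> onResult, Consumer<Exception> onFailure);

    abstract void onBatch(Batch batch);
    abstract void restartQueryFromLastSeqNo();
}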

Notable questions:

  1. Throttling: it seems desirable to base it on batch size rather than on the number of bulk requests after scripting. This would keep the extra amount added to the scroll timeout the same (though it would change on rethrottle), and it also seems like the right choice for users.
  2. How to clone a SearchRequest? (One possible approach is sketched after this list.)
  3. Is there a good way to add the _seq_no >= retryFromValue condition when there is an existing condition?
  4. The same applies for remote reindex, where we currently just have the bytes and let the remote handle the parsing. And likewise for sorting here.
  5. ThreadContext handling may not be complete, though I would expect transport and threadPool to handle this automatically?
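
As one possible answer to question 2 (an assumption, not the PR's chosen approach): a minimal sketch that clones a SearchRequest by round-tripping it through its transport serialization, assuming this version has the Writeable StreamInput constructor.

import java.io.IOException;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.common.io.stream.StreamInput;

// Sketch only: clone a SearchRequest by serializing and deserializing it.
static SearchRequest cloneSearchRequest(SearchRequest original) throws IOException {
    try (BytesStreamOutput out = new BytesStreamOutput()) {
        original.writeTo(out);
        try (StreamInput in = out.bytes().streamInput()) {
            return new SearchRequest(in); // assumption: this Writeable constructor exists here
        }
    }
}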

PRs to split out:

  1. The test refactorings make sense regardless and save some lines of code.
  2. Changing ScrollableHitSource to pump out data, rather than the more intimate relationship to AbstractAsyncBulkByScrollAction
  3. The change to handle retry in ScrollableHitSource

This commit is just the (first) integTest for validating that reindex
can survive data node restarts.
@henningandersen changed the title from "Reindex search resilient test case" to "Reindex search resiliency prototype" on Jun 14, 2019

ensureGreen("test");
String reindexNode = internalCluster().startCoordinatingOnlyNode(Settings.EMPTY);
NodeClient reindexNodeClient = internalCluster().getInstance(NodeClient.class, reindexNode);
Contributor:

internalCluster().client(reindexNode) should be simpler

Contributor Author:

This is done this way to get the NodeClient, since it has executeLocally, which returns a Task object.
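
A minimal sketch of the point being made (the request setup below is hypothetical, not from this PR's test): executeLocally hands back the running Task, which the plain Client interface does not expose.

NodeClient reindexNodeClient = internalCluster().getInstance(NodeClient.class, reindexNode);
ReindexRequest request = new ReindexRequest();           // hypothetical request setup
request.setSourceIndices("test");
request.setDestIndex("dest");
// executeLocally returns the Task, so the test can wait on or inspect the running reindex
Task task = reindexNodeClient.executeLocally(ReindexAction.INSTANCE, request,
    ActionListener.wrap(response -> {}, exception -> {}));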


private void assertSameDocs(int numberOfDocuments, String... indices) {
refresh(indices);
SearchSourceBuilder sourceBuilder = SearchSourceBuilder.searchSource().size(0)
Contributor:

perhaps just do a count or use a cardinality aggregation on the "data" field? No need for scripting, I think.
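
A sketch of that suggestion, assuming the test documents carry a unique "data" field (an assumption about this test, not confirmed by the diff shown):

// Sketch: count distinct "data" values with a cardinality aggregation,
// avoiding the script the original assertion used.
SearchSourceBuilder sourceBuilder = SearchSourceBuilder.searchSource()
    .size(0)
    .aggregation(AggregationBuilders.cardinality("unique_data").field("data"));
SearchResponse response = client().search(new SearchRequest(indices).source(sourceBuilder)).actionGet();
Cardinality cardinality = response.getAggregations().get("unique_data");
assertEquals(numberOfDocuments, cardinality.getValue());

Note that cardinality is approximate above its precision threshold, so for very large document counts a plain count may be the safer assertion.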

} else if (retryFromRequest.source().query() == null) {
retryFromRequest.source().query(rangeQueryBuilder);
} else {
throw new UnsupportedOperationException("not yet implemented");
Contributor:

perhaps wrap the original query in a BoolQueryBuilder, where the rangeQueryBuilder is added as a filter and the original query as a must clause.
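
A sketch of that suggestion, continuing the else branch from the snippet above (names taken from the diff):

} else {
    // Sketch: wrap the original query instead of throwing. The filter clause
    // (_seq_no >= retryFromValue) restricts matches without affecting scoring.
    QueryBuilder original = retryFromRequest.source().query();
    retryFromRequest.source().query(QueryBuilders.boolQuery()
        .must(original)
        .filter(rangeQueryBuilder));
}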

henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Jul 2, 2019
Refactor ScrollableHitSource to pump data out and have a simplified
interface (callers should no longer call startNextScroll, instead they
simply mark that they are done with the previous result, triggering a
new batch of data). This eases making reindex resilient, since we will
sometimes need to rerun search during retries.

Relates elastic#43187 and elastic#42612
henningandersen added a commit that referenced this pull request Jul 9, 2019
Refactor ScrollableHitSource to pump data out and have a simplified
interface (callers should no longer call startNextScroll, instead they
simply mark that they are done with the previous result, triggering a
new batch of data). This eases making reindex resilient, since we will
sometimes need to rerun search during retries.

Relates #43187 and #42612
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Aug 13, 2019
Local reindex can now survive losing data nodes that contain source
data. The original query will be restarted with a filter for
`_seq_no >= last_seq_no` when a failure is detected.

Part of elastic#42612 and split out from elastic#43187
henningandersen added a commit that referenced this pull request Aug 19, 2019
Local reindex can now survive losing data nodes that contain source
data. The original query will be restarted with a filter for
`_seq_no >= last_seq_no` when a failure is detected.

The original/first search request is not retried/restarted, since we did not
do so previously, and retrying it leads to long waits before learning that a
search request is bad.

Part of #42612 and split out from #43187
@henningandersen (Contributor Author) commented:

This is all covered in split-out PRs, mainly #45497, so closing this.
