Avoid double term construction in DfsPhase #38716

romseygeek · 2019-02-11T12:41:46Z

DfsPhase captures terms used for scoring a query in order to build global term statistics across
multiple shards for more accurate scoring. It currently does this by building the query's Weight and
calling extractTerms on it to collect terms, and then calling IndexSearcher.termStatistics() for each
collected term. This duplicates work, however, as the various Weight implementations will already
have collected these statistics at construction time.

This commit replaces this roundabout way of collecting stats, instead using a delegating
IndexSearcher that collects the term contexts and statistics when IndexSearcher.termStatistics()
is called from the Weight.

As this also applies to rescorers, the signature of RescoreContext.extractTerms has changed slightly.
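To illustrate the idea in the description, here is a minimal sketch of a delegating searcher that records statistics as a side effect of the lookup a weight already performs. Everything in it is a stand-in written for this example: `Searcher`, `RecordingSearcher`, `TermStats`, and `Weight` are hypothetical simplified types, not the Lucene or Elasticsearch classes touched by this PR, and terms are plain strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in types for illustration only: none of these are Lucene or Elasticsearch classes.
class RecordingSearcherSketch {

    /** Hypothetical per-term statistics. */
    static final class TermStats {
        final long docFreq;
        final long totalTermFreq;

        TermStats(long docFreq, long totalTermFreq) {
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    /** Hypothetical searcher that answers statistics lookups from an in-memory map. */
    static class Searcher {
        private final Map<String, TermStats> index;

        Searcher(Map<String, TermStats> index) {
            this.index = index;
        }

        /** Returns null when the term is absent from the index. */
        TermStats termStatistics(String term) {
            return index.get(term);
        }
    }

    /** Delegates to the normal lookup and records every non-null answer. */
    static class RecordingSearcher extends Searcher {
        final Map<String, TermStats> recorded = new LinkedHashMap<>();

        RecordingSearcher(Map<String, TermStats> index) {
            super(index);
        }

        @Override
        TermStats termStatistics(String term) {
            TermStats stats = super.termStatistics(term);
            if (stats != null) {
                recorded.put(term, stats);
            }
            return stats;
        }
    }

    /** Hypothetical weight that pulls its statistics once, at construction time. */
    static final class Weight {
        final TermStats stats;

        Weight(Searcher searcher, String term) {
            this.stats = searcher.termStatistics(term);
        }
    }

    public static void main(String[] args) {
        Map<String, TermStats> index = Map.of("fox", new TermStats(3, 7));
        RecordingSearcher searcher = new RecordingSearcher(index);

        new Weight(searcher, "fox");     // recorded as a side effect of construction
        new Weight(searcher, "absent");  // lookup returns null, nothing is recorded

        System.out.println(searcher.recorded.keySet());
    }
}
```

The point of the sketch is that building the weight is the only step needed: no second pass over extracted terms is required, because the statistics were captured when the weight asked for them.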

elasticmachine · 2019-02-11T12:41:47Z

Pinging @elastic/es-search

romseygeek · 2019-02-11T12:43:32Z

A follow-up to this could be altering DfsResult to store a map of Term to TermStatistics objects, rather than parallel arrays. I'm also wondering why the code uses an ObjectObjectHashMap here, as opposed to just a standard Java HashMap?
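A rough sketch of the difference between the two layouts mentioned above, assuming simplified stand-in types: `TermStats` and both helper methods are hypothetical, terms are plain strings, and nothing here reflects the actual fields of DfsResult.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in types for illustration only; not the Elasticsearch DFS result classes.
class DfsResultSketch {

    /** Hypothetical per-term statistics. */
    static final class TermStats {
        final long docFreq;

        TermStats(long docFreq) {
            this.docFreq = docFreq;
        }
    }

    /** Parallel-array layout: stats[i] belongs to terms[i], so a lookup is a linear scan. */
    static TermStats findInArrays(String[] terms, TermStats[] stats, String term) {
        for (int i = 0; i < terms.length; i++) {
            if (terms[i].equals(term)) {
                return stats[i];
            }
        }
        return null;
    }

    /** Map layout: the pairing is explicit and a lookup is a single hash probe. */
    static Map<String, TermStats> toMap(String[] terms, TermStats[] stats) {
        if (terms.length != stats.length) {
            throw new IllegalArgumentException("terms and stats must have the same length");
        }
        Map<String, TermStats> byTerm = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            byTerm.put(terms[i], stats[i]);
        }
        return byTerm;
    }

    public static void main(String[] args) {
        String[] terms = {"fox", "dog"};
        TermStats[] stats = {new TermStats(3), new TermStats(5)};

        Map<String, TermStats> byTerm = toMap(terms, stats);
        System.out.println(findInArrays(terms, stats, "dog").docFreq == byTerm.get("dog").docFreq);
    }
}
```

With the arrays, nothing but convention keeps the two indexes aligned; the map makes a length mismatch impossible to represent.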

romseygeek · 2019-02-11T16:12:35Z

@elasticmachine run elasticsearch-ci/1

jimczi

I left some comments. While I understand why you did this, I am not sure that the performance gain outweighs the API change required to do it.

jimczi · 2019-02-12T15:31:17Z

.../examples/rescore/src/main/java/org/elasticsearch/example/rescore/ExampleRescoreBuilder.java

@@ -225,7 +223,7 @@ public Explanation explain(int topLevelDocId, IndexSearcher searcher, RescoreCon
        }

        @Override
-        public void extractTerms(IndexSearcher searcher, RescoreContext rescoreContext, Set<Term> termsSet) {
+        public void extractTerms(IndexSearcher searcher, RescoreContext rescoreContext) {


The new signature is a bit weird: the only option is to call createWeight on the searcher, but that is obfuscated, so you need to check an actual implementation to realize it.

jimczi · 2019-02-12T15:43:02Z

server/src/main/java/org/elasticsearch/search/dfs/DfsPhase.java

-                if (fieldStatistics.containsKey(term.field()) == false) {
-                    final CollectionStatistics collectionStatistics = context.searcher().collectionStatistics(term.field());
-                    if (collectionStatistics != null) {
-                        fieldStatistics.put(term.field(), collectionStatistics);


I wonder if this is really equivalent. Some queries are going to build term statistics even though they don't add terms in extractTerms for various reasons (GlobalOrdinalsQuery for instance) so we'll end up with more terms than before. However the good side of this change is that we'd extract only the terms that are used for scoring instead of all terms that are present in the query.
The API change for the rescorer is too obfuscated IMO, but one thing I fully agree with is that we don't need the special map, so we could rely on a plain HashMap to clean up the code a bit.

romseygeek · 2019-02-13

> Some queries are going to build term statistics even though they don't add terms in extractTerms for various reasons (GlobalOrdinalsQuery for instance)

I think this is incorrect - queries will only build term statistics if they actually need them, whereas extractTerms doesn't always track that. TermWeight for example always adds its term, even if it has been created with ScoreMode.COMPLETE_NO_SCORES, while GlobalOrdinalsQuery creates its child weight with COMPLETE_NO_SCORES so no term stats will be pulled here.

I agree about the API change on rescorer though, let me think about a better way to do that.
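A small sketch of the distinction drawn in this reply, under loud assumptions: `ScoreMode`, `Searcher`, and `TermWeight` below are simplified stand-ins that only borrow names from Lucene, and the behaviour shown (statistics pulled only when scores are needed, while `extractTerms` always reports the term) is the behaviour described in the comment, not a copy of Lucene's code.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; the names mirror Lucene's but the code does not.
class ScoreModeSketch {

    /** Hypothetical score mode with the two values discussed in the thread. */
    enum ScoreMode {
        COMPLETE(true),
        COMPLETE_NO_SCORES(false);

        private final boolean needsScores;

        ScoreMode(boolean needsScores) {
            this.needsScores = needsScores;
        }

        boolean needsScores() {
            return needsScores;
        }
    }

    /** Hypothetical searcher that logs which terms had statistics requested. */
    static class Searcher {
        final List<String> statsRequests = new ArrayList<>();

        long termStatistics(String term) {
            statsRequests.add(term);
            return 1L; // placeholder document frequency
        }
    }

    /** Hypothetical term weight: pulls statistics only when it will actually score. */
    static final class TermWeight {
        final String term;
        final Long docFreq; // null when scores are not needed

        TermWeight(Searcher searcher, String term, ScoreMode mode) {
            this.term = term;
            if (mode.needsScores()) {
                this.docFreq = searcher.termStatistics(term);
            } else {
                this.docFreq = null;
            }
        }

        /** Always reports the term, regardless of score mode. */
        void extractTerms(List<String> out) {
            out.add(term);
        }
    }

    public static void main(String[] args) {
        Searcher searcher = new Searcher();
        TermWeight scoring = new TermWeight(searcher, "fox", ScoreMode.COMPLETE);
        TermWeight filterOnly = new TermWeight(searcher, "dog", ScoreMode.COMPLETE_NO_SCORES);

        List<String> extracted = new ArrayList<>();
        scoring.extractTerms(extracted);
        filterOnly.extractTerms(extracted);

        System.out.println("extractTerms saw: " + extracted);
        System.out.println("statistics requested for: " + searcher.statsRequests);
    }
}
```

In this sketch the extracted set contains both terms, while the statistics log contains only the term that is scored, which is the asymmetry the reply points out.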

romseygeek · 2019-02-14T14:34:35Z

I've pushed a change removing extractTerms() entirely, and instead adding List<Query> getQueries() to RescoreContext, returning an empty list by default, and a singleton in QueryRescorer.

extractTerms() was partially broken anyway, as it was only collecting term stats, not field stats.
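The shape of that change might look roughly like the sketch below. It uses stand-in types only: `Query`, `RescoreContext`, and `QueryRescoreContext` here are hypothetical simplified classes, and `collectQueries` is an illustrative helper, not a method from this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in types for illustration only; not the Elasticsearch rescore classes.
class RescoreQueriesSketch {

    /** Hypothetical query, identified by a description. */
    static final class Query {
        final String description;

        Query(String description) {
            this.description = description;
        }
    }

    /** Base context: rescorers that use no queries inherit the empty default. */
    static class RescoreContext {
        List<Query> getQueries() {
            return Collections.emptyList();
        }
    }

    /** Query-based rescorer context: exposes its single query. */
    static final class QueryRescoreContext extends RescoreContext {
        private final Query query;

        QueryRescoreContext(Query query) {
            this.query = query;
        }

        @Override
        List<Query> getQueries() {
            return Collections.singletonList(query);
        }
    }

    /** Gathers every query a caller would need to build weights for. */
    static List<Query> collectQueries(List<RescoreContext> contexts) {
        List<Query> all = new ArrayList<>();
        for (RescoreContext context : contexts) {
            all.addAll(context.getQueries());
        }
        return all;
    }

    public static void main(String[] args) {
        List<RescoreContext> contexts = new ArrayList<>();
        contexts.add(new RescoreContext());
        contexts.add(new QueryRescoreContext(new Query("body:fox")));

        for (Query query : collectQueries(contexts)) {
            System.out.println(query.description);
        }
    }
}
```

Because the context only hands back queries, the caller can treat them exactly like the main query, so term and field statistics are gathered through the same path.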

jimczi

I like the new approach, thanks @romseygeek

* elastic/master:
  * Avoid double term construction in DfsPhase (elastic#38716)
  * Fix typo in DateRange docs (yyy → yyyy) (elastic#38883)
  * Introduced class reuses follow parameter code between ShardFollowTasks (elastic#38910)
  * Ensure random timestamps are within search boundary (elastic#38753)
  * [CI] Muting method testFollowIndex in IndexFollowingIT
  * Update Lucene snapshot repo for 7.0.0-beta1 (elastic#38946)
  * SQL: Doc on syntax (identifiers in particular) (elastic#38662)
  * Upgrade to Gradle 5.2.1 (elastic#38880)
  * Tie break search shard iterator comparisons on cluster alias (elastic#38853)
  * Also mmap cfs files for hybridfs (elastic#38940)
  * Build: Fix issue with test status logging (elastic#38799)
  * Adapt FullClusterRestartIT on master (elastic#38856)
  * Fix testAutoFollowing test to use createLeaderIndex() helper method.
  * Migrate muted auto follow rolling upgrade test and unmute this test (elastic#38900)
  * ShardBulkAction ignore primary response on primary (elastic#38901)
  * Recover peers from translog, ignoring soft deletes (elastic#38904)
  * Fix NPE on Stale Index in IndicesService (elastic#38891)
  * Smarter CCR concurrent file chunk fetching (elastic#38841)
  * Fix intermittent failure in ApiKeyIntegTests (elastic#38627)
  * re-enable SmokeTestWatcherWithSecurityIT (elastic#38814)

DfsPhase captures terms used for scoring a query in order to build global term statistics across multiple shards for more accurate scoring. It currently does this by building the query's `Weight` and calling `extractTerms` on it to collect terms, and then calling `IndexSearcher.termStatistics()` for each collected term. This duplicates work, however, as the various `Weight` implementations will already have collected these statistics at construction time.

This commit replaces this roundabout way of collecting stats, instead using a delegating `IndexSearcher` that collects the term contexts and statistics when `IndexSearcher.termStatistics()` is called from the `Weight`.

It also fixes a bug when using rescorers, where a `QueryRescorer` would calculate distributed term statistics, but ignore field statistics. `Rescorer.extractTerms` has been removed, and replaced with a new method on `RescoreContext` that returns any queries used by the rescore implementation. The delegating `IndexSearcher` then collects term contexts and statistics in the same way described above for each `Query`.

Avoid double term construction in DfsPhase

9d9df4d

romseygeek added the `:Search Relevance/Ranking` (Scoring, rescoring, rank evaluation), `>refactoring`, and `v7.2.0` labels Feb 11, 2019

romseygeek self-assigned this Feb 11, 2019

romseygeek requested a review from jimczi February 11, 2019 12:41

Null checks

d8de607

jimczi reviewed Feb 12, 2019

Remove extractTerms(), add getQueries() to RescoreContext

76f50a5

jimczi approved these changes Feb 15, 2019

Merge remote-tracking branch 'origin/master' into dfs-termstats

3c392fb

romseygeek merged commit 38d2935 into elastic:master Feb 15, 2019

romseygeek commented Feb 11, 2019

elasticmachine commented Feb 11, 2019

romseygeek commented Feb 11, 2019

romseygeek commented Feb 11, 2019

jimczi left a comment

jimczi Feb 12, 2019

jimczi Feb 12, 2019

romseygeek Feb 13, 2019

romseygeek commented Feb 14, 2019

jimczi left a comment
