Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double term construction in DfsPhase #38716

Merged
merged 4 commits into from
Feb 15, 2019

Conversation

romseygeek
Copy link
Contributor

DfsPhase captures terms used for scoring a query in order to build global term statistics across
multiple shards for more accurate scoring. It currently does this by building the query's Weight and
calling extractTerms on it to collect terms, and then calling IndexSearcher.termStatistics() for each
collected term. This duplicates work, however, as the various Weight implementations will already
have collected these statistics at construction time.

This commit replaces this round-about way of collecting stats, instead using a delegating
IndexSearcher that collects the term contexts and statistics when IndexSearcher.termStatistics()
is called from the Weight.

As this also applies to rescorers, the signature of RescoreContext.extractTerms has changed slightly.

@romseygeek romseygeek added :Search Relevance/Ranking Scoring, rescoring, rank evaluation. >refactoring v7.2.0 labels Feb 11, 2019
@romseygeek romseygeek self-assigned this Feb 11, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@romseygeek
Copy link
Contributor Author

A follow up to this could be altering DfsResult to store a map of Term to TermStatistics objects, rather than parallel arrays. I'm also wondering why the code uses an ObjectObjectHashMap here, as opposed to just a standard Java HashMap?

@romseygeek
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/1

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, while I understand why you did that I am not sure that the gain in performance outweigh the API change required to do it.

@@ -225,7 +223,7 @@ public Explanation explain(int topLevelDocId, IndexSearcher searcher, RescoreCon
}

@Override
public void extractTerms(IndexSearcher searcher, RescoreContext rescoreContext, Set<Term> termsSet) {
public void extractTerms(IndexSearcher searcher, RescoreContext rescoreContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new signature is a bit weird, the only option is to call createWeight on the searcher but it's obfuscated so you need to check an actual implementation to realize that.

if (fieldStatistics.containsKey(term.field()) == false) {
final CollectionStatistics collectionStatistics = context.searcher().collectionStatistics(term.field());
if (collectionStatistics != null) {
fieldStatistics.put(term.field(), collectionStatistics);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is really equivalent. Some queries are going to build term statistics even though they don't add terms in extractTerms for various reasons (GlobalOrdinalsQuery for instance) so we'll end up with more terms than before. However the good side of this change is that we'd extract only the terms that are used for scoring instead of all terms that are present in the query.
The API change for the rescorer is too obfuscated IMO but one thing I fully agree with is that we don't need the special map so we could rely on plain HashMap to cleanup the code a bit ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some queries are going to build term statistics even though they don't add terms in extractTerms for various reasons (GlobalOrdinalsQuery for instance)

I think this is incorrect - queries will only build term statistics if they actually need them, whereas extractTerms doesn't always track that. TermWeight for example always adds its term, even if it has been created with ScoreMode.COMPLETE_NO_SCORES, while GlobalOrdinalsQuery creates its child weight with COMPLETE_NO_SCORES so no term stats will be pulled here.

I agree about the API change on rescorer though, let me think about a better way to do that.

@romseygeek
Copy link
Contributor Author

I've pushed a change removing extractTerms() entirely, and instead adding List<Query> getQueries() to RescoreContext, returning an empty list by default, and a singleton in QueryRescorer.

extractTerms() was partially broken anyway, as it was only collecting term stats, not field stats.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new approach , thanks @romseygeek

@romseygeek romseygeek merged commit 38d2935 into elastic:master Feb 15, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 15, 2019
* elastic/master:
  Avoid double term construction in DfsPhase (elastic#38716)
  Fix typo in DateRange docs (yyy → yyyy) (elastic#38883)
  Introduced class reuses follow parameter code between ShardFollowTasks (elastic#38910)
  Ensure random timestamps are within search boundary (elastic#38753)
  [CI] Muting  method testFollowIndex in IndexFollowingIT
  Update Lucene snapshot repo for 7.0.0-beta1 (elastic#38946)
  SQL: Doc on syntax (identifiers in particular) (elastic#38662)
  Upgrade to Gradle 5.2.1 (elastic#38880)
  Tie break search shard iterator comparisons on cluster alias (elastic#38853)
  Also mmap cfs files for hybridfs (elastic#38940)
  Build: Fix issue with test status logging (elastic#38799)
  Adapt FullClusterRestartIT on master (elastic#38856)
  Fix testAutoFollowing test to use createLeaderIndex() helper method.
  Migrate muted auto follow rolling upgrade test and unmute this test (elastic#38900)
  ShardBulkAction ignore primary response on primary (elastic#38901)
  Recover peers from translog, ignoring soft deletes (elastic#38904)
  Fix NPE on Stale Index in IndicesService (elastic#38891)
  Smarter CCR concurrent file chunk fetching (elastic#38841)
  Fix intermittent failure in ApiKeyIntegTests (elastic#38627)
  re-enable SmokeTestWatcherWithSecurityIT (elastic#38814)
romseygeek added a commit that referenced this pull request Feb 15, 2019
DfsPhase captures terms used for scoring a query in order to build global term statistics across
multiple shards for more accurate scoring. It currently does this by building the query's `Weight`
and calling `extractTerms` on it to collect terms, and then calling `IndexSearcher.termStatistics()`
for each collected term. This duplicates work, however, as the various `Weight` implementations 
will already have collected these statistics at construction time.

This commit replaces this round-about way of collecting stats, instead using a delegating
IndexSearcher that collects the term contexts and statistics when `IndexSearcher.termStatistics()`
is called from the Weight.

It also fixes a bug when using rescorers, where a `QueryRescorer` would calculate distributed term
statistics, but ignore field statistics.  `Rescorer.extractTerms` has been removed, and replaced with
a new method on `RescoreContext` that returns any queries used by the rescore implementation.
The delegating IndexSearcher then collects term contexts and statistics in the same way described
above for each Query.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>refactoring :Search Relevance/Ranking Scoring, rescoring, rank evaluation. v7.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants