[Concurrent Segment Search] SignificantTerms agg should not gather bg_count for each slice #8703

Closed
jed326 opened this issue Jul 14, 2023 · 2 comments · Fixed by #8735


jed326 commented Jul 14, 2023

Subtask as part of #8509

To Reproduce

./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.aggregations.bucket.SignificantTermsSignificanceScoreIT.testXContentResponse" -Dtests.seed=6D7B107FC7821E62 -Dtests.locale=mk -Dtests.timezone=America/Atka -Druntime.java=17


This test is failing due to an unexpected result in the query response:
Expected:

{
    "class": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [{
            "key": "0",
            "doc_count": 4,
            "sig_terms": {
                "doc_count": 4,
                "bg_count": 7,
                "buckets": [{
                    "key": 0,
                    "doc_count": 4,
                    "score": 0.39999999999999997,
                    "bg_count": 5
                }]
            }
        }, {
            "key": "1",
            "doc_count": 3,
            "sig_terms": {
                "doc_count": 3,
                "bg_count": 7,
                "buckets": [{
                    "key": 1,
                    "doc_count": 3,
                    "score": 0.75,
                    "bg_count": 4
                }]
            }
        }]
    }
}

Actual:

{
    "class": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [{
            "key": "0",
            "doc_count": 4,
            "sig_terms": {
                "doc_count": 4,
                "bg_count": 28,
                "buckets": [{
                    "key": 0,
                    "doc_count": 4,
                    "score": 0.39999999999999997,
                    "bg_count": 20
                }]
            }
        }, {
            "key": "1",
            "doc_count": 3,
            "sig_terms": {
                "doc_count": 3,
                "bg_count": 21,
                "buckets": [{
                    "key": 1,
                    "doc_count": 3,
                    "score": 0.75,
                    "bg_count": 12
                }]
            }
        }]
    }
}

Specifically, the bg_count is wrong in the concurrent search case. In a significant_terms aggregation, bg_count is the number of docs with the term in the background set, while doc_count is the number of docs with the term in the foreground set.

So at the top level, the background set is all docs in the index with the “class” field, and since no background filter is set the bg_count should simply be the number of docs in the index (7 for the test index shown below). Code ref:

supersetNumDocs = backgroundFilter == null ? searcher.getIndexReader().maxDoc() : searcher.count(this.backgroundFilter);

Background Set == All documents in the index with “class” field
At the inner level, the bg_count therefore refers to all of the documents in the background set with the given key in the “text” field. See how the index is created here:
public static void index01Docs(String type, String settings, OpenSearchIntegTestCase testCase) throws ExecutionException,
    InterruptedException {
    String textMappings = "type=" + type;
    if (type.equals("text")) {
        textMappings += ",fielddata=true";
    }
    assertAcked(
        testCase.prepareCreate(INDEX_NAME)
            .setSettings(settings, XContentType.JSON)
            .setMapping("text", textMappings, CLASS_FIELD, "type=keyword")
    );
    String[] gb = { "0", "1" };
    List<IndexRequestBuilder> indexRequestBuilderList = new ArrayList<>();
    indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME).setId("1").setSource(TEXT_FIELD, "1", CLASS_FIELD, "1"));
    indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME).setId("2").setSource(TEXT_FIELD, "1", CLASS_FIELD, "1"));
    indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME).setId("3").setSource(TEXT_FIELD, "0", CLASS_FIELD, "0"));
    indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME).setId("4").setSource(TEXT_FIELD, "0", CLASS_FIELD, "0"));
    indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME).setId("5").setSource(TEXT_FIELD, gb, CLASS_FIELD, "1"));
    indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME).setId("6").setSource(TEXT_FIELD, gb, CLASS_FIELD, "0"));
    indexRequestBuilderList.add(client().prepareIndex(INDEX_NAME).setId("7").setSource(TEXT_FIELD, "0", CLASS_FIELD, "0"));
    testCase.indexRandom(true, false, indexRequestBuilderList);
}

So for key “0” this is 5 (docs 3, 4, 5, 6, and 7), and for key “1” this is 4 (docs 1, 2, 5, and 6). This value is retrieved by querying the index from the search thread. Code ref:


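For intuition, here is a hypothetical Lucene-level sketch of what that per-term background count amounts to; the helper method and the standalone IndexSearcher usage are illustrative only, not the actual OpenSearch code path:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Illustrative sketch only: the inner-level bg_count for a key is conceptually a
// per-term count over the background set, which here is the whole index. With the
// 7-doc test index above, key "0" yields 5 and key "1" yields 4.
static long backgroundCountForTerm(IndexSearcher searcher, String key) throws java.io.IOException {
    return searcher.count(new TermQuery(new Term("text", key)));
}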
Just to crystallize the meanings of the background and foreground sets, let’s break down the example in the Elasticsearch docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_single_set_analysis

Here the outer level bg_count refers to the number of documents with the field crime_type in the background set, which is the entire index. The query is for force == British Transport Police, so doc_count represents the number of documents where the force field matches British Transport Police, also known as our foreground set.
Background Set == Number of crimes in the index
Foreground Set == Number of crimes where force == British Transport Police

So now looking at the inner level, bg_count is going to be the number of documents where crime_type == Bicycle theft in the Background set, while doc_count is going to be the number of documents matching the same condition in the Foreground set.

References:


During the buildAggregations phase, the bg_count is gathered per collector. In the non-concurrent search case, that happens only once per shard, and the coordinator-level reduce can simply sum up the bg_count across all of the shards to get the total. However, in the concurrent search case we have a collector per segment slice, and these get summed up at the shard-level reduce, so we end up with a bg_count of SEGMENT_SLICE_COUNT * BG_COUNT.

A small caveat here is that this only happens for slices that actually have docs for the bucket, so in our test example, if a slice does not have any documents with “text” field “0” then that collector will not contribute to the inflated bg_count.
Here is an example to illustrate this problem.
In a test run, if the slice-level buckets look like this:

0 = {StringTerms@13132} "{"class":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[{"key":"1","doc_count":2,"sig_terms":{"doc_count":2,"bg_count":7,"buckets":[]}}]}}"
1 = {StringTerms@13133} "{"class":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[{"key":"0","doc_count":1,"sig_terms":{"doc_count":1,"bg_count":7,"buckets":[]}},{"key":"1","doc_count":1,"sig_terms":{"doc_count":1,"bg_count":7,"buckets":[]}}]}}"
2 = {StringTerms@13134} "{"class":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[{"key":"0","doc_count":2,"sig_terms":{"doc_count":2,"bg_count":7,"buckets":[]}}]}}"
3 = {StringTerms@13135} "{"class":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[{"key":"0","doc_count":1,"sig_terms":{"doc_count":1,"bg_count":7,"buckets":[]}}]}}"

Then there are 5 buckets total: 2 for key “1” and 3 for key “0”. So if you look at the results below:

Expected:

{
    "class": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [{
            "key": "0",
            "doc_count": 4,
            "sig_terms": {
                "doc_count": 4,
                "bg_count": 7,
                "buckets": [{
                    "key": 0,
                    "doc_count": 4,
                    "score": 0.39999999999999997,
                    "bg_count": 5
                }]
            }
        }, {
            "key": "1",
            "doc_count": 3,
            "sig_terms": {
                "doc_count": 3,
                "bg_count": 7,
                "buckets": [{
                    "key": 1,
                    "doc_count": 3,
                    "score": 0.75,
                    "bg_count": 4
                }]
            }
        }]
    }
}

Actual:

{
    "class": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [{
            "key": "0",
            "doc_count": 4,
            "sig_terms": {
                "doc_count": 4,
                "bg_count": 21,
                "buckets": [{
                    "key": 0,
                    "doc_count": 4,
                    "score": 0.39999999999999997,
                    "bg_count": 15
                }]
            }
        }, {
            "key": "1",
            "doc_count": 3,
            "sig_terms": {
                "doc_count": 3,
                "bg_count": 14,
                "buckets": [{
                    "key": 1,
                    "doc_count": 3,
                    "score": 0.75,
                    "bg_count": 8
                }]
            }
        }]
    }
}

We see that key “1” has 2x the expected count (2 slices × 7 = 14 at the outer level and 2 × 4 = 8 at the inner level) while key “0” has 3x (3 × 7 = 21 and 3 × 5 = 15).

Single set case for reference:

    "aggregations": {
        "sig_terms": {
            "doc_count": 4,
            "bg_count": 21,
            "buckets": [{
                "key": 0,
                "doc_count": 4,
                "score": 0.39999999999999997,
                "bg_count": 15
            }]
        }
    }

Solutions

Without a deeper understanding of the scoring mechanism, it would be best for a solution to fix the counts (and therefore the scores) at the slice level rather than at the coordinator level.

I have added a few possible solutions below; solution 3 looks the most promising to me, and I will be working on a PR for it.

1. Divide by the bucket count

Due to the caveat above, in sparse data cases a bucket that is not collected in a segment slice won’t contribute to the bg_count. That means we can’t just blindly divide these numbers by the slice count; we would need some way to determine, per bucket, how many collectors actually contributed in order to divide the bg_count correctly, as sketched below.
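To make the idea concrete, here is a minimal hypothetical sketch; correctedBgCount and slicesThatCollectedBucket are made-up names, not existing OpenSearch code:

// Hypothetical sketch of solution 1: the summed value can only be divided back down
// if we know how many slices actually collected this particular bucket. This assumes
// every contributing slice reported the same shard-wide background count.
static long correctedBgCount(long summedBgCount, int slicesThatCollectedBucket) {
    return summedBgCount / slicesThatCollectedBucket;
}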

2. Separate out the shard-level vs slice-level reduce logic

Slice level reduce:

// using reduce is fine here instead of topLevelReduce as pipeline aggregation is evaluated on the coordinator after all
// documents are collected across shards for an aggregation
return new AggregationReduceableSearchResult(
    InternalAggregations.reduce(Collections.singletonList(internalAggregations), context.partial())
);

Add a flag to reduce context and then don’t sum up supersetDf here:

@Override
protected B reduceBucket(List<B> buckets, ReduceContext context) {
    assert buckets.size() > 0;
    long subsetDf = 0;
    long supersetDf = 0;
    List<InternalAggregations> aggregationsList = new ArrayList<>(buckets.size());
    for (B bucket : buckets) {
        subsetDf += bucket.subsetDf;
        supersetDf += bucket.supersetDf;
        aggregationsList.add(bucket.aggregations);
    }
    InternalAggregations aggs = InternalAggregations.reduce(aggregationsList, context);
    return createBucket(subsetDf, buckets.get(0).subsetSize, supersetDf, buckets.get(0).supersetSize, aggs, buckets.get(0));
}

Nor supersetSize here:
for (InternalAggregation aggregation : aggregations) {
    @SuppressWarnings("unchecked")
    InternalSignificantTerms<A, B> terms = (InternalSignificantTerms<A, B>) aggregation;
    globalSubsetSize += terms.getSubsetSize();
    globalSupersetSize += terms.getSupersetSize();
}
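To illustrate, here is a rough sketch of how such a flag could gate the loop above. isSliceLevelReduce() is a hypothetical method on ReduceContext, and reduceContext stands in for whatever ReduceContext variable is in scope:

for (InternalAggregation aggregation : aggregations) {
    @SuppressWarnings("unchecked")
    InternalSignificantTerms<A, B> terms = (InternalSignificantTerms<A, B>) aggregation;
    // The foreground (subset) sizes differ per slice, so they are always summed.
    globalSubsetSize += terms.getSubsetSize();
    if (reduceContext.isSliceLevelReduce()) { // hypothetical flag
        // Every slice reports the same shard-wide superset size, so don't sum it.
        globalSupersetSize = terms.getSupersetSize();
    } else {
        globalSupersetSize += terms.getSupersetSize();
    }
}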

3. Create a singleton SignificanceLookup that will only return backgroundFrequencies and supersetSize once per bucket

This seems like the best solution to me because the summed bg_count will already be correct going into the aggs-level reduce, so we won't need to modify that. Moreover, we're currently doing up to 2 * NUM_SLICES * NUM_BUCKETS extra queries to get the bg_count for every single bucket at the slice level, so this singleton solution would eliminate all but up to 2 of those extra queries.

Currently I am thinking we can use searchContext.isConcurrentSegmentSearchEnabled() here:

SignificanceLookup lookup = new SignificanceLookup(
    queryShardContext,
    config.fieldContext().fieldType(),
    config.format(),
    backgroundFilter
);

and then further use the queryShardContext to determine when to reset the singleton.

Note that we will most likely also need to do this in SignificantTextAggregatorFactory here:

SignificanceLookup lookup = new SignificanceLookup(queryShardContext, fieldType, DocValueFormat.RAW, backgroundFilter);

However, I'm seeing codec assertion failures in that execution path, as mentioned in #8509 (comment). Nevertheless, the same solution should be applicable in both cases.

I still need to hash out some of the finer details here, such as how to reset the singleton class per bucket; however, conceptually this solution is both the most intuitive and the least intrusive.
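To make the singleton idea concrete, here is a minimal, self-contained sketch of the "return the background count only once per term" behavior solution 3 describes. OncePerTermBackgroundCounts and backgroundCountQuery are hypothetical names; the real change would live in or around SignificanceLookup:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

// Hypothetical sketch, not OpenSearch API: one instance shared by all slice-level
// collectors on a shard. The first caller for a given term pays for the background
// query and gets the real count; later callers get 0, so summing the per-slice values
// at the shard-level reduce yields the correct bg_count without extra queries.
final class OncePerTermBackgroundCounts<T> {
    private final Set<T> alreadyReturned = ConcurrentHashMap.newKeySet();
    private final ToLongFunction<T> backgroundCountQuery; // stand-in for the real index query

    OncePerTermBackgroundCounts(ToLongFunction<T> backgroundCountQuery) {
        this.backgroundCountQuery = backgroundCountQuery;
    }

    long backgroundCount(T term) {
        // Set.add is atomic on a concurrent set, so exactly one slice collector
        // observes "true" per term even under concurrent calls.
        return alreadyReturned.add(term) ? backgroundCountQuery.applyAsLong(term) : 0L;
    }
}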


jed326 commented Jul 14, 2023

Tagging @reta @sohami to see if you have any additional thoughts while I work on a PR for solution 3


jed326 commented Jul 17, 2023

After taking a look at this, I think solution 2 is actually much more straightforward. ReduceContext already has an isFinalReduce() method that is true for the coordinator-level reduce. We can use that method in the places mentioned above and leave the slice-level query optimization for later. Will open a PR with this implementation shortly.
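For illustration, a rough sketch of how isFinalReduce() could gate the summation in the reduceBucket method shown above (illustrative only; the actual change in the PR may differ):

@Override
protected B reduceBucket(List<B> buckets, ReduceContext context) {
    assert buckets.size() > 0;
    long subsetDf = 0;
    long supersetDf = 0;
    List<InternalAggregations> aggregationsList = new ArrayList<>(buckets.size());
    for (B bucket : buckets) {
        subsetDf += bucket.subsetDf;
        if (context.isFinalReduce()) {
            // Coordinator-level reduce: each shard contributes its own background count.
            supersetDf += bucket.supersetDf;
        } else {
            // Slice-level reduce: every slice reports the same shard-wide background
            // count, so summing would multiply it by the slice count.
            supersetDf = bucket.supersetDf;
        }
        aggregationsList.add(bucket.aggregations);
    }
    InternalAggregations aggs = InternalAggregations.reduce(aggregationsList, context);
    return createBucket(subsetDf, buckets.get(0).subsetSize, supersetDf, buckets.get(0).supersetSize, aggs, buckets.get(0));
}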
