Reduce Heap Usage of OnHeapStringDictionary #12223

Merged
merged 4 commits into from
Jan 25, 2024


vvivekiyer
Contributor

@vvivekiyer vvivekiyer commented Jan 5, 2024

This PR corresponds to issue 12078

  • Interning is disabled by default. It can be enabled by adding the following config:

  "fieldConfigList": [
    {
      "name": "dimInt",
      "encodingType": "DICTIONARY",
      "indexTypes": [],
      "indexes": {
        "dictionary": {
          "onHeap": true,
          "useVarLengthDictionary": true,
          "intern": {
            "capacity": 32000000
          }
        }
      },
      "tierOverwrites": null
    }
  ]

Heap Usage Before Interning
image

Heap Usage After Interning
image
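
For readers unfamiliar with interning, here is a minimal sketch of the general idea: identical strings loaded by different dictionaries are collapsed to one shared instance, so only one copy stays on the heap. The class below is hypothetical and written for illustration; it is not the FALFInterner added in this PR, and it assumes that "capacity" means the number of slots in a fixed-size table. It is also not synchronized.

```java
/**
 * Hypothetical fixed-capacity interner, for illustration only.
 * Assumption: "capacity" is the number of slots in a fixed-size table.
 * Not thread-safe, and not the FALFInterner class touched by this PR.
 */
final class FixedCapacityInterner<T> {
  private final Object[] _slots;

  FixedCapacityInterner(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    _slots = new Object[capacity];
  }

  /** Returns a previously seen equal instance if one is cached, else caches and returns the argument. */
  @SuppressWarnings("unchecked")
  T intern(T value) {
    if (value == null) {
      return null;
    }
    // floorMod keeps the index non-negative even for negative hash codes.
    int index = Math.floorMod(value.hashCode(), _slots.length);
    Object cached = _slots[index];
    if (value.equals(cached)) {
      // Hit: hand back the shared instance so the duplicate can be garbage collected.
      return (T) cached;
    }
    // Miss or hash collision: the newest value takes over the slot.
    _slots[index] = value;
    return value;
  }
}
```

Because the table never grows, memory used by the interner itself is bounded by the capacity; the trade-off in this sketch is that colliding values evict each other, so deduplication is best effort.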

@codecov-commenter

codecov-commenter commented Jan 5, 2024

Codecov Report

Attention: 32 lines in your changes are missing coverage. Please review.

Comparison is base (8f5fa80) 61.50% compared to head (d8af5e2) 61.57%.
Report is 15 commits behind head on master.

Files Patch % Lines
...java/org/apache/pinot/spi/config/table/Intern.java 31.57% 11 Missing and 2 partials ⚠️
.../segment/index/dictionary/DictionaryIndexType.java 52.17% 8 Missing and 3 partials ⚠️
...pinot/segment/spi/index/DictionaryIndexConfig.java 66.66% 3 Missing and 1 partial ⚠️
...ent/index/dictionary/DictionaryInternerHolder.java 77.77% 2 Missing ⚠️
.../segment/index/readers/OnHeapStringDictionary.java 83.33% 0 Missing and 1 partial ⚠️
.../java/org/apache/pinot/spi/utils/FALFInterner.java 96.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12223      +/-   ##
============================================
+ Coverage     61.50%   61.57%   +0.06%     
- Complexity      207     1151     +944     
============================================
  Files          2416     2419       +3     
  Lines        131179   131295     +116     
  Branches      20246    20266      +20     
============================================
+ Hits          80686    80840     +154     
+ Misses        44595    44548      -47     
- Partials       5898     5907       +9     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.51% <65.95%> (+0.04%) ⬆️
java-21 61.44% <65.95%> (+0.06%) ⬆️
skip-bytebuffers-false 61.56% <65.95%> (+0.06%) ⬆️
skip-bytebuffers-true 61.40% <65.95%> (+0.04%) ⬆️
temurin 61.57% <65.95%> (+0.06%) ⬆️
unittests 61.56% <65.95%> (+0.06%) ⬆️
unittests1 46.61% <40.42%> (+0.01%) ⬆️
unittests2 27.74% <25.53%> (+0.04%) ⬆️


@vvivekiyer vvivekiyer marked this pull request as ready for review January 9, 2024 07:31
@gortiz
Contributor

gortiz commented Jan 9, 2024

This is a very nice feature! Nice contribution @vvivekiyer!

I have some comments to add:

The concept itself

I understand that the issue was detected in on-heap dictionaries, but I think this approach could also be used in off-heap dictionaries. Specifically, interning could be applied when the value is read from the dictionary.
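
As a rough sketch of what interning on read could look like, assuming a dictionary whose values live in a direct ByteBuffer: the reader decodes the bytes for a dictionary id and then passes the result through a shared pool before returning it. Everything here (the class name, the offsets layout, the unbounded ConcurrentHashMap pool) is hypothetical and only meant to show the shape of the idea; a bounded interner would normally take the place of the map.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical reader over off-heap values; not a Pinot class. */
final class InterningOffHeapReader {
  private final ByteBuffer _buffer;   // direct buffer standing in for off-heap storage
  private final int[] _offsets;       // value i occupies bytes [_offsets[i], _offsets[i + 1])
  private final ConcurrentHashMap<String, String> _pool; // shared across readers (assumption)

  InterningOffHeapReader(String[] values, ConcurrentHashMap<String, String> pool) {
    byte[][] encoded = new byte[values.length][];
    _offsets = new int[values.length + 1];
    for (int i = 0; i < values.length; i++) {
      encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
      _offsets[i + 1] = _offsets[i] + encoded[i].length;
    }
    _buffer = ByteBuffer.allocateDirect(_offsets[values.length]);
    for (byte[] bytes : encoded) {
      _buffer.put(bytes);
    }
    _pool = pool;
  }

  /** Decodes the value for dictId and returns the pooled instance of it. */
  String getStringValue(int dictId) {
    int start = _offsets[dictId];
    byte[] bytes = new byte[_offsets[dictId + 1] - start];
    // Read through a duplicate so the shared buffer's position is never mutated.
    ByteBuffer view = _buffer.duplicate();
    view.position(start);
    view.get(bytes);
    String decoded = new String(bytes, StandardCharsets.UTF_8);
    // putIfAbsent returns the existing pooled instance, or null if decoded was just added.
    String existing = _pool.putIfAbsent(decoded, decoded);
    return existing != null ? existing : decoded;
  }
}
```

Note that each read in this sketch still allocates a temporary string; the saving would come from callers that retain the returned values ending up with one shared instance per distinct value.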

The config

I find this config too repetitive.

Instead of:

{
  "indexes": {
    "dictionary": {
      "onHeap": true,
      "useVarLengthDictionary": true,
      "onHeapConfig": {
        "enableInterning": true,
        "internerCapacity": 32000000
      }
    }
  }
}

I would suggest something like:

{
  "indexes": {
    "dictionary": {
      "onHeap": true,
      "useVarLengthDictionary": true,
      "intern": {
        "capacity": 32000000
      }
    }
  }
}

With an optional implicit field intern.disabled.

I would use this approach even if we do not support interning in offheap dictionaries (which is something we may decide to change in future). This approach is simpler and easier to read.
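
To make the suggested shape concrete, here is a small sketch of how such an "intern" block might be modelled on the Java side. It assumes that "implicit" means the disabled flag may be omitted and then defaults to false, so the presence of an "intern" block alone turns the feature on. The class name and validation are hypothetical and do not describe the Intern class in this PR.

```java
/**
 * Hypothetical model of the suggested "intern" block, for illustration only.
 * Assumption: an omitted "disabled" field (null here) means interning is enabled.
 */
final class InternConfigSketch {
  private final boolean _disabled;
  private final int _capacity;

  InternConfigSketch(Boolean disabled, int capacity) {
    // A missing flag is treated the same as "disabled": false.
    _disabled = disabled != null && disabled;
    // Capacity only has to be valid when interning is actually enabled.
    if (!_disabled && capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive when interning is enabled");
    }
    _capacity = capacity;
  }

  boolean isDisabled() {
    return _disabled;
  }

  int getCapacity() {
    return _capacity;
  }
}
```

With this shape, `{"intern": {"capacity": 32000000}}` is enough to enable the feature, and `"disabled": true` lets a user keep the block in place while switching it off.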

Support in older syntax

As said in the comments, I recommend against adding new features in the old syntax (in this case, in indexingConfig).

For compatibility reasons index-spi had to support all features that were supported in the old syntax, but we don't have to add support for new features in that syntax.

Users can migrate to the new syntax if they want to use new features, and the translation can even be done automatically. It's not that we want to force people onto the new syntax, but the index config logic is already too complex, and having two ways to configure each new feature would make it even more so.

@gortiz
Contributor

gortiz commented Jan 12, 2024

Approved, but it would be great to add a JMH benchmark. I would expect that when two segments have dictionaries enabled on large strings, a query that groups by those strings should be considerably faster.

@siddharthteotia
Contributor

siddharthteotia commented Jan 23, 2024

I haven't gone through all the tests yet, so I'm asking whether we have covered the following:

  • Correctness test for writing and reading from a Dictionary with this feature enabled
  • Enabling this on an existing table via reloading the dictionary
  • Disabling this on an existing table via reloading the dictionary

@siddharthteotia
Contributor

(nit) Please pretty-print the sample config used in the PR description for readability.

@siddharthteotia
Contributor

siddharthteotia commented Jan 23, 2024

How does the concurrency aspect come into the picture during query processing and segment reloads?

@siddharthteotia
Contributor

@somandal - can you also help take a look if you get a chance?

@siddharthteotia siddharthteotia merged commit 76d0eb2 into apache:master Jan 25, 2024
19 checks passed
@Jackie-Jiang Jackie-Jiang added the documentation, release-notes, and Configuration labels Feb 22, 2024
Labels
Configuration, documentation, enhancement, performance, release-notes
5 participants