Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persistent hash index performance improvements #2908

Merged
merged 3 commits into from
Feb 20, 2024
Merged

Conversation

benjaminwinger
Copy link
Collaborator

This adds the fingerprinting optimization mentioned in #2287. In some isolated and not very general tests, fingerprinting improved lookup times by about 10% for INT64, 20% for short strings (roughly the same length and with a common prefix) and 40% for long strings (about 24 characters, all roughly the same length and with a common prefix, so close to worst case for string comparison short of making the strings longer).

I also made a couple of other small optimizations:

  1. Disk array bounds checking has been changed to only occur with runtime checks enabled (disk array performance is a major bottleneck for the on-disk hash index).
  2. The hash index header is cached in memory for write transactions like it is for read transactions, and only written to the disk array/buffer manager at the end of prepareCommit (to reduce unnecessary load on the buffer manager for a small piece of fixed-sized data).

Copy link
Contributor

@ray6080 ray6080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I wonder how the lookup performance will be like if we get rid of ku_string_t in storage, which should save a bit on file space usage. Also, are there other known options for calculating the fingerprint? Should we play a bit with that, too?

@benjaminwinger
Copy link
Collaborator Author

Another option for the fingerprint would be to use a different hash function, which I think would only really have an advantage (albeit a significant one) when the hash index is large enough that the bits used for the fingerprint overlap with the bits used for the slots, but that will only start to occur when the capacity goes beyond 2^56 strings, so I don't think we should worry about it.

I suspect that removing ku_string_t won't have much of a cost at the moment, but it will have an effect on implementing multiple copy, since the persistent hash index implementation doesn't support multi-threaded appending to the overflow file, so requiring short strings to be written to the file as well as long strings will reduce performance until we can optimize concurrently appends.

Copy link

codecov bot commented Feb 20, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (7fc4519) 93.40% compared to head (eb3e2ad) 93.47%.
Report is 30 commits behind head on master.

Files Patch % Lines
src/include/storage/index/hash_index.h 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2908      +/-   ##
==========================================
+ Coverage   93.40%   93.47%   +0.07%     
==========================================
  Files        1089     1117      +28     
  Lines       42121    42728     +607     
==========================================
+ Hits        39344    39942     +598     
- Misses       2777     2786       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benjaminwinger benjaminwinger merged commit f9a7e38 into master Feb 20, 2024
15 checks passed
@benjaminwinger benjaminwinger deleted the fingerprinting branch February 20, 2024 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants