Persistent hash index performance improvements #2908

benjaminwinger · 2024-02-16T23:22:20Z

This adds the fingerprinting optimization mentioned in #2287. In a few isolated and not very general tests, fingerprinting improved lookup times by about 10% for INT64, 20% for short strings (roughly the same length and with a common prefix), and 40% for long strings (about 24 characters, all roughly the same length and with a common prefix, so close to the worst case for string comparison short of making the strings longer).
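For readers unfamiliar with the technique, here is a minimal sketch of fingerprinting in a hash slot. It is not the code from this PR: the slot layout, `SLOT_CAPACITY`, `SketchSlot`, `insertEntry`, and `lookupEntry` are hypothetical names, and taking the fingerprint from the top 8 bits of a 64-bit hash is an assumption. The idea is that a one-byte comparison filters out most non-matching entries before the more expensive full key comparison runs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical slot size; not the on-disk layout used by the real index.
constexpr std::size_t SLOT_CAPACITY = 4;

// One slot holds a small fixed number of entries plus one fingerprint byte per entry.
struct SketchSlot {
    std::array<uint8_t, SLOT_CAPACITY> fingerprints{};
    std::array<std::string, SLOT_CAPACITY> keys{};
    std::array<uint64_t, SLOT_CAPACITY> values{};
    std::size_t numEntries = 0;
};

// Assumption: the fingerprint is the top 8 bits of the 64-bit hash.
inline uint8_t fingerprintOf(uint64_t hash) {
    return static_cast<uint8_t>(hash >> 56);
}

// Appends an entry to the slot; returns false if the slot is already full.
inline bool insertEntry(SketchSlot& slot, const std::string& key, uint64_t hash,
    uint64_t value) {
    assert(slot.numEntries <= SLOT_CAPACITY);
    if (slot.numEntries == SLOT_CAPACITY) {
        return false;
    }
    const std::size_t pos = slot.numEntries++;
    slot.fingerprints[pos] = fingerprintOf(hash);
    slot.keys[pos] = key;
    slot.values[pos] = value;
    return true;
}

// Looks up a key. The full key comparison only runs when the fingerprint matches;
// fullKeyComparisons counts how often that happens.
inline bool lookupEntry(const SketchSlot& slot, const std::string& key, uint64_t hash,
    uint64_t& valueOut, std::size_t& fullKeyComparisons) {
    const uint8_t fingerprint = fingerprintOf(hash);
    for (std::size_t i = 0; i < slot.numEntries; ++i) {
        if (slot.fingerprints[i] != fingerprint) {
            continue; // cheap one-byte rejection, no key comparison needed
        }
        ++fullKeyComparisons;
        if (slot.keys[i] == key) {
            valueOut = slot.values[i];
            return true;
        }
    }
    return false;
}
```

The hash is passed in explicitly so the caller can reuse the value already computed to pick the slot. Keys that share a long common prefix are where skipping the full comparison should matter most, which fits the larger gains reported for long strings.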

I also made a couple of other small optimizations:

- Disk array bounds checking has been changed to only occur with runtime checks enabled (disk array performance is a major bottleneck for the on-disk hash index).
- The hash index header is cached in memory for write transactions like it is for read transactions, and only written to the disk array/buffer manager at the end of prepareCommit (to reduce unnecessary load on the buffer manager for a small piece of fixed-sized data).
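The two smaller changes can be illustrated with a toy sketch. Everything here is hypothetical and only meant to show the shape of the idea: `SKETCH_RUNTIME_CHECKS` stands in for whatever build option enables runtime checks, `SketchDiskArray` is an in-memory stand-in for a disk array, and `SketchHeaderCache` models a header reduced to a single counter stored at index 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical switch standing in for a "runtime checks" build option.
#ifndef SKETCH_RUNTIME_CHECKS
#define SKETCH_RUNTIME_CHECKS 1
#endif

// Toy stand-in for a disk array; a real one would go through a buffer manager.
class SketchDiskArray {
public:
    explicit SketchDiskArray(std::size_t numElements) : elements(numElements, 0) {}

    uint64_t get(std::size_t idx) const {
        checkBounds(idx);
        return elements[idx];
    }

    void update(std::size_t idx, uint64_t value) {
        checkBounds(idx);
        elements[idx] = value;
        ++numWrites;
    }

    // Number of writes that reached the backing storage.
    std::size_t writes() const { return numWrites; }

private:
    void checkBounds(std::size_t idx) const {
#if SKETCH_RUNTIME_CHECKS
        // Only compiled in when runtime checks are enabled.
        if (idx >= elements.size()) {
            throw std::out_of_range("disk array index out of range");
        }
#else
        (void)idx; // no check on the hot path in builds without runtime checks
#endif
    }

    std::vector<uint64_t> elements;
    std::size_t numWrites = 0;
};

// Keeps the header in memory during a write transaction and persists it once.
class SketchHeaderCache {
public:
    explicit SketchHeaderCache(SketchDiskArray& disk)
        : disk(disk), cachedNumEntries(disk.get(0)) {}

    // Updates touch only the in-memory copy.
    void incrementNumEntries() {
        ++cachedNumEntries;
        dirty = true;
    }

    // Writes the header back a single time, at the end of the transaction.
    void prepareCommit() {
        if (dirty) {
            disk.update(0, cachedNumEntries);
            dirty = false;
        }
        assert(!dirty);
    }

private:
    SketchDiskArray& disk;
    uint64_t cachedNumEntries;
    bool dirty = false;
};
```

With this shape, a transaction that inserts many entries updates the header in memory each time but causes one write to the backing array, and the bounds check costs nothing on the lookup path when `SKETCH_RUNTIME_CHECKS` is set to 0.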

ray6080

LGTM. I wonder what the lookup performance would be like if we got rid of ku_string_t in storage, which should save a bit on file space usage. Also, are there other known options for calculating the fingerprint? Should we play a bit with that, too?

benjaminwinger · 2024-02-20T16:07:50Z

Another option for the fingerprint would be to use a different hash function. I think that would only really have an advantage (albeit a significant one) when the hash index is large enough that the bits used for the fingerprint overlap with the bits used for the slots. That will only start to occur when the capacity goes beyond 2^56 strings, so I don't think we should worry about it.
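The overlap condition can be worked through with a small sketch. It assumes a 64-bit hash, an 8-bit fingerprint taken from the top bits, and a slot id taken from the low bits; those widths are inferred from the 2^56 figure rather than taken from the code, and the helper names are hypothetical. The sketch counts slots rather than strings, so it only approximates the capacity bound stated above.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned HASH_BITS = 64;
// Assumption: the fingerprint occupies the top 8 bits of the hash.
constexpr unsigned FINGERPRINT_BITS = 8;

// Number of low hash bits needed to address numSlots slots, i.e. ceil(log2(numSlots)).
inline unsigned slotBits(uint64_t numSlots) {
    assert(numSlots >= 1);
    unsigned bits = 0;
    while (bits < HASH_BITS && (uint64_t{1} << bits) < numSlots) {
        ++bits;
    }
    return bits;
}

// True when the slot id would start consuming bits that also feed the fingerprint.
inline bool fingerprintOverlapsSlotBits(uint64_t numSlots) {
    return slotBits(numSlots) > HASH_BITS - FINGERPRINT_BITS;
}
```

Under these assumptions, an index with 2^20 slots uses 20 low bits and leaves the fingerprint bits independent. Overlap only begins once more than 2^56 slots are needed, at which point entries in the same slot would start sharing fingerprint bits and a second hash function would help.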

I suspect that removing ku_string_t won't have much of a cost at the moment, but it will have an effect on implementing multiple copy. The persistent hash index implementation doesn't support multi-threaded appending to the overflow file, so requiring short strings to be written to the file as well as long strings will reduce performance until we can optimize concurrent appends.

codecov · 2024-02-20T16:28:43Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (7fc4519) 93.40% compared to head (eb3e2ad) 93.47%.
Report is 30 commits behind head on master.

Files	Patch %	Lines
src/include/storage/index/hash_index.h	83.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2908      +/-   ##
==========================================
+ Coverage   93.40%   93.47%   +0.07%     
==========================================
  Files        1089     1117      +28     
  Lines       42121    42728     +607     
==========================================
+ Hits        39344    39942     +598     
- Misses       2777     2786       +9


benjaminwinger added 2 commits February 16, 2024 18:22

Check disk array out of bounds only with runtime checks enabled

59a352c

Cache the hash index header for write transactions

d7abc3f

benjaminwinger force-pushed the fingerprinting branch from b61773a to c7a04a4 Compare February 16, 2024 23:22

ray6080 approved these changes Feb 17, 2024

benjaminwinger force-pushed the fingerprinting branch from c7a04a4 to 8640064 Compare February 20, 2024 16:12

Hash index fingerprinting

eb3e2ad

benjaminwinger force-pushed the fingerprinting branch from 8640064 to eb3e2ad Compare February 20, 2024 16:45

benjaminwinger merged commit f9a7e38 into master Feb 20, 2024
15 checks passed

benjaminwinger deleted the fingerprinting branch February 20, 2024 22:09
