Frame of reference encoding #2140

Merged
benjaminwinger merged 1 commit into master from fixed-delta on Oct 6, 2023

Conversation

benjaminwinger (Collaborator)

This modifies the integer bitpacking compression to use a fixed offset when all values are large and subtracting the offset reduces the number of bits required to store each value.
The offset is stored in the ColumnChunk metadata (increasing its size by 8 bytes); it is subtracted from values during compression and added back during decompression.

Decompression is done in place, but requires an extra pass over the decompressed data when the offset is non-zero. Compression with a non-zero offset uses a small temporary buffer.
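Roughly, the scheme looks like this (a minimal sketch; the names FORMetadata, analyze, subtractOffset, and addOffset are illustrative rather than the actual API, and the real code bitpacks values into bitWidth-sized slots instead of whole words):

```cpp
#include <algorithm>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the frame-of-reference scheme described above.
struct FORMetadata {
    uint64_t offset;  // stored in the ColumnChunk metadata (+8 bytes)
    uint8_t bitWidth; // bits per value once the offset is subtracted
};

// Assumes values is non-empty.
FORMetadata analyze(const std::vector<uint64_t>& values) {
    auto [minIt, maxIt] = std::minmax_element(values.begin(), values.end());
    uint64_t range = *maxIt - *minIt;
    // bit_width(range) bits are enough to store any value in [0, range].
    return {*minIt, static_cast<uint8_t>(std::bit_width(range))};
}

// Compression: subtract the offset into a small temporary buffer, then bitpack.
void subtractOffset(const uint64_t* in, uint64_t* tmp, size_t n, uint64_t offset) {
    for (size_t i = 0; i < n; i++) tmp[i] = in[i] - offset;
}

// Decompression: after unpacking in place, a second pass adds the offset back;
// the pass is skipped entirely when the offset is zero.
void addOffset(uint64_t* data, size_t n, uint64_t offset) {
    if (offset == 0) return;
    for (size_t i = 0; i < n; i++) data[i] += offset;
}
```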

This should be effective at reducing the space needed for data such as timestamps, which usually fall into a relatively small range of values but with a relatively high minimum, while still providing very good random read/write performance. Large numeric IDs with a fixed number of digits should also benefit.
It also effectively implements constant compression for integers: if all values are the same, the data can be stored in 0 bits per value, with the actual value stored as the offset.
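The constant case falls out of the same sketch above: when min == max the range is zero, std::bit_width(0) returns 0, and decompression just broadcasts the offset (decompressConstant is an illustrative name):

```cpp
// Constant case: bitWidth == 0, so no bits are packed per value;
// decompression fills the output with the offset, i.e. the shared value.
void decompressConstant(uint64_t* out, size_t n, uint64_t offset) {
    std::fill(out, out + n, offset);
}
```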

I'll try to put together a quick benchmark with a generated dataset tailored to show the performance difference when this is used.

benjaminwinger (Collaborator, Author)

I did some quick benchmarks, and they actually showed a small improvement in performance in general. This is probably because the dataset was large, with increasing consecutive integer IDs, and compression is done for each NodeGroup separately: each ID NodeGroup can use an offset equal to the first ID in the group, letting the IDs be stored in 17 bits (log2 of the node group size) instead of the 27 bits needed for the maximum ID of 100,000,000.

offset-compression.pdf
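To make the arithmetic above concrete, here is the bit-width calculation as a small standalone program (assuming a node group size of 2^17 = 131,072 rows, which is what the 17-bit figure implies):

```cpp
#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    // Without an offset, every ID up to 100,000,000 needs 27 bits.
    uint64_t maxId = 100'000'000;
    printf("%d bits\n", (int)std::bit_width(maxId)); // 27

    // With a per-NodeGroup offset equal to the group's first ID, packed
    // values fall in [0, nodeGroupSize), so 17 bits each suffice.
    uint64_t nodeGroupSize = uint64_t{1} << 17; // assumed: 131,072 rows
    printf("%d bits\n", (int)std::bit_width(nodeGroupSize - 1)); // 17
}
```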

Copies similarly showed a small performance improvement, though I only ran them once and the results are variable enough that I don't fully trust them:

  • Empty data: 35,098 ms -> 28,307 ms
  • Large offset + values 0-100: 30,074 ms -> 29,108 ms
  • Small values from 0-100: 35,584 ms -> 31,514 ms
  • Random 64-bit: 36,130 ms -> 28,935 ms

(I'm also using btrfs with transparent zstd compression, but it only compresses the hash index and metadata files, not the data file. The hash index files aren't affected by this change, and the metadata file is too small to have a noticeable impact on performance, so while the filesystem compression might affect the total copy time, I doubt it affects the differences.)

codecov bot commented Oct 4, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (ede12a7) 89.63% compared to head (717b2d9) 89.62%.
Report is 15 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2140      +/-   ##
==========================================
- Coverage   89.63%   89.62%   -0.01%     
==========================================
  Files         988      989       +1     
  Lines       35729    35740      +11     
==========================================
+ Hits        32024    32031       +7     
- Misses       3705     3709       +4     
Files                                     Coverage Δ
src/include/storage/store/compression.h  96.22% <100.00%> (ø)
src/storage/store/compression.cpp         75.00% <95.83%> (+1.73%) ⬆️

... and 23 files with indirect coverage changes


ray6080 self-requested a review on October 5, 2023
ray6080 (Contributor) left a comment:

LGTM!

benjaminwinger (Collaborator, Author)

I'm informed that this is Frame of Reference encoding (without delta coding): https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
I've updated some of the terminology in this PR to reflect that.
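For anyone unfamiliar with the distinction, a minimal sketch (not code from this PR): frame of reference encodes every value against one shared base, which keeps O(1) random access, while delta coding encodes each value against its predecessor, so random reads need a prefix sum:

```cpp
#include <cstddef>
#include <cstdint>

// Frame of reference: decoding element i is just out[i] + base.
void forEncode(const uint64_t* in, uint64_t* out, size_t n, uint64_t base) {
    for (size_t i = 0; i < n; i++) out[i] = in[i] - base;
}

// Delta coding (not used here): decoding element i needs a sum over out[0..i].
// Iterating backwards makes this safe even when in == out.
void deltaEncode(const uint64_t* in, uint64_t* out, size_t n) {
    for (size_t i = n; i-- > 1;) out[i] = in[i] - in[i - 1];
    if (n > 0) out[0] = in[0];
}
```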

benjaminwinger changed the title from Fixed delta compression to Frame of reference encoding on Oct 5, 2023
benjaminwinger merged commit 2a19754 into master on Oct 6, 2023 (11 checks passed)
benjaminwinger deleted the fixed-delta branch on October 6, 2023 at 13:41