Frame of reference encoding #2140
Conversation
(force-pushed from 7df9ce5 to 9868961)
I did some quick benchmarks, and they actually showed a small improvement in performance overall. That is probably because it was a large dataset with increasing consecutive integer IDs, and compression is done for each NodeGroup separately, so each ID NodeGroup can use an offset equal to the first ID in the group, letting the IDs be stored in 17 bits (log2 of the node group size) instead of 27 bits (for the maximum ID of 100,000,000). Copies similarly showed a small performance improvement, though that was only run once and the results are variable enough that I don't really trust them:
(I'm also using btrfs with transparent zstd compression, but it only compresses the hash index and metadata files, not the data file. The hash index files won't be affected by this change, and the metadata file is too small to have a noticeable impact on performance, so while the filesystem compression might affect the total copy time, I doubt it affects the differences.)
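To illustrate the bit-width arithmetic above, here is a standalone sketch (assuming a node group size of 2^17 and a hypothetical group starting at ID 99,000,000; this is not code from the PR):

```cpp
#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t nodeGroupSize = 1ULL << 17;  // assumed node group size (131072 values)
    const uint64_t firstId = 99'000'000;        // hypothetical first ID in one NodeGroup
    const uint64_t lastId = firstId + nodeGroupSize - 1;

    // Without an offset, the bit width is set by the largest ID in the group.
    int bitsWithoutOffset = std::bit_width(lastId);         // 27 bits for IDs near 100,000,000
    // With the group's first ID as the offset, only the range within the group matters.
    int bitsWithOffset = std::bit_width(lastId - firstId);  // 17 bits for 2^17 consecutive IDs

    std::printf("%d bits without offset, %d bits with offset\n",
        bitsWithoutOffset, bitsWithOffset);
    return 0;
}
```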
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #2140 +/- ##
==========================================
- Coverage 89.63% 89.62% -0.01%
==========================================
Files 988 989 +1
Lines 35729 35740 +11
==========================================
+ Hits 32024 32031 +7
- Misses 3705 3709 +4
LGTM!
(force-pushed from 9868961 to 717b2d9)
I'm informed that this is Frame of Reference encoding (without delta coding): https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
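(For a quick illustration of the distinction, using my own numbers rather than anything from the linked post: for the values 1000, 1003, 1007, frame-of-reference encoding stores 0, 3, 7 relative to the common offset 1000, while delta coding would store each value's difference from its predecessor, i.e. 1000, 3, 4.)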
This modifies the integer bitpacking compression to make use of a fixed offset when the values are all large and using an offset will reduce the number of bits required to store each value.
The offset is stored in the ColumnChunk metadata (increasing its size by 8 bytes), subtracted from values during compression, and added back during decompression.
Decompression is done in-place, but does require an extra pass over the decompressed data when the offset is non-zero. Compression when the offset is non-zero is done using a small temporary buffer.
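As a rough sketch of that scheme (illustrative only: the names are mine, not the actual ColumnChunk code, and the bitpacking step itself is elided):

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// Illustrative frame-of-reference encoding; assumes a non-empty input.
struct ForEncodedChunk {
    uint64_t offset = 0;           // stored in the chunk metadata (+8 bytes)
    uint8_t bitWidth = 0;          // bits per value after subtracting the offset
    std::vector<uint64_t> values;  // stand-in for the bitpacked payload
};

ForEncodedChunk compress(const std::vector<uint64_t>& input) {
    ForEncodedChunk chunk;
    chunk.offset = *std::min_element(input.begin(), input.end());
    chunk.values.resize(input.size());
    uint64_t maxDelta = 0;
    for (size_t i = 0; i < input.size(); i++) {
        chunk.values[i] = input[i] - chunk.offset;  // remove the offset before packing
        maxDelta = std::max(maxDelta, chunk.values[i]);
    }
    chunk.bitWidth = static_cast<uint8_t>(std::bit_width(maxDelta));
    // ... bitpack chunk.values at chunk.bitWidth bits per value ...
    return chunk;
}

void decompress(const ForEncodedChunk& chunk, std::vector<uint64_t>& out) {
    out = chunk.values;  // ... unpack at chunk.bitWidth bits per value ...
    if (chunk.offset != 0) {
        // Extra pass over the decompressed data, only needed when the offset is non-zero.
        for (auto& v : out) {
            v += chunk.offset;
        }
    }
}
```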
This should be effective at reducing the size needed for data such as timestamps, which usually fall within a relatively narrow range of values but have a relatively high minimum value, and it will still have very good random read/write performance. Large numeric IDs with a fixed number of digits should also benefit from this.
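(As a rough worked example with my own numbers, assuming microsecond-precision timestamps: values in early 2023 are around 1.67e15 microseconds since the epoch and need 51 bits each, but a whole year spans only about 3.15e13 microseconds, so with the chunk's minimum timestamp as the offset each value fits in at most 45 bits.)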
It also functionally implements constant compression for integers: if all values are the same, the data can be stored in 0 bits per value, with the actual value stored as the offset.
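For example, continuing the hypothetical names from the sketch above:

```cpp
std::vector<uint64_t> constantColumn(2048, 42);  // every value is 42
ForEncodedChunk chunk = compress(constantColumn);
// chunk.offset == 42 and chunk.bitWidth == 0: only the offset stored in the
// metadata is needed to reconstruct the column.
```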
I'll try and put together a quick benchmark with a generated dataset tailored to show the performance difference when this is used.