Frame of reference encoding #2140
Conversation
(force-pushed from 7df9ce5 to 9868961)
I did some quick benchmarks, and they actually showed a small improvement in performance overall. That is probably because it was a large dataset with increasing consecutive integer IDs, and compression is done for each NodeGroup separately, so each ID NodeGroup can use an offset equal to the first ID in the group, letting the IDs be stored in 17 bits (log2 of the node group size) instead of 27 bits (for the maximum ID of 100,000,000). Copies similarly showed a small performance improvement, though that was only run once and the results are variable enough that I don't really trust them:
(I'm also using btrfs with transparent zstd compression, but it only compresses the hash index and metadata files, not the data file. The hash index files won't be affected by this change, and the metadata file is too small to have a noticeable impact on performance, so while the filesystem compression might affect the total copy time, I doubt it affects the differences.)
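To illustrate the bit-width arithmetic above, here is a standalone sketch (assuming a node group size of 2^17 and a hypothetical group starting at ID 99,000,000; this is not code from the PR):

```cpp
#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t nodeGroupSize = 1ULL << 17;  // assumed node group size (131072 values)
    const uint64_t firstId = 99'000'000;        // hypothetical first ID in one NodeGroup
    const uint64_t lastId = firstId + nodeGroupSize - 1;

    // Without an offset, the bit width is set by the largest ID in the group.
    int bitsWithoutOffset = std::bit_width(lastId);         // 27 bits for IDs near 100,000,000
    // With the group's first ID as the offset, only the range within the group matters.
    int bitsWithOffset = std::bit_width(lastId - firstId);  // 17 bits for 2^17 consecutive IDs

    std::printf("%d bits without offset, %d bits with offset\n",
        bitsWithoutOffset, bitsWithOffset);
    return 0;
}
```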
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #2140 +/- ##
==========================================
- Coverage 89.63% 89.62% -0.01%
==========================================
Files 988 989 +1
Lines 35729 35740 +11
==========================================
+ Hits 32024 32031 +7
- Misses 3705 3709 +4
LGTM!
(force-pushed from 9868961 to 717b2d9)
I'm informed that this is Frame of Reference encoding (without delta coding): https://lemire.me/blog/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/
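(For a quick illustration of the distinction, using my own numbers rather than anything from the linked post: for the values 1000, 1003, 1007, frame-of-reference encoding stores 0, 3, 7 relative to the common offset 1000, while delta coding would store each value's difference from its predecessor, i.e. 1000, 3, 4.)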
This modifies the integer bitpacking compression to make use of a fixed offset when the values are all large and using an offset will reduce the number of bits required to store each value.
The offset is stored in the ColumnChunk metadata (increasing its size by 8 bytes), subtracted from values during compression, and added back during decompression.
Decompression is done in-place, but does require an extra pass over the decompressed data when the offset is non-zero. Compression when the offset is non-zero is done using a small temporary buffer.
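As a rough sketch of that scheme (illustrative only: the names are mine, not the actual ColumnChunk code, and the bitpacking step itself is elided):

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// Illustrative frame-of-reference encoding; assumes a non-empty input.
struct ForEncodedChunk {
    uint64_t offset = 0;           // stored in the chunk metadata (+8 bytes)
    uint8_t bitWidth = 0;          // bits per value after subtracting the offset
    std::vector<uint64_t> values;  // stand-in for the bitpacked payload
};

ForEncodedChunk compress(const std::vector<uint64_t>& input) {
    ForEncodedChunk chunk;
    chunk.offset = *std::min_element(input.begin(), input.end());
    chunk.values.resize(input.size());
    uint64_t maxDelta = 0;
    for (size_t i = 0; i < input.size(); i++) {
        chunk.values[i] = input[i] - chunk.offset;  // remove the offset before packing
        maxDelta = std::max(maxDelta, chunk.values[i]);
    }
    chunk.bitWidth = static_cast<uint8_t>(std::bit_width(maxDelta));
    // ... bitpack chunk.values at chunk.bitWidth bits per value ...
    return chunk;
}

void decompress(const ForEncodedChunk& chunk, std::vector<uint64_t>& out) {
    out = chunk.values;  // ... unpack at chunk.bitWidth bits per value ...
    if (chunk.offset != 0) {
        // Extra pass over the decompressed data, only needed when the offset is non-zero.
        for (auto& v : out) {
            v += chunk.offset;
        }
    }
}
```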
This should be effective at reducing the size needed for data such as timestamps, which usually fall within a relatively narrow range of values but have a relatively high minimum value, and it will still have very good random read/write performance. Large numeric IDs with a fixed number of digits should also benefit from this.
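(As a rough worked example with my own numbers, assuming microsecond-precision timestamps: values in early 2023 are around 1.67e15 microseconds since the epoch and need 51 bits each, but a whole year spans only about 3.15e13 microseconds, so with the chunk's minimum timestamp as the offset each value fits in at most 45 bits.)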
It also functionally implements constant compression for integers: if all values are the same, the data can be stored in 0 bits per value, with the actual value stored as the offset.
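For example, continuing the hypothetical names from the sketch above:

```cpp
std::vector<uint64_t> constantColumn(2048, 42);  // every value is 42
ForEncodedChunk chunk = compress(constantColumn);
// chunk.offset == 42 and chunk.bitWidth == 0: only the offset stored in the
// metadata is needed to reconstruct the column.
```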
I'll try and put together a quick benchmark with a generated dataset tailored to show the performance difference when this is used.