Abstract Compression interface and initial integer packing #2004
Conversation
I just noticed that FastPFOR's bitpackingaligned is 17,000 lines of code. We're not even using it right now, so I've removed the 24- and 32-bit versions. Alternatively, DuckDB implemented a 16-bit version based on the fastunpack/fastpack routines in bitpacking.hpp. It might be worth benchmarking and comparing them, but I think DuckDB's version would at least be a lot smaller, since bitpackingaligned.cpp with just the 8- and 16-bit versions is still 5,000 lines of code.
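For context, the core operation these routines provide is packing a fixed group of integers at a given bit width. A minimal scalar sketch of that operation (not FastPFOR's or DuckDB's actual code, which is fully unrolled per bit width and hence enormous) might look like:

```cpp
#include <cstdint>
#include <cstddef>

// Scalar sketch of fixed-width bit packing: writes `count` values, `bitWidth`
// bits each, into `out`. `out` must be zero-initialized and hold at least
// (count * bitWidth + 7) / 8 bytes. Real fastpack/fastunpack kernels are
// unrolled per bit width, which is where the thousands of generated lines
// come from.
void packBits(const uint32_t* in, size_t count, uint8_t bitWidth, uint8_t* out) {
    for (size_t i = 0; i < count; i++) {
        size_t bitPos = i * bitWidth;
        size_t offsetInByte = bitPos % 8;
        // Keep only the low bitWidth bits, shifted to the bit offset within the first byte.
        uint64_t value = (static_cast<uint64_t>(in[i]) & ((1ull << bitWidth) - 1)) << offsetInByte;
        // A value can straddle up to five bytes for widths <= 32.
        size_t numBytes = (bitWidth + offsetInByte + 7) / 8;
        for (size_t b = 0; b < numBytes; b++) {
            out[bitPos / 8 + b] |= static_cast<uint8_t>(value >> (8 * b));
        }
    }
}
```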
I think the remaining issue is with newly created pages, as they don't have the header with the type of compression. For now, I think the most straightforward solution would be to have a small buffer (even just one byte) of reserved data in the ColumnChunkMetadata, which the compression algorithm can use to store and access that sort of thing, and to drop the idea of storing extra metadata along with the data.
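As a rough illustration of that idea (the field names and layout here are hypothetical, not the actual ColumnChunkMetadata definition):

```cpp
#include <cstdint>

// Hypothetical sketch: ColumnChunkMetadata reserves a tiny fixed buffer that the
// compression algorithm can use for its own bookkeeping (e.g. bitpacking could
// record the chosen bit width and whether negative values are present), so
// newly created pages don't need an in-page compression header.
struct ColumnChunkMetadata {
    uint64_t startPageIdx = 0;        // hypothetical field
    uint64_t numPages = 0;            // hypothetical field
    uint64_t numValuesPerPage = 0;    // hypothetical field
    uint8_t compressionData[1] = {0}; // reserved for the compression algorithm
};
```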
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
## master #2004 +/- ##
==========================================
- Coverage 90.25% 90.21% -0.05%
==========================================
Files 963 966 +3
Lines 34654 35019 +365
==========================================
+ Hits 31278 31592 +314
- Misses 3376 3427 +51
☔ View full report in Codecov by Sentry.
In the middle of review. As discussed, here are the TODOs for this PR:
- Move from PhysicalMapping to a functor-based approach.
- Fix compilation on Apple Silicon (Mx chip) Macs.
- Fix remaining bugs.
- Get benchmark results on COPY performance and compressed file sizes.
- Increment the project version, so we can trigger a re-run of benchmark loading.
- Can you open an issue noting that we will add an ENABLE_COMPRESSION config option at the database level?
Let me know when you finish those TODOs, and we can do another round. Thanks!
Copy benchmarks:

Unfortunately I think reading and parsing the CSV files is probably the bulk of the time spent copying at the moment, given the large difference in speed between this and the scan benchmarks and the difference in speed between 0- and 64-bit values, but there is enough of a difference between the results to draw some conclusions. The dataset I used was generated with 1,000,000 nodes with an id and a random signed integer. The results are an average of 50 runs, benchmarked using hyperfine and kuzu_shell.

Master branch:
- 64-bit values: 16MB data, 588.2 ms ± 4.7 ms

Uncompressed:
- 64-bit values: 15MB data, 592.6 ms ± 6.0 ms

Bitpacking (note that the ID column always has a bit width of 20 for 1,000,000 nodes, so really we're dealing with 84/52/20 bits of data, compared to 128 bits in the above cases):
- 64-bit values: 10MB data, 596.1 ms ± 5.3 ms

Scan benchmarks:

It was harder to see the differences here on a small dataset, so I ended up benchmarking a dataset with 100,000,000 nodes. There is an odd outlier in the scan benchmarks: scanning positive 63-bit numbers is significantly slower than scanning a mix of positive and negative 63-bit numbers, where I would expect it to be the other way around, given how unpacking signed numbers is implemented. It's possible that this is related to an outstanding bug in the implementation.
Can you document the TODOs following this PR somewhere? In my mind, I think we should work on int16 and perhaps int8, and also fix updating the bit width to a larger size and updating unsigned values to signed.
I've added them to the PR description.
So far this just sets up an interface for compressing data on disk. Edit: the final version adds integer packing for int32, int64, uint32, and uint64.
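Roughly speaking, the interface boils down to something like the sketch below (class and method names here are approximations for illustration, not necessarily the exact signatures in the PR):

```cpp
#include <cstdint>

// Approximate sketch of the abstract compression interface: each algorithm
// compresses at most one page worth of values per call and reports how many
// values fit on a page, so the column code can stay agnostic of the encoding.
class CompressionAlg {
public:
    virtual ~CompressionAlg() = default;

    // Number of values that fit on a single page once compressed.
    virtual uint64_t numValuesPerPage(uint32_t pageSize) const = 0;

    // Compress up to one page of values from `src` into `dst`, advancing `src`
    // and returning the number of values consumed.
    virtual uint64_t compressNextPage(const uint8_t*& src, uint64_t numRemainingValues,
        uint8_t* dst, uint64_t dstSize) const = 0;

    // Decompress `numValues` values starting at position `srcOffset` in a page.
    virtual void decompressFromPage(const uint8_t* src, uint64_t srcOffset, uint8_t* dst,
        uint64_t dstOffset, uint64_t numValues) const = 0;
};
```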
Compressed data must have a fixed number of values per page, but the number of values can vary per ColumnChunk (now stored in the ColumnChunkMetadata, where before it was a fixed value for a given type).
Additional compression-specific metadata is stored in a small fixed-size buffer (currently just one byte) inside the ColumnChunkMetadata's new CompressionMetadata substructure.
Compressed data needs to be aligned to the start of each page, since data is read relative to the beginning of a page, so the functions called to compress data work one page at a time.
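Under those constraints, the write path presumably loops over pages along these lines (a hedged sketch assuming a compressNextPage-style method like the one sketched above, not the PR's actual code):

```cpp
#include <cstdint>
#include <vector>

// Hedged sketch of a page-aligned write loop: each iteration compresses at most
// one page worth of values into a fresh page-sized buffer, so every page's data
// starts at a page boundary and can be read relative to the start of that page.
// Assumes the algorithm always fits at least one value per page.
template<typename Alg>
std::vector<std::vector<uint8_t>> flushChunk(const Alg& alg, const uint8_t* values,
    uint64_t numValues, uint32_t pageSize) {
    std::vector<std::vector<uint8_t>> pages;
    while (numValues > 0) {
        std::vector<uint8_t> page(pageSize, 0);
        // compressNextPage advances `values` and returns the number of values consumed.
        uint64_t consumed = alg.compressNextPage(values, numValues, page.data(), pageSize);
        numValues -= consumed;
        pages.push_back(std::move(page));
    }
    return pages;
}
```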
I've implemented an IntegerBitpacking class for the interface with an implementation using FastPFOR's fastpack/fastunpack functions, combined with DuckDB's SignExtend function used for negative values. Currently this only works on 32 and 64-bit integers as the interface for 16-bit fastpack/fastunpack is separate, but that should still be relatively simple to set up.
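The sign-extension step can be illustrated with a small helper along these lines (a generic sketch of the idea, not DuckDB's actual SignExtend implementation):

```cpp
#include <cstdint>

// After unpacking a value that was stored as a `bitWidth`-bit two's-complement
// integer, fill the bits above the bit width with the sign bit to recover the
// original negative value. Assumes 1 <= bitWidth <= 64.
int64_t signExtend(uint64_t packed, uint8_t bitWidth) {
    uint64_t signBit = 1ull << (bitWidth - 1);
    if (packed & signBit) {
        packed |= ~((signBit << 1) - 1); // set all bits above bitWidth
    }
    return static_cast<int64_t>(packed);
}
// e.g. signExtend(0b101, 3) == -3, since 101 is -3 in 3-bit two's complement.
```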
The BooleanCompression class is used just for decompression via the same interface, since boolean values are stored packed in memory in BoolColumnChunk, unlike CompressedColumnChunk which stores values uncompressed in memory and compresses them when writing the data (to avoid having to re-compress when collecting data).
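Since booleans are already bit-packed in memory, the decompression side is essentially just a bit lookup, something like this sketch (illustrative, not the exact BoolColumnChunk code):

```cpp
#include <cstdint>

// Illustrative sketch: reading boolean value `posInPage` from a page that stores
// booleans one bit per value, which is why BooleanCompression only needs to
// implement the decompression half of the interface.
bool readBool(const uint8_t* page, uint64_t posInPage) {
    return (page[posInPage / 8] >> (posInPage % 8)) & 1;
}
```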
Integer bitpacking isn't currently fully working, so I have integers just using "CopyCompression" instead. Partly this is because re-compression and relocation aren't implemented when we update with data that can't fit inside the current bit width, but there are also some tests which aren't triggering the exceptions that check for that case.

TODO (remaining items to be moved to an issue when this is merged):