Abstract Compression interface and initial integer packing #2004
Conversation
I just noticed that FastPFOR's bitpackingaligned is 17,000 lines of code. We're not even using it right now, so I've removed the 24- and 32-bit versions. Alternatively, DuckDB implemented a 16-bit version based on the fastunpack/fastpack routines in bitpacking.hpp. It might be worth benchmarking and comparing them, but I think DuckDB's version would at least be a lot smaller, since bitpackingaligned.cpp with just the 8- and 16-bit versions is still 5,000 lines of code.
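For context, the core operation these routines provide is packing a fixed group of integers at a given bit width. A minimal scalar sketch of that operation (not FastPFOR's or DuckDB's actual code, which is fully unrolled per bit width and hence enormous) might look like:

```cpp
#include <cstdint>
#include <cstddef>

// Scalar sketch of fixed-width bit packing: writes `count` values, `bitWidth`
// bits each, into `out`. `out` must be zero-initialized and hold at least
// (count * bitWidth + 7) / 8 bytes. Real fastpack/fastunpack kernels are
// unrolled per bit width, which is where the thousands of generated lines
// come from.
void packBits(const uint32_t* in, size_t count, uint8_t bitWidth, uint8_t* out) {
    for (size_t i = 0; i < count; i++) {
        size_t bitPos = i * bitWidth;
        size_t offsetInByte = bitPos % 8;
        // Keep only the low bitWidth bits, shifted to the bit offset within the first byte.
        uint64_t value = (static_cast<uint64_t>(in[i]) & ((1ull << bitWidth) - 1)) << offsetInByte;
        // A value can straddle up to five bytes for widths <= 32.
        size_t numBytes = (bitWidth + offsetInByte + 7) / 8;
        for (size_t b = 0; b < numBytes; b++) {
            out[bitPos / 8 + b] |= static_cast<uint8_t>(value >> (8 * b));
        }
    }
}
```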
I think the remaining issue is with newly created pages, as they don't have the header with the type of compression. For now, I think the most straightforward solution would be to have a small buffer (even just one byte) of reserved data in the ColumnChunkMetadata, which the compression algorithm can use to store and access that sort of thing, and to drop the idea of storing extra metadata along with the data.
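As a rough illustration of that idea (the field names and layout here are hypothetical, not the actual ColumnChunkMetadata definition):

```cpp
#include <cstdint>

// Hypothetical sketch: ColumnChunkMetadata reserves a tiny fixed buffer that the
// compression algorithm can use for its own bookkeeping (e.g. bitpacking could
// record the chosen bit width and whether negative values are present), so
// newly created pages don't need an in-page compression header.
struct ColumnChunkMetadata {
    uint64_t startPageIdx = 0;        // hypothetical field
    uint64_t numPages = 0;            // hypothetical field
    uint64_t numValuesPerPage = 0;    // hypothetical field
    uint8_t compressionData[1] = {0}; // reserved for the compression algorithm
};
```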
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
## master #2004 +/- ##
==========================================
- Coverage 90.25% 90.21% -0.05%
==========================================
Files 963 966 +3
Lines 34654 35019 +365
==========================================
+ Hits 31278 31592 +314
- Misses 3376 3427 +51
☔ View full report in Codecov by Sentry.
In the middle of review. As discussed, here are the TODOs for this PR:
- Move from PhysicalMapping to a functor-based approach.
- Fix compilation on Apple Silicon (Mx chip) Macs.
- Fix remaining bugs.
- Get benchmark results on COPY performance and compressed file sizes.
- Increment the project version, so we can trigger a re-run of benchmark loading.
- Can you open an issue noting that we will add an ENABLE_COMPRESSION config option at the database level?
Let me know when you finish those TODOs, and we can do another round. Thanks!
Copy benchmarks:

Unfortunately I think reading and parsing the CSV files is probably the bulk of the time spent copying at the moment, given the large difference in speed between this and the scan benchmarks and the difference in speed between 0- and 64-bit values, but there is enough of a difference between the results to draw some conclusions. The dataset I used was generated with 1,000,000 nodes with an id and a random signed integer. The results are an average of 50 runs, benchmarked using hyperfine and kuzu_shell.

Master branch:
- 64-bit values: 16MB data, 588.2 ms ± 4.7 ms

Uncompressed:
- 64-bit values: 15MB data, 592.6 ms ± 6.0 ms

Bitpacking (note that the ID column always has a bit width of 20 for 1,000,000 nodes, so really we're dealing with 84/52/20 bits of data, compared to 128 bits in the above cases):
- 64-bit values: 10MB data, 596.1 ms ± 5.3 ms

Scan benchmarks:

It was harder to see the differences here on a small dataset, so I ended up benchmarking a dataset with 100,000,000 nodes. There is an odd outlier in the scan benchmarks: scanning positive 63-bit numbers is significantly slower than scanning a mix of positive and negative 63-bit numbers, where I would expect it to be the other way around, given how unpacking signed numbers is implemented. It's possible that this is related to an outstanding bug in the implementation.
Can you document the TODOs following this PR somewhere? In my mind, I think we should work on int16 and perhaps int8, and also fix updating the bit width to a larger size and updating unsigned values to signed.
I've added them to the PR description.
So far this just sets up an interface for compressing data on disk. Edit: the final version adds integer packing for int32, int64, uint32, and uint64.
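Roughly speaking, the interface boils down to something like the sketch below (class and method names here are approximations for illustration, not necessarily the exact signatures in the PR):

```cpp
#include <cstdint>

// Approximate sketch of the abstract compression interface: each algorithm
// compresses at most one page worth of values per call and reports how many
// values fit on a page, so the column code can stay agnostic of the encoding.
class CompressionAlg {
public:
    virtual ~CompressionAlg() = default;

    // Number of values that fit on a single page once compressed.
    virtual uint64_t numValuesPerPage(uint32_t pageSize) const = 0;

    // Compress up to one page of values from `src` into `dst`, advancing `src`
    // and returning the number of values consumed.
    virtual uint64_t compressNextPage(const uint8_t*& src, uint64_t numRemainingValues,
        uint8_t* dst, uint64_t dstSize) const = 0;

    // Decompress `numValues` values starting at position `srcOffset` in a page.
    virtual void decompressFromPage(const uint8_t* src, uint64_t srcOffset, uint8_t* dst,
        uint64_t dstOffset, uint64_t numValues) const = 0;
};
```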
Compressed data must have a fixed number of values per page, but the number of values can vary per ColumnChunk (now stored in the ColumnChunkMetadata, where before it was a fixed value for a given type).
Additional compression-specific metadata is stored in a small fixed-size buffer (currently just one byte) inside the ColumnChunkMetadata's new CompressionMetadata substructure.
Compressed data needs to be aligned to the start of each page, since data is read relative to the beginning of a page, so the functions called to compress data work one page at a time.
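Under those constraints, the write path presumably loops over pages along these lines (a hedged sketch assuming a compressNextPage-style method like the one sketched above, not the PR's actual code):

```cpp
#include <cstdint>
#include <vector>

// Hedged sketch of a page-aligned write loop: each iteration compresses at most
// one page worth of values into a fresh page-sized buffer, so every page's data
// starts at a page boundary and can be read relative to the start of that page.
// Assumes the algorithm always fits at least one value per page.
template<typename Alg>
std::vector<std::vector<uint8_t>> flushChunk(const Alg& alg, const uint8_t* values,
    uint64_t numValues, uint32_t pageSize) {
    std::vector<std::vector<uint8_t>> pages;
    while (numValues > 0) {
        std::vector<uint8_t> page(pageSize, 0);
        // compressNextPage advances `values` and returns the number of values consumed.
        uint64_t consumed = alg.compressNextPage(values, numValues, page.data(), pageSize);
        numValues -= consumed;
        pages.push_back(std::move(page));
    }
    return pages;
}
```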
I've implemented an IntegerBitpacking class for the interface with an implementation using FastPFOR's fastpack/fastunpack functions, combined with DuckDB's SignExtend function used for negative values. Currently this only works on 32 and 64-bit integers as the interface for 16-bit fastpack/fastunpack is separate, but that should still be relatively simple to set up.
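The sign-extension step can be illustrated with a small helper along these lines (a generic sketch of the idea, not DuckDB's actual SignExtend implementation):

```cpp
#include <cstdint>

// After unpacking a value that was stored as a `bitWidth`-bit two's-complement
// integer, fill the bits above the bit width with the sign bit to recover the
// original negative value. Assumes 1 <= bitWidth <= 64.
int64_t signExtend(uint64_t packed, uint8_t bitWidth) {
    uint64_t signBit = 1ull << (bitWidth - 1);
    if (packed & signBit) {
        packed |= ~((signBit << 1) - 1); // set all bits above bitWidth
    }
    return static_cast<int64_t>(packed);
}
// e.g. signExtend(0b101, 3) == -3, since 101 is -3 in 3-bit two's complement.
```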
The BooleanCompression class is used just for decompression via the same interface, since boolean values are stored packed in memory in BoolColumnChunk, unlike CompressedColumnChunk which stores values uncompressed in memory and compresses them when writing the data (to avoid having to re-compress when collecting data).
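Since booleans are already bit-packed in memory, the decompression side is essentially just a bit lookup, something like this sketch (illustrative, not the exact BoolColumnChunk code):

```cpp
#include <cstdint>

// Illustrative sketch: reading boolean value `posInPage` from a page that stores
// booleans one bit per value, which is why BooleanCompression only needs to
// implement the decompression half of the interface.
bool readBool(const uint8_t* page, uint64_t posInPage) {
    return (page[posInPage / 8] >> (posInPage % 8)) & 1;
}
```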
Integer bitpacking isn't currently fully working, so I have integers just using "CopyCompression" instead. Partly this is because re-compression and relocation aren't implemented when we update with data that can't fit inside the current bit width, but there are also some tests which aren't triggering the exceptions that check for that case.

TODO (remaining items to be moved to an issue when this is merged):