
Abstract Compression interface and initial integer packing #2004

Merged 1 commit into master on Sep 25, 2023

Conversation

@benjaminwinger (Collaborator) commented Sep 7, 2023

So far this just sets up an interface for compressing data on disk. Edit: the final version adds integer bitpacking for int32, int64, uint32 and uint64.

Compressed data must have a fixed number of values per page, but the number of values can vary per ColumnChunk (now stored in the ColumnChunkMetadata, where before it was a fixed value for a given type).
Additional compression-specific metadata is stored in a small fixed-sized buffer (currently just one byte) inside the ColumnChunkMetadata's new CompressionMetadata substructure.
Compressed data needs to be aligned to the start of each page, since data is read relative to the beginning of a page, so the functions called to compress data work one page at a time.
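As a sketch of why the fixed per-page value count matters: with a fixed bit width, the page holding any value index can be computed directly. The names, page size, and 32-value rounding below are illustrative assumptions, not Kuzu's actual API:

```cpp
#include <cstdint>

// Illustrative sketch (names and constants are assumptions, not Kuzu's API):
// with a fixed bit width, each page holds a fixed number of values, so the
// page containing any value index can be computed without scanning.
constexpr uint64_t PAGE_SIZE_BYTES = 4096;

uint64_t numValuesPerPage(uint64_t bitWidth) {
    // fastpack/fastunpack operate on 32-value groups, so round down to a
    // multiple of 32; leftover bits at the end of the page are padding.
    return (PAGE_SIZE_BYTES * 8 / bitWidth) / 32 * 32;
}

uint64_t pageForValue(uint64_t valueIdx, uint64_t bitWidth) {
    return valueIdx / numValuesPerPage(bitWidth);
}
```

This is also why compressed data is aligned to page starts: a reader can jump straight to `pageForValue(i, bitWidth)` without decompressing preceding pages.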

I've implemented an IntegerBitpacking class for the interface with an implementation using FastPFOR's fastpack/fastunpack functions, combined with DuckDB's SignExtend function used for negative values. Currently this only works on 32 and 64-bit integers as the interface for 16-bit fastpack/fastunpack is separate, but that should still be relatively simple to set up.
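As a rough sketch of the sign-extension step (modelled on DuckDB's SignExtend, but an approximation rather than the exact code): fastunpack yields the low `bitWidth` bits zero-extended, so negative values need their sign bit propagated into the upper bits:

```cpp
#include <cstdint>

// Approximate sketch of sign extension after unpacking (modelled on
// DuckDB's SignExtend; not the exact implementation). fastunpack returns
// the low `bitWidth` bits zero-extended, so a packed negative value must
// have its sign bit propagated into the upper bits.
int64_t signExtend(uint64_t value, uint8_t bitWidth) {
    uint64_t shift = 64 - bitWidth;
    // Move the packed sign bit up to bit 63, then arithmetic-shift back down.
    return static_cast<int64_t>(value << shift) >> shift;
}
```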

The BooleanCompression class is used only for decompression via the same interface: boolean values are stored packed in memory in BoolColumnChunk, unlike CompressedColumnChunk, which stores values uncompressed in memory and compresses them only when writing to disk (to avoid re-compressing while collecting data).

Integer bitpacking isn't fully working yet, so integers currently fall back to CopyCompression. This is partly because re-compression and relocation aren't implemented for updates whose values don't fit in the current bit width, but some tests also fail to trigger the exceptions that check for that case.

TODO (to be moved to an issue when this is merged):

  • INT16 and INT8 bitpacking
  • Re-compress and relocate values which don't fit in the existing chunk during modifications (either due to difference in bit width, or because the chunk was packed without support for signed values).
  • InternalID could be bitpacked, since it's stored as offset_t (uint32_t), but would need an extra layer to handle widening the results into the internal_id_t struct.
  • Extra values could be packed into the end of each page when the page does not divide evenly into 32-value chunks.
  • Small reads/writes to bitpacked data could be optimized to work on individual data instead of relying on packing/unpacking a 32-value chunk to a temporary array.
  • BooleanBitpacking could make use of integer bitpacking functions with a bitwidth of 1 which might be faster.
  • Set up some small benchmarks (maybe using google benchmark, essentially what I did when comparing implementations at the start) so that we can do some fast comparisons of the compression/decompression speed without having to deal with the overhead of the rest of the system (not that such integrated benchmarks aren't also useful).
  • Remaining bitpacking bugs? There are some tests still failing for unexplained reasons when bitpacking is enabled for INT64/INT32.

@benjaminwinger changed the title from implemented Abstract Compression interface and initial integer packing to Abstract Compression interface and initial integer packing on Sep 7, 2023
@benjaminwinger (Collaborator, Author)

I just noticed that FastPFOR's bitpackingaligned is 17,000 lines of code. We're not even using it right now, so I've removed the 24 and 32-bit versions. Alternatively, DuckDB implemented a 16-bit version based on the fastpack/fastunpack in bitpacking.hpp. It might be worth benchmarking and comparing them, but I think DuckDB's version would at least be a lot smaller, since bitpackingaligned.cpp with just the 8 and 16-bit versions is still 5,000 lines of code.

@benjaminwinger (Collaborator, Author)

I think the remaining issue is with newly created pages, as they don't have the header storing the type of compression.
Admittedly that header isn't really necessary at this point and could be moved into the ColumnChunkMetadata, but the same issue will occur with integer bitpacking, where the bit width and whether there are negative values need to be stored somewhere accessible when writing to new pages (though that doesn't really extend to other ColumnChunks and compression types).

I think for now the most straightforward solution would be to have a small buffer (even just one byte) of reserved data in the ColumnChunkMetadata which can be used by the compression algorithm to store and access that sort of thing, and to drop the idea of storing extra metadata along with the data.
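A rough sketch of that reserved-byte idea (names and bit layout are illustrative assumptions, not the final design): for bitpacking, the single byte can encode the bit width plus a has-negatives flag:

```cpp
#include <cstdint>

// Rough sketch of the reserved-byte idea (names and bit layout are
// illustrative assumptions, not the final design). For bitpacking, one
// byte can encode the bit width plus a has-negatives flag.
struct CompressionMetadata {
    uint8_t data; // algorithm-specific; opaque to the rest of the system
};

uint8_t encodeBitpacking(uint8_t bitWidth, bool hasNegatives) {
    // Bit widths for types up to 64 bits fit in 7 bits, leaving the
    // top bit free for the has-negatives flag.
    return static_cast<uint8_t>(bitWidth | (hasNegatives ? 0x80 : 0));
}

uint8_t decodeBitWidth(CompressionMetadata m) { return m.data & 0x7F; }
bool decodeHasNegatives(CompressionMetadata m) { return (m.data & 0x80) != 0; }
```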

@codecov bot commented Sep 8, 2023

Codecov Report

Attention: 70 lines in your changes are missing coverage. Please review.

Comparison is base (c22396f) 90.25% compared to head (54d4b6e) 90.21%.
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2004      +/-   ##
==========================================
- Coverage   90.25%   90.21%   -0.05%     
==========================================
  Files         963      966       +3     
  Lines       34654    35019     +365     
==========================================
+ Hits        31278    31592     +314     
- Misses       3376     3427      +51     
Files Coverage Δ
src/common/file_utils.cpp 77.64% <ø> (ø)
src/include/common/file_utils.h 100.00% <ø> (ø)
src/include/storage/local_table.h 100.00% <ø> (ø)
src/include/storage/store/column_chunk.h 90.19% <100.00%> (-3.93%) ⬇️
src/include/storage/store/node_column.h 100.00% <100.00%> (ø)
src/include/storage/store/sign_extend.h 100.00% <100.00%> (ø)
src/include/storage/store/string_column_chunk.h 100.00% <ø> (ø)
src/include/storage/store/var_list_column_chunk.h 100.00% <ø> (ø)
src/include/storage/store/var_list_node_column.h 100.00% <ø> (ø)
src/storage/local_table.cpp 94.17% <100.00%> (+4.21%) ⬆️
... and 7 more

... and 3 files with indirect coverage changes


@ray6080 (Contributor) left a comment

In the middle of review. As discussed, here are the TODOs for this PR:

  1. Move from PhysicalMapping to a functor-based design.
  2. Fix compilation on M-series chip Macs.
  3. Fix remaining bugs.
  4. Get benchmark results on COPY performance and compressed file sizes.
  5. Increment the project version so we can trigger a re-run of benchmark loading.
  6. Open an issue for adding a database-level ENABLE_COMPRESSION config option.

Let me know when you finish those todos, and we can do another round. Thanks!

@benjaminwinger force-pushed the compression branch 2 times, most recently from 0d3327e to 95da515 on September 12, 2023
@benjaminwinger (Collaborator, Author)

Copy benchmarks:

Unfortunately, reading and parsing the CSV files is probably the bulk of the time spent copying at the moment, given the large difference in speed between this and the scan benchmarks, and between 0 and 64-bit values. Still, there is enough of a difference between the results to draw some conclusions.

The dataset I used was generated with 1,000,000 nodes with an id and a random signed integer. The results are an average of 50 runs, benchmarked using hyperfine and kuzu_shell.

Master Branch

64-bit values: 16MB data, 588.2 ms ± 4.7 ms
0-bit values: 16MB data, 417.8 ms ± 2.6 ms

Uncompressed

64-bit values: 15MB data, 592.6 ms ± 6.0 ms
0-bit values: 15MB data, 422.9 ms ± 4.0 ms

Bitpacking

Note that the ID column always has a bit width of 20 for 1,000,000 nodes, so we're really dealing with 84/52/20 bits of data, compared to 128 bits in the above cases:

64-bit values: 10MB data, 596.1 ms ± 5.3 ms
32-bit values: 6.5MB data, 501.7 ms ± 3.7 ms
0-bit values: 2.7MB data, 434.1 ms ± 2.6 ms
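The 20-bit ID width falls out of the maximum value: sequential IDs up to 999,999 need 20 bits, since 2^20 = 1,048,576. A minimal sketch of the computation (illustrative, not Kuzu's code):

```cpp
#include <cstdint>

// Minimal sketch (illustrative, not Kuzu's code): the bit width for an
// unsigned column is the number of bits needed to represent its maximum
// value, e.g. IDs up to 999,999 need 20 bits since 2^20 = 1,048,576.
uint8_t bitsNeeded(uint64_t maxValue) {
    uint8_t bits = 0;
    while (maxValue > 0) {
        bits++;
        maxValue >>= 1;
    }
    return bits;
}
```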

Scan Benchmarks

It was harder to see the differences here on a small dataset, so I ended up benchmarking a dataset with 100,000,000 nodes.
bitpacking-benchmarks.pdf

There is an odd outlier in the scan benchmarks: scanning positive 63-bit numbers is significantly slower than scanning a mix of positive and negative 63-bit numbers, where I would expect it to be the other way around, given how unpacking signed numbers is implemented. It's possible that this is related to an outstanding bug in the implementation.

@ray6080 (Contributor) commented Sep 19, 2023

Can you document the TODOs following this PR somewhere? In my mind, we should work on int16 and perhaps int8. We should also fix updating to a larger bit width and updating unsigned values to signed.

@benjaminwinger (Collaborator, Author)

> Can you document TODOs following this PR somewhere?

I've added them to the PR description.

@benjaminwinger force-pushed the compression branch 7 times, most recently from d24555e to 84923f1 on September 25, 2023
@benjaminwinger benjaminwinger merged commit da22d9e into master Sep 25, 2023
11 checks passed
@benjaminwinger benjaminwinger deleted the compression branch September 25, 2023 19:51