Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dictionary compression #2408

Merged
merged 3 commits into from
Nov 18, 2023
Merged

Dictionary compression #2408

merged 3 commits into from
Nov 18, 2023

Conversation

benjaminwinger
Copy link
Collaborator

Note that this depends on #2304

This is implemented inside of StringColumn and StringColumnChunk. With the string column layout being hard-coded into the class, it doesn't really seem to fit the abstractions used for the compression of single chunks.

Implemented in two parts:

  1. When copying to the StringColumnChunk, values are de-duplicated as they are stored. This significant memory overhead for data without much duplication, since every string inserted is stored both in the chunk and in an unordered_map that's used to look up the indices of elements already inserted (a copy must be made since references won't survive when the data chunk is resized).
  2. Before flushing to the disk, the prepareForFlush method goes over the data again and de-duplicates it. In this case references to the original data are stable and we can use string_view instead of string. This is necessary if we want to remove unused entries, e.g. when strings are read from the disk, existing values are modified, and the original values are no longer used.
  3. I've also optimized the scans to avoid both scanning strings multiple times, and copying strings multiple times into a single ValueVector. This is independent of the other parts but speeds up scans for duplicated data.

If the extra memory overhead when copying is a concern, we could remove (1), since (2) is sufficient for on-disk de-duplication, but (1) should speed up copying when there is a lot of duplication (I don't have any benchmarks for this though, and will do some benchmarking and profiling to compare the costs if I have time).

It would also be possible to reduce the memory overhead of (1) by using string_views and updating them to point to the new buffer after a resize, but that will also add some new costs.

src/storage/store/string_column_chunk.cpp Outdated Show resolved Hide resolved
src/include/storage/store/string_column_chunk.h Outdated Show resolved Hide resolved
src/storage/store/string_column.cpp Outdated Show resolved Hide resolved
src/storage/store/string_column_chunk.cpp Outdated Show resolved Hide resolved
src/storage/store/string_column_chunk.cpp Show resolved Hide resolved
src/storage/store/string_column_chunk.cpp Show resolved Hide resolved
@benjaminwinger benjaminwinger force-pushed the dictionary-compression branch 2 times, most recently from d13813d to ed46f11 Compare November 14, 2023 23:08
@benjaminwinger benjaminwinger force-pushed the string-serialization branch 2 times, most recently from 1518e90 to a92953d Compare November 15, 2023 17:26
@benjaminwinger benjaminwinger force-pushed the string-serialization branch 5 times, most recently from feb92bb to 54a2273 Compare November 16, 2023 19:59
Base automatically changed from string-serialization to master November 16, 2023 21:54
@benjaminwinger benjaminwinger force-pushed the dictionary-compression branch 2 times, most recently from 15c1ed8 to 64a5a36 Compare November 17, 2023 21:55
Copy link

codecov bot commented Nov 17, 2023

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (7ef3d5c) 90.86% compared to head (92bd3e4) 91.39%.
Report is 13 commits behind head on master.

Files Patch % Lines
src/storage/store/string_column_chunk.cpp 92.30% 3 Missing ⚠️
src/common/types/ku_string.cpp 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2408      +/-   ##
==========================================
+ Coverage   90.86%   91.39%   +0.52%     
==========================================
  Files        1023     1023              
  Lines       37044    37284     +240     
==========================================
+ Hits        33659    34074     +415     
+ Misses       3385     3210     -175     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benjaminwinger benjaminwinger merged commit 4bf92d6 into master Nov 18, 2023
12 checks passed
@benjaminwinger benjaminwinger deleted the dictionary-compression branch November 18, 2023 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants