Dictionary compression #2408

benjaminwinger · 2023-11-14T15:40:36Z

Note that this depends on #2304

This is implemented inside of StringColumn and StringColumnChunk. With the string column layout being hard-coded into the class, it doesn't really seem to fit the abstractions used for the compression of single chunks.

Implemented in two parts:

When copying to the StringColumnChunk, values are de-duplicated as they are stored. This significant memory overhead for data without much duplication, since every string inserted is stored both in the chunk and in an unordered_map that's used to look up the indices of elements already inserted (a copy must be made since references won't survive when the data chunk is resized).
Before flushing to the disk, the prepareForFlush method goes over the data again and de-duplicates it. In this case references to the original data are stable and we can use string_view instead of string. This is necessary if we want to remove unused entries, e.g. when strings are read from the disk, existing values are modified, and the original values are no longer used.
I've also optimized the scans to avoid both scanning strings multiple times, and copying strings multiple times into a single ValueVector. This is independent of the other parts but speeds up scans for duplicated data.

If the extra memory overhead when copying is a concern, we could remove (1), since (2) is sufficient for on-disk de-duplication, but (1) should speed up copying when there is a lot of duplication (I don't have any benchmarks for this though, and will do some benchmarking and profiling to compare the costs if I have time).

It would also be possible to reduce the memory overhead of (1) by using string_views and updating them to point to the new buffer after a resize, but that will also add some new costs.

src/include/storage/store/string_column_chunk.h

src/storage/store/string_column_chunk.cpp

src/include/storage/store/string_column_chunk.h

src/storage/store/string_column.cpp

src/storage/store/string_column_chunk.cpp

codecov · 2023-11-17T23:17:18Z

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (7ef3d5c) 90.86% compared to head (92bd3e4) 91.39%.
Report is 13 commits behind head on master.

Files	Patch %	Lines
src/storage/store/string_column_chunk.cpp	92.30%	3 Missing ⚠️
src/common/types/ku_string.cpp	75.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2408      +/-   ##
==========================================
+ Coverage   90.86%   91.39%   +0.52%     
==========================================
  Files        1023     1023              
  Lines       37044    37284     +240     
==========================================
+ Hits        33659    34074     +415     
+ Misses       3385     3210     -175

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

benjaminwinger force-pushed the dictionary-compression branch from 0b3cb66 to 4307f0c Compare November 14, 2023 15:54

ray6080 approved these changes Nov 14, 2023

View reviewed changes

benjaminwinger force-pushed the dictionary-compression branch 2 times, most recently from d13813d to ed46f11 Compare November 14, 2023 23:08

benjaminwinger force-pushed the string-serialization branch 2 times, most recently from 1518e90 to a92953d Compare November 15, 2023 17:26

benjaminwinger force-pushed the dictionary-compression branch from ed46f11 to 7f32cac Compare November 15, 2023 21:40

benjaminwinger force-pushed the string-serialization branch 5 times, most recently from feb92bb to 54a2273 Compare November 16, 2023 19:59

Base automatically changed from string-serialization to master November 16, 2023 21:54

benjaminwinger mentioned this pull request Nov 16, 2023

Always write a whole page at a time when writing compressed data #2438

Merged

benjaminwinger force-pushed the dictionary-compression branch 2 times, most recently from 15c1ed8 to 64a5a36 Compare November 17, 2023 21:55

Add string dictionary compression

d019436

benjaminwinger force-pushed the dictionary-compression branch from 64a5a36 to 9ace9bf Compare November 17, 2023 23:07

Only scan duplicated string values once

a66a925

benjaminwinger force-pushed the dictionary-compression branch from 9ace9bf to a66a925 Compare November 18, 2023 03:14

Trigger reload

92bd3e4

benjaminwinger merged commit 4bf92d6 into master Nov 18, 2023
12 checks passed

benjaminwinger deleted the dictionary-compression branch November 18, 2023 18:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dictionary compression #2408

Dictionary compression #2408

benjaminwinger commented Nov 14, 2023

codecov bot commented Nov 17, 2023 •

edited

Loading

Dictionary compression #2408

Dictionary compression #2408

Conversation

benjaminwinger commented Nov 14, 2023

codecov bot commented Nov 17, 2023 • edited Loading

Codecov Report

codecov bot commented Nov 17, 2023 •

edited

Loading