
String serialization #2304

Merged: 2 commits merged into master from string-serialization on Nov 16, 2023
Conversation

@benjaminwinger (Collaborator) commented Oct 30, 2023

This restructures string column storage to use an index column (UINT32), an offset column (UINT64), and a data column (UINT8). By storing strings as a plain sequence of characters, I was able to remove the reliance on InMemOverflowFiles and allow strings to be split across multiple pages, which removes the restriction that strings must be smaller than a page in storage (I think there are still in-memory limits in ValueVectors which I haven't looked into removing) and lays the foundation for string dictionary compression (see the string dictionary compression section in #1474).
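A minimal sketch of how a lookup could work with this layout (the struct and field names are hypothetical, and whether the offset column records string start or end positions is an assumption on my part, not something stated in this PR):

```cpp
// Hypothetical in-memory model of the three-column layout: `index` maps a node
// offset to a dictionary entry, `offsets` gives the end of each entry in `data`,
// and `data` holds the raw characters back to back (free to span many pages).
#include <cstdint>
#include <string>
#include <vector>

struct StringDictionarySketch {
    std::vector<uint32_t> index;   // one entry per node value (UINT32 index column)
    std::vector<uint64_t> offsets; // offsets[i] = end of entry i in `data` (UINT64 offset column)
    std::vector<uint8_t> data;     // concatenated characters (UINT8 data column)

    std::string lookup(uint64_t nodeOffset) const {
        uint32_t entry = index[nodeOffset];
        uint64_t start = entry == 0 ? 0 : offsets[entry - 1];
        uint64_t end = offsets[entry];
        return std::string(data.begin() + start, data.begin() + end);
    }
};
```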

This is implemented by creating a variant of NodeColumn called AuxiliaryNodeColumn, which handles data stored adjacent to a node group and which can grow beyond NODE_GROUP_SIZE elements (BaseNodeColumn handles the shared functionality). I primarily wanted to separate out the logic that handles the offsets, to make sure that when fetching string data for a node group we don't use any of the functions which assume that each chunk contains NODE_GROUP_SIZE elements.
I think this could also be used for VarListNodeColumn, but I haven't looked into that deeply.
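For reference, a rough sketch of the class split described above (the class names are from this PR; the member comments are my own summary, not actual code):

```cpp
// BaseNodeColumn: shared plumbing (page cursors, buffer manager access, etc.).
class BaseNodeColumn {
    // common read/write helpers live here
};

// NodeColumn: assumes each chunk holds exactly NODE_GROUP_SIZE elements, so an
// element's position can be derived directly from its node offset.
class NodeColumn : public BaseNodeColumn {};

// AuxiliaryNodeColumn: data stored adjacent to a node group (string offsets and
// characters); a chunk may grow beyond NODE_GROUP_SIZE elements, so positions
// must come from the index/offset columns instead.
class AuxiliaryNodeColumn : public BaseNodeColumn {};
```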

Updates that replace values are not very efficient at the moment:

  • Out-of-place updates may be triggered even when the data doesn't need to grow in size. We could flush the data back to the original location when this is detected. (Edit: this causes problems; I'm not sure why.)
  • [ ] Out-of-place updates that modify existing elements will leave unused entries in the string dictionary, since the old values are scanned into memory, the updates are appended to the end, and the resulting data is then flushed directly to disk (see the sketch after this list). Dictionary de-duplication compression would fix this, but even for uncompressed string data we would probably want to write a custom flushBuffer function which prunes unused values. (This is handled by the dictionary compression work, which I'm not including in this PR yet.)
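A toy illustration of the second point, under a simplified in-memory model (hypothetical values; the real code works on column chunks, not std::vectors):

```cpp
// Out-of-place update: the new string is appended to the dictionary and the
// index is repointed, so the old bytes stay behind unused until pruned.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> dictionary = {"alice", "bob"}; // existing entries
    std::vector<uint32_t> index = {0, 1};                   // node -> dictionary entry

    // Update node 1's value: append rather than rewrite in place.
    dictionary.push_back("robert");
    index[1] = 2;

    // "bob" is no longer referenced but still occupies space; only a pruning
    // flush (or dictionary de-duplication) would reclaim it.
    std::cout << "node 1 now reads: " << dictionary[index[1]] << "\n";
}
```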

@ray6080 (Contributor) commented Oct 30, 2023

By storing strings as a plain sequence of characters, I was able to remove the reliance on InMemOverflowFiles and allow strings to be split across multiple pages, which removes the restriction that strings must be smaller than a page in storage (I think there are still in-memory limits in ValueVectors which I haven't looked into removing)

The in-memory limit in ValueVectors is imposed by the MemoryManager; we should be able to relax it to 256KB for now.

@ray6080 ray6080 self-requested a review October 30, 2023 16:38
@benjaminwinger benjaminwinger force-pushed the string-serialization branch 3 times, most recently from 29a006b to 21fb0d9 Compare November 1, 2023 14:14

codecov bot commented Nov 1, 2023

Codecov Report

Attention: 48 lines in your changes are missing coverage. Please review.

Comparison is base (093e627) 90.54% compared to head (54a2273) 90.42%.

Files                                       Patch %   Lines
src/storage/store/column.cpp                75.25%    24 Missing ⚠️
src/storage/compression/compression.cpp     68.33%    19 Missing ⚠️
src/storage/store/string_column.cpp         95.78%     4 Missing ⚠️
src/storage/store/string_column_chunk.cpp   97.29%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2304      +/-   ##
==========================================
- Coverage   90.54%   90.42%   -0.13%     
==========================================
  Files        1020     1020              
  Lines       36515    36641     +126     
==========================================
+ Hits        33064    33131      +67     
- Misses       3451     3510      +59     


@benjaminwinger (Collaborator, Author) commented Nov 9, 2023

I've done my best to reduce the overhead when scanning (a large part of it now seems to come from the construction and destruction of the std::function for the BufferManager::optimisticRead call). Unfortunately, having both offset and data columns means we do roughly twice as many scans, since for each string we scan the offsets once and the data once. As a result, even after the optimizations, bulk scanning strings is about twice as slow as before.

There will be some improvement for duplicated string values once dictionary compression is added, as we can avoid scanning the duplicates multiple times, but that won't help for unduplicated data.

Ideally, I think we want to recognize when the offsets and string data to be scanned are adjacent, and scan them together or in batches. The fact that we scan values in chunks significantly smaller than the node group size doesn't help, as there's no guarantee that the dictionary entries for that chunk will be adjacent. However, as long as values aren't re-written, and without dictionary compression (or with data which doesn't benefit from it even once it's implemented and enabled), it's likely that adjacent string indices refer to adjacent data in the offset and data columns.
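A minimal sketch of that batching idea (the helper name and callback shape are hypothetical, not the PR's actual API):

```cpp
// Walk the string indices for a chunk, detect runs of consecutive dictionary
// entries, and issue one offset/data read per run instead of one per string.
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback that reads dictionary entries [first, first + count).
using read_entries_func_t = std::function<void(uint32_t first, uint32_t count)>;

void scanInBatches(const std::vector<uint32_t>& stringIndices, const read_entries_func_t& readEntries) {
    size_t pos = 0;
    while (pos < stringIndices.size()) {
        uint32_t runStart = stringIndices[pos];
        uint32_t runLength = 1;
        // Extend the run while the next index is exactly one past the previous one.
        while (pos + runLength < stringIndices.size() &&
               stringIndices[pos + runLength] == runStart + runLength) {
            runLength++;
        }
        readEntries(runStart, runLength); // one batched read covers the whole run
        pos += runLength;
    }
}
```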

As an aside, I think this also means we would get a noticeable speed boost for strings without dictionary compression if we supported storing them in two columns (offset+length and data) instead of three.

I also wonder what performance boost we would get by switching from std::function to C-style function pointers with an extra argument for passing contextual information (e.g. the Column object), as I believe those should have noticeably less overhead. Both on master and with these changes, the many small reads required to read individual strings mean that the scan overhead dominates (though not by a large margin) the time spent actually reading, and a large part of that overhead seems to be related to std::function. (Edit: Nope, performance was slightly worse with that change. Part of the issue, or maybe even most of it, is probably the functor used to read from the frame; we have two levels of std::functions.)
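A sketch of the kind of change being described here, with placeholder names (this is not the actual BufferManager API):

```cpp
// C-style callback plus a void* context instead of a capturing std::function,
// avoiding std::function construction/destruction on every read call.
#include <cstdint>

struct ScanContext {
    uint64_t numValuesRead = 0; // hypothetical stand-in for e.g. the Column object
};

using read_from_frame_fn = void (*)(const uint8_t* frame, uint64_t numValues, void* context);

void optimisticReadSketch(const uint8_t* frame, uint64_t numValues, read_from_frame_fn fn, void* context) {
    fn(frame, numValues, context); // no std::function on this path
}

static void countingReader(const uint8_t* /*frame*/, uint64_t numValues, void* context) {
    static_cast<ScanContext*>(context)->numValuesRead += numValues;
}

int main() {
    ScanContext ctx;
    uint8_t frame[16] = {};
    optimisticReadSketch(frame, 16, countingReader, &ctx);
    return ctx.numValuesRead == 16 ? 0 : 1;
}
```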

@benjaminwinger (Collaborator, Author) commented Nov 10, 2023

Before spending too much time trying to combine adjacent string offset/data reads, I've benchmarked both this PR and the three as-yet-unpublished parts I've written on top of it (enabling compression for string indices and offsets, reading strings directly into the result vector, and dictionary compression).

Copy Benchmarks

Copy benchmarks are done using hyperfine and the python benchmark/serializer.py tool. They measure the time taken to serialize the entirety of the 1GB ldbc dataset.

Baseline

b698dca

Benchmark 1: python serializer.py ldbc dataset/ldbc-1/ master /tmp/foo
  Time (mean ± σ):     12.763 s ±  0.106 s    [User: 35.962 s, System: 15.493 s]
  Range (min … max):   12.612 s … 12.972 s    10 runs

Data size: 1153272 bytes

String layout changes only

35b6ed0

Benchmark 1: python serializer.py ldbc dataset/ldbc-1/ strings /tmp/foo
  Time (mean ± σ):     12.941 s ±  0.125 s    [User: 36.114 s, System: 16.169 s]
  Range (min … max):   12.849 s … 13.240 s    10 runs

Data size: 1092528 bytes (new layout lets strings cross page boundaries, so there is less unused space).

Layout changes with compressed index and offset columns

Benchmark 1: python serializer.py ldbc dataset/ldbc-1/ offset-comp /tmp/foo
  Time (mean ± σ):     12.755 s ±  0.040 s    [User: 36.336 s, System: 16.141 s]
  Range (min … max):   12.706 s … 12.823 s    10 runs

Data size: 1012132 bytes

Dictionary compression

Still seems to have a bug with var lists of strings, but that doesn't actually affect this dataset.

Benchmark 1: python serializer.py ldbc dataset/ldbc-1/ dict /tmp/foo
  Time (mean ± σ):     12.532 s ±  0.082 s    [User: 37.954 s, System: 15.226 s]
  Range (min … max):   12.389 s … 12.655 s    10 runs

Data size: 917208 bytes

Query benchmarks

Queries are from the benchmark/queries directory and are run on the ldbc 1GB dataset.
For the most part, there is very little change, other than the overhead on string scans I mentioned earlier. That extra cost is largely recovered with dictionary compression when there is a lot of duplication, but not so much on columns without much duplication (e.g. var_size_seq_scan/q20 which scans the comment.locationIP column).
benchmarks.pdf

Insert benchmarks

Small increase in the time to insert. I'll try profiling, but I suspect it's just caused by the extra column modification.

Update: Profiling confirms that it's basically due to the extra column, including an extra metadata update.

Note that the maximum in some of these benchmarks is an outlier caused by filesystem caching.

Inserting 3000 nodes in a single transaction via the CLI

Before:

Time (mean ± σ):      4.028 s ±  0.009 s    [User: 1.939 s, System: 0.511 s]
Range (min … max):    4.013 s …  4.041 s    10 runs

With changes:

Time (mean ± σ):      4.049 s ±  0.023 s    [User: 1.961 s, System: 0.506 s]
Range (min … max):    4.031 s …  4.090 s    10 runs

Inserting 3000 nodes in individual transactions via the CLI

Before:

Time (mean ± σ):      4.844 s ±  0.027 s    [User: 2.106 s, System: 1.230 s]
Range (min … max):    4.799 s …  4.902 s    10 runs

With Changes:

Time (mean ± σ):      5.219 s ±  0.050 s    [User: 2.171 s, System: 1.546 s]
Range (min … max):    5.168 s …  5.330 s    10 runs

Inserting 3000 nodes in individual transactions into the ldbc-1 comment table

Before:

Time (mean ± σ):     29.437 s ±  0.613 s    [User: 17.245 s, System: 11.557 s]
Range (min … max):   29.155 s … 31.173 s    10 runs

With changes:

Time (mean ± σ):     24.159 s ±  0.218 s    [User: 16.754 s, System: 6.652 s]
Range (min … max):   23.956 s … 24.666 s    10 runs

Inserting 3000 nodes in one transaction into the ldbc-10 comment table

Before:

Time (mean ± σ):      2.797 s ±  0.009 s    [User: 0.680 s, System: 0.580 s]
Range (min … max):    2.788 s …  2.811 s    10 runs

With Changes:

Time (mean ± σ):      2.858 s ±  0.009 s    [User: 0.709 s, System: 0.607 s]
Range (min … max):    2.842 s …  2.868 s    10 runs

Inserting 3000 nodes in individual transactions into the ldbc-10 comment table

Before:

Time (mean ± σ):     35.579 s ±  7.439 s    [User: 19.041 s, System: 14.205 s]
Range (min … max):   32.063 s … 56.632 s    10 runs

With Changes:

Time (mean ± σ):     28.485 s ±  2.402 s    [User: 16.942 s, System: 7.253 s]
Range (min … max):   25.517 s … 33.060 s    10 runs

@benjaminwinger benjaminwinger force-pushed the string-serialization branch 2 times, most recently from 45535e4 to 9e03541 Compare November 13, 2023 20:30
@benjaminwinger benjaminwinger force-pushed the string-serialization branch 2 times, most recently from 1518e90 to a92953d Compare November 15, 2023 17:26
@@ -5,7 +5,7 @@
namespace kuzu {
namespace common {

-constexpr char KUZU_VERSION[] = "v0.0.12.1";
+constexpr char KUZU_VERSION[] = "v0.0.12.2";
Contributor

Not related to this PR, but I wonder if we can replace this constant with the project version from our CMake file.

@benjaminwinger (Collaborator, Author) Nov 16, 2023

We can: #2428

src/include/storage/storage_info.h (outdated review thread, resolved)

using read_values_to_page_func_t =
    std::function<void(uint8_t* frame, PageElementCursor& pageCursor, uint8_t* result,
        uint32_t posInResult, uint64_t numValues, const CompressionMetadata& metadata)>;
// This is a special usage for the `batchLookup` interface.
using batch_lookup_func_t = read_values_to_page_func_t;

struct ReadState {
Contributor

ReadState -> ColumnReadState. I feel the latter is better in terms of ambiguity, as we already have other read states.

Collaborator Author

How about Column::ReadState? That way, it can be kept entirely internal to the class.
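A minimal sketch of what the nested form could look like (placeholder members, not the real ones):

```cpp
#include <cstdint>

class Column {
public:
    // Nested, so callers refer to Column::ReadState and the name stays internal
    // to the column hierarchy rather than adding a new top-level type.
    struct ReadState {
        uint64_t numValuesToScan = 0;    // placeholder field
        uint64_t startOffsetInChunk = 0; // placeholder field
    };
};
```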

Additional review threads (outdated, resolved):
  • src/include/storage/store/column.h
  • test/storage/compression_test.cpp
  • src/include/storage/store/column.h
  • src/storage/compression/compression.cpp
  • src/storage/compression/compression.cpp
  • src/storage/store/column.cpp
@benjaminwinger benjaminwinger force-pushed the string-serialization branch 4 times, most recently from cb3eb25 to feb92bb Compare November 16, 2023 19:34
This sets up the storage layout for compressing strings, and removes the
limit on string size in storage.
@benjaminwinger benjaminwinger merged commit bb159fe into master Nov 16, 2023
11 of 12 checks passed
@benjaminwinger benjaminwinger deleted the string-serialization branch November 16, 2023 21:54