
String serialization #2304

Merged: 2 commits merged into master from string-serialization on Nov 16, 2023
Conversation

@benjaminwinger (Collaborator) commented Oct 30, 2023

This restructures string column storage to use an index column (UINT32), an offset column (UINT64), and a data column (UINT8). By storing strings as a plain sequence of characters, I was able to remove the reliance on InMemOverflowFiles and allow strings to be split across multiple pages, which removes the restriction that strings must be smaller than a page in storage (I think there are still in-memory limits in ValueVectors which I haven't looked into removing) and lays the foundation for string dictionary compression (see the string dictionary compression section in #1474).
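A minimal sketch of how a lookup could work with this layout (the struct and field names are hypothetical, and whether the offset column records string start or end positions is an assumption on my part, not something stated in this PR):

```cpp
// Hypothetical in-memory model of the three-column layout: `index` maps a node
// offset to a dictionary entry, `offsets` gives the end of each entry in `data`,
// and `data` holds the raw characters back to back (free to span many pages).
#include <cstdint>
#include <string>
#include <vector>

struct StringDictionarySketch {
    std::vector<uint32_t> index;   // one entry per node value (UINT32 index column)
    std::vector<uint64_t> offsets; // offsets[i] = end of entry i in `data` (UINT64 offset column)
    std::vector<uint8_t> data;     // concatenated characters (UINT8 data column)

    std::string lookup(uint64_t nodeOffset) const {
        uint32_t entry = index[nodeOffset];
        uint64_t start = entry == 0 ? 0 : offsets[entry - 1];
        uint64_t end = offsets[entry];
        return std::string(data.begin() + start, data.begin() + end);
    }
};
```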

This is implemented by creating a variant of NodeColumn called AuxiliaryNodeColumn, which handles data stored adjacent to a node group and which can grow beyond NODE_GROUP_SIZE elements (BaseNodeColumn handles the shared functionality). I primarily wanted to separate out the logic that handles the offsets, to make sure that when fetching string data for a node group we don't use any of the functions which assume that each chunk contains NODE_GROUP_SIZE elements.
I think this could also be used for VarListNodeColumn, but I haven't looked into that deeply.
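For reference, a rough sketch of the class split described above (the class names are from this PR; the member comments are my own summary, not actual code):

```cpp
// BaseNodeColumn: shared plumbing (page cursors, buffer manager access, etc.).
class BaseNodeColumn {
    // common read/write helpers live here
};

// NodeColumn: assumes each chunk holds exactly NODE_GROUP_SIZE elements, so an
// element's position can be derived directly from its node offset.
class NodeColumn : public BaseNodeColumn {};

// AuxiliaryNodeColumn: data stored adjacent to a node group (string offsets and
// characters); a chunk may grow beyond NODE_GROUP_SIZE elements, so positions
// must come from the index/offset columns instead.
class AuxiliaryNodeColumn : public BaseNodeColumn {};
```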

Updates that replace values are not very efficient at the moment:

  • Out-of-place updates may be triggered even when the data doesn't need to grow in size. We could flush the data back to the original location when this is detected. (Edit: this causes problems; I'm not sure why.)
  • [ ] Out-of-place updates that modify existing elements will leave unused entries in the string dictionary, since the old values are scanned into memory, the updates are appended to the end, and the resulting data is then flushed directly to disk (see the sketch after this list). Dictionary de-duplication compression would fix this, but even for uncompressed string data we would probably want to write a custom flushBuffer function which prunes unused values. (This is handled by the dictionary compression work, which I'm not including in this PR yet.)
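A toy illustration of the second point, under a simplified in-memory model (hypothetical values; the real code works on column chunks, not std::vectors):

```cpp
// Out-of-place update: the new string is appended to the dictionary and the
// index is repointed, so the old bytes stay behind unused until pruned.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> dictionary = {"alice", "bob"}; // existing entries
    std::vector<uint32_t> index = {0, 1};                   // node -> dictionary entry

    // Update node 1's value: append rather than rewrite in place.
    dictionary.push_back("robert");
    index[1] = 2;

    // "bob" is no longer referenced but still occupies space; only a pruning
    // flush (or dictionary de-duplication) would reclaim it.
    std::cout << "node 1 now reads: " << dictionary[index[1]] << "\n";
}
```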

@ray6080 (Contributor) commented Oct 30, 2023

By storing strings as a plain sequence of characters, I was able to remove the reliance on InMemOverflowFiles and allow strings to be split across multiple pages, which removes the restriction that strings must be smaller than a page in storage (I think there are still in-memory limits in ValueVectors which I haven't looked into removing)

The in-memory limit in ValueVectors is imposed by the MemoryManager; we should be able to relax it to 256KB for now.

@ray6080 ray6080 self-requested a review October 30, 2023 16:38
@benjaminwinger benjaminwinger force-pushed the string-serialization branch 3 times, most recently from 29a006b to 21fb0d9 Compare November 1, 2023 14:14

codecov bot commented Nov 1, 2023

Codecov Report

Attention: 48 lines in your changes are missing coverage. Please review.

Comparison is base (093e627) 90.54% compared to head (54a2273) 90.42%.

Files                                       Patch %   Lines
src/storage/store/column.cpp                75.25%    24 Missing ⚠️
src/storage/compression/compression.cpp     68.33%    19 Missing ⚠️
src/storage/store/string_column.cpp         95.78%     4 Missing ⚠️
src/storage/store/string_column_chunk.cpp   97.29%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2304      +/-   ##
==========================================
- Coverage   90.54%   90.42%   -0.13%     
==========================================
  Files        1020     1020              
  Lines       36515    36641     +126     
==========================================
+ Hits        33064    33131      +67     
- Misses       3451     3510      +59     


@benjaminwinger (Collaborator, Author) commented Nov 9, 2023

I've done my best to reduce the overhead when scanning (a large part of it now seems to come from the construction and destruction of the std::function for the BufferManager::optimisticRead call). Unfortunately, having both offset and data columns means we do roughly twice as many scans, since for each string we scan the offsets once and the data once. As a result, even after the optimizations, bulk scanning strings is about twice as slow as before.

There will be some improvement for duplicated string values once dictionary compression is added, as we can avoid scanning the duplicates multiple times, but that won't help for unduplicated data.

Ideally, I think we want to recognize when the offsets and string data to be scanned are adjacent, and scan them together or in batches. The fact that we scan values in chunks significantly smaller than the node group size doesn't help, as there's no guarantee that the dictionary entries for that chunk will be adjacent. However, as long as values aren't re-written, and without dictionary compression (or with data which doesn't benefit from it even once it's implemented and enabled), it's likely that adjacent string indices refer to adjacent data in the offset and data columns.
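A minimal sketch of that batching idea (the helper name and callback shape are hypothetical, not the PR's actual API):

```cpp
// Walk the string indices for a chunk, detect runs of consecutive dictionary
// entries, and issue one offset/data read per run instead of one per string.
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback that reads dictionary entries [first, first + count).
using read_entries_func_t = std::function<void(uint32_t first, uint32_t count)>;

void scanInBatches(const std::vector<uint32_t>& stringIndices, const read_entries_func_t& readEntries) {
    size_t pos = 0;
    while (pos < stringIndices.size()) {
        uint32_t runStart = stringIndices[pos];
        uint32_t runLength = 1;
        // Extend the run while the next index is exactly one past the previous one.
        while (pos + runLength < stringIndices.size() &&
               stringIndices[pos + runLength] == runStart + runLength) {
            runLength++;
        }
        readEntries(runStart, runLength); // one batched read covers the whole run
        pos += runLength;
    }
}
```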

As an aside, I think this also means we would get a noticeable speed boost for strings without dictionary compression if we supported storing them in two columns (offset+length and data) instead of three.

I also wonder what performance boost we would get by switching from std::function to C-style function pointers with an extra argument for passing contextual information (e.g. the Column object), as I believe those should have noticeably less overhead. Both on master and with these changes, the many small reads required to read individual strings mean that the scan overhead dominates (though not by a large margin) the time spent actually reading, and a large part of that overhead seems to be related to std::function. (Edit: Nope, performance was slightly worse with that change. Part of the issue, or maybe even most of it, is probably the functor used to read from the frame; we have two levels of std::functions.)
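A sketch of the kind of change being described here, with placeholder names (this is not the actual BufferManager API):

```cpp
// C-style callback plus a void* context instead of a capturing std::function,
// avoiding std::function construction/destruction on every read call.
#include <cstdint>

struct ScanContext {
    uint64_t numValuesRead = 0; // hypothetical stand-in for e.g. the Column object
};

using read_from_frame_fn = void (*)(const uint8_t* frame, uint64_t numValues, void* context);

void optimisticReadSketch(const uint8_t* frame, uint64_t numValues, read_from_frame_fn fn, void* context) {
    fn(frame, numValues, context); // no std::function on this path
}

static void countingReader(const uint8_t* /*frame*/, uint64_t numValues, void* context) {
    static_cast<ScanContext*>(context)->numValuesRead += numValues;
}

int main() {
    ScanContext ctx;
    uint8_t frame[16] = {};
    optimisticReadSketch(frame, 16, countingReader, &ctx);
    return ctx.numValuesRead == 16 ? 0 : 1;
}
```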

@benjaminwinger (Collaborator, Author) commented Nov 10, 2023

Before spending too much time trying to combine adjacent string offset/data reads, I've benchmarked both this PR and the three as-yet-unpublished parts I've written on top of it (enabling compression for string indices and offsets, reading strings directly into the result vector, and dictionary compression).

Copy Benchmarks

Copy benchmarks are done using hyperfine and the python benchmark/serializer.py tool. They measure the time taken to serialize the entirety of the 1GB ldbc dataset.

Baseline

b698dca

Benchmark 1: python serializer.py ldbc dataset/ldbc-1/ master /tmp/foo
  Time (mean ± σ):     12.763 s ±  0.106 s    [User: 35.962 s, System: 15.493 s]
  Range (min … max):   12.612 s … 12.972 s    10 runs

Data size: 1153272 bytes

String layout changes only

35b6ed0

Benchmark 1: python serializer.py ldbc dataset/ldbc-1/ strings /tmp/foo
  Time (mean ± σ):     12.941 s ±  0.125 s    [User: 36.114 s, System: 16.169 s]
  Range (min … max):   12.849 s … 13.240 s    10 runs

Data size: 1092528 bytes (new layout lets strings cross page boundaries, so there is less unused space).

Layout changes with compressed index and offset columns

Benchmark 1: python serializer.py ldbc dataset/ldbc-1/ offset-comp /tmp/foo
  Time (mean ± σ):     12.755 s ±  0.040 s    [User: 36.336 s, System: 16.141 s]
  Range (min … max):   12.706 s … 12.823 s    10 runs

Data size: 1012132 bytes

Dictionary compression

Still seems to have a bug with var lists of strings, but that doesn't actually affect this dataset.

Benchmark 1: python serializer.py ldbc dataset/ldbc-1/ dict /tmp/foo
  Time (mean ± σ):     12.532 s ±  0.082 s    [User: 37.954 s, System: 15.226 s]
  Range (min … max):   12.389 s … 12.655 s    10 runs

Data size: 917208 bytes

Query benchmarks

Queries are from the benchmark/queries directory and are run on the ldbc 1GB dataset.
For the most part, there is very little change, other than the overhead on string scans I mentioned earlier. That extra cost is largely recovered with dictionary compression when there is a lot of duplication, but not so much on columns without much duplication (e.g. var_size_seq_scan/q20 which scans the comment.locationIP column).
benchmarks.pdf

Insert benchmarks

Small increase in the time to insert. I'll try profiling, but I suspect it's just caused by the extra column modification.

Update: Profiling confirms that it's basically due to the extra column, including an extra metadata update.

Note that the maximum in some of these benchmarks is an outlier caused by filesystem caching.

Inserting 3000 nodes in a single transaction via the CLI

Before:

Time (mean ± σ):      4.028 s ±  0.009 s    [User: 1.939 s, System: 0.511 s]
Range (min … max):    4.013 s …  4.041 s    10 runs

With changes:

Time (mean ± σ):      4.049 s ±  0.023 s    [User: 1.961 s, System: 0.506 s]
Range (min … max):    4.031 s …  4.090 s    10 runs

Inserting 3000 nodes in individual transactions via the CLI

Before:

Time (mean ± σ):      4.844 s ±  0.027 s    [User: 2.106 s, System: 1.230 s]
Range (min … max):    4.799 s …  4.902 s    10 runs

With Changes:

Time (mean ± σ):      5.219 s ±  0.050 s    [User: 2.171 s, System: 1.546 s]
Range (min … max):    5.168 s …  5.330 s    10 runs

Inserting 3000 nodes in individual transactions into the ldbc-1 comment table

Before:

Time (mean ± σ):     29.437 s ±  0.613 s    [User: 17.245 s, System: 11.557 s]
Range (min … max):   29.155 s … 31.173 s    10 runs

With changes:

Time (mean ± σ):     24.159 s ±  0.218 s    [User: 16.754 s, System: 6.652 s]
Range (min … max):   23.956 s … 24.666 s    10 runs

Inserting 3000 nodes in one transaction into the ldbc-10 comment table

Before:

Time (mean ± σ):      2.797 s ±  0.009 s    [User: 0.680 s, System: 0.580 s]
Range (min … max):    2.788 s …  2.811 s    10 runs

With Changes:

Time (mean ± σ):      2.858 s ±  0.009 s    [User: 0.709 s, System: 0.607 s]
Range (min … max):    2.842 s …  2.868 s    10 runs

Inserting 3000 nodes in individual transactions into the ldbc-10 comment table

Before:

Time (mean ± σ):     35.579 s ±  7.439 s    [User: 19.041 s, System: 14.205 s]
Range (min … max):   32.063 s … 56.632 s    10 runs

With Changes:

Time (mean ± σ):     28.485 s ±  2.402 s    [User: 16.942 s, System: 7.253 s]
Range (min … max):   25.517 s … 33.060 s    10 runs

@benjaminwinger benjaminwinger force-pushed the string-serialization branch 2 times, most recently from 45535e4 to 9e03541 Compare November 13, 2023 20:30
@benjaminwinger benjaminwinger force-pushed the string-serialization branch 2 times, most recently from 1518e90 to a92953d Compare November 15, 2023 17:26
@@ -5,7 +5,7 @@
namespace kuzu {
namespace common {

-constexpr char KUZU_VERSION[] = "v0.0.12.1";
+constexpr char KUZU_VERSION[] = "v0.0.12.2";
Contributor

Not related to this PR, but I wonder if we can replace this constant with the project version from our CMake file.

@benjaminwinger (Collaborator, Author) Nov 16, 2023

We can: #2428

src/include/storage/storage_info.h (outdated review thread, resolved)

using read_values_to_page_func_t =
    std::function<void(uint8_t* frame, PageElementCursor& pageCursor, uint8_t* result,
        uint32_t posInResult, uint64_t numValues, const CompressionMetadata& metadata)>;
// This is a special usage for the `batchLookup` interface.
using batch_lookup_func_t = read_values_to_page_func_t;

struct ReadState {
Contributor

ReadState -> ColumnReadState. I feel the latter is better in terms of ambiguity, as we already have other read states.

Collaborator Author

How about Column::ReadState? That way, it can be kept entirely internal to the class.
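A minimal sketch of what the nested form could look like (placeholder members, not the real ones):

```cpp
#include <cstdint>

class Column {
public:
    // Nested, so callers refer to Column::ReadState and the name stays internal
    // to the column hierarchy rather than adding a new top-level type.
    struct ReadState {
        uint64_t numValuesToScan = 0;    // placeholder field
        uint64_t startOffsetInChunk = 0; // placeholder field
    };
};
```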

Additional review threads (outdated, resolved):
  • src/include/storage/store/column.h
  • test/storage/compression_test.cpp
  • src/include/storage/store/column.h
  • src/storage/compression/compression.cpp
  • src/storage/compression/compression.cpp
  • src/storage/store/column.cpp
@benjaminwinger benjaminwinger force-pushed the string-serialization branch 4 times, most recently from cb3eb25 to feb92bb Compare November 16, 2023 19:34
This sets up the storage layout for compressing strings, and removes the
limit on string size in storage.
@benjaminwinger benjaminwinger merged commit bb159fe into master Nov 16, 2023
11 of 12 checks passed
@benjaminwinger benjaminwinger deleted the string-serialization branch November 16, 2023 21:54