String serialization #2304
Conversation
The in-memory limits in ValueVectors are imposed by the MemoryManager, which we should be able to relax to 256KB for now.
force-pushed from 29a006b to 21fb0d9
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #2304      +/-   ##
==========================================
- Coverage   90.54%   90.42%   -0.13%
==========================================
  Files        1020     1020
  Lines       36515    36641     +126
==========================================
+ Hits        33064    33131      +67
- Misses       3451     3510      +59
```

☔ View full report in Codecov by Sentry.
force-pushed from 4bda12c to 34045d5
force-pushed from c55f74e to a1d0a4c
force-pushed from a1d0a4c to 56cb7dc
force-pushed from 56cb7dc to 35b6ed0
I've done my best to reduce the overhead when scanning (a large part now seems to be caused by the `std::function` construction and destruction for the per-page read callbacks).

There will be some improvements for duplicate string values with dictionary compression, as we can avoid scanning the duplicates multiple times, but that won't help for unduplicated data. Ideally, I think we want to recognize when the offsets and string data to be scanned are adjacent, and scan the data together or in batches. The fact that we scan values in chunks significantly smaller than the node group size doesn't help, as there's no guarantee that the string entries in the dictionary for that chunk will be adjacent. But particularly without re-writing values, and without dictionary compression (or with data which doesn't benefit from dictionary compression even once it's implemented and enabled), it's likely that adjacent string indices refer to data that is adjacent in the offset and data columns; a sketch of that adjacency check follows below.

As an aside, I think this also means we would get a noticeable speed boost for strings without dictionary compression if we supported storing them in two columns (offset+length and data) instead of three.
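As a rough illustration of the batching idea (hypothetical types and names, not kuzu's actual classes), a scan could first group dictionary entries whose string data is contiguous and then issue one batched read per run:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical dictionary entry: where a string's bytes live in the data column.
struct StringEntry {
    uint64_t offset; // start position in the data column
    uint64_t length; // number of bytes
};

// Group consecutive entries whose bytes are contiguous in the data column,
// returning [begin, end) index ranges. Each run could then be fetched with a
// single read of the offset column and a single read of the data column.
std::vector<std::pair<size_t, size_t>> findAdjacentRuns(
    const std::vector<StringEntry>& entries) {
    std::vector<std::pair<size_t, size_t>> runs;
    size_t runStart = 0;
    for (size_t i = 1; i <= entries.size(); i++) {
        bool adjacent = i < entries.size() &&
            entries[i].offset == entries[i - 1].offset + entries[i - 1].length;
        if (!adjacent) {
            runs.emplace_back(runStart, i);
            runStart = i;
        }
    }
    return runs;
}
```

With runs in hand, a scan would do one read per run instead of one per value, which is where the batching win would come from.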
force-pushed from 35b6ed0 to 6118341
Before messing around too much with trying to combine adjacent string offset/data queries, I've benchmarked both this and the three as-yet-unpublished parts I've written (enabling compression for string indices and offsets, reading strings directly into the result vector, and dictionary compression).

Copy Benchmarks

Copy benchmarks are done using hyperfine and the Python API.

Baseline
Data size: 1153272 bytes

String layout changes only
Data size: 1092528 bytes (the new layout lets strings cross page boundaries, so there is less unused space)

Layout changes with compressed index and offset columns
Data size: 1012132 bytes

Dictionary compression
Still seems to have a bug with var lists of strings, but that doesn't actually affect this dataset.
Data size: 917208 bytes

Query benchmarks

Queries are from the LDBC benchmark.

Insert benchmarks

There is a small increase in the time to insert. I'll try profiling, but I suspect it's just caused by the extra column modification. Update: profiling confirms that it's basically from the extra column, including an extra metadata update.

Note that the maximum in some of these benchmarks is an outlier due to the filesystem caches.

Inserting 3000 nodes in a single transaction via the CLI
Before:
With changes:

Inserting 3000 nodes in individual transactions via the CLI
Before:
With changes:

Inserting 3000 nodes in individual transactions into the ldbc-1 comment table
Before:
With changes:

Inserting 3000 nodes in one transaction into the ldbc-10 comment table
Before:
With changes:

Inserting 3000 nodes in individual transactions into the ldbc-10 comment table
Before:
With changes:
force-pushed from 45535e4 to 9e03541
force-pushed from 1518e90 to a92953d
src/include/common/constants.h

```diff
@@ -5,7 +5,7 @@
 namespace kuzu {
 namespace common {

-constexpr char KUZU_VERSION[] = "v0.0.12.1";
+constexpr char KUZU_VERSION[] = "v0.0.12.2";
```
Not related to this PR, but I wonder if we can replace this const with the project version in our CMakeFile.
We can: #2428
src/include/storage/store/column.h

```cpp
using read_values_to_page_func_t =
    std::function<void(uint8_t* frame, PageElementCursor& pageCursor, uint8_t* result,
        uint32_t posInResult, uint64_t numValues, const CompressionMetadata& metadata)>;
// This is a special usage for the `batchLookup` interface.
using batch_lookup_func_t = read_values_to_page_func_t;

struct ReadState {
```
ReadState -> ColumnReadState. I feel the latter is less ambiguous, as we already have other read states.
How about `Column::ReadState`? That way, it can be entirely internal to the class.
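For illustration only, a callback compatible with the `read_values_to_page_func_t` alias in the diff above might, for a plain fixed-size type, copy raw values straight out of the pinned page frame. The stand-in struct fields here are assumptions so the sketch compiles on its own, not kuzu's actual definitions:

```cpp
#include <cstdint>
#include <cstring>
#include <functional>

// Minimal stand-ins; the real types live in kuzu's storage headers and
// carry more state than shown here.
struct PageElementCursor { uint32_t pageIdx; uint16_t elemPosInPage; };
struct CompressionMetadata {};

using read_values_to_page_func_t =
    std::function<void(uint8_t* frame, PageElementCursor& pageCursor, uint8_t* result,
        uint32_t posInResult, uint64_t numValues, const CompressionMetadata& metadata)>;

// Sketch: copy numValues raw values of fixed-size type T from the page frame
// into the result buffer, ignoring compression metadata.
template<typename T>
void readRawValues(uint8_t* frame, PageElementCursor& pageCursor, uint8_t* result,
    uint32_t posInResult, uint64_t numValues, const CompressionMetadata& /*metadata*/) {
    std::memcpy(result + posInResult * sizeof(T),
        frame + pageCursor.elemPosInPage * sizeof(T), numValues * sizeof(T));
}

// Wrapping such a callback is the per-call std::function construction cost
// mentioned in the scan discussion above:
// read_values_to_page_func_t readInt64s = readRawValues<int64_t>;
```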
force-pushed from cb3eb25 to feb92bb
This sets up the storage layout for compressing strings, and removes the limit on string size in storage.
force-pushed from feb92bb to 54a2273
This restructures string column storage to use an index column (`UINT32`), an offset column (`UINT64`), and a data column (`UINT8`). By storing strings as a sequence of characters I was able to remove the reliance on InMemOverflowFiles and allow strings to be split across multiple pages, removing the restriction that strings must be smaller than a page in storage (I think there are still in-memory limits in `ValueVector`s which I haven't looked into removing), and laying the foundation for string dictionary compression (see the string dictionary compression section in #1474).

This is implemented by creating a variant of `NodeColumn` called `AuxiliaryNodeColumn`, which handles data stored adjacent to a node group that can grow beyond `NODE_GROUP_SIZE` elements (`BaseNodeColumn` handles the shared functionality). Primarily, I wanted to separate out the logic that handles the offsets, to make sure that when fetching string data for a node group we don't use any of the functions which assume that each chunk contains `NODE_GROUP_SIZE` elements.

I think this might also be usable for `VarListNodeColumn`, but I haven't looked deeply into that.
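To make the three-column layout concrete, here is a minimal sketch of the lookup path, with the columns modeled as in-memory vectors. The `StringColumns` struct and the convention that entry `i` spans `[offsets[i], offsets[i+1])` are assumptions for illustration, not the PR's actual classes:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the three on-disk columns.
struct StringColumns {
    std::vector<uint32_t> indices; // index column (UINT32): row -> dictionary entry
    std::vector<uint64_t> offsets; // offset column (UINT64): entry -> byte offset
    std::vector<uint8_t> data;     // data column (UINT8): raw string bytes
};

std::string lookupString(const StringColumns& cols, uint64_t row) {
    // 1. The index column maps the row to its dictionary entry.
    uint32_t idx = cols.indices[row];
    // 2. Adjacent offsets delimit the entry's bytes in the data column; since
    //    the bytes are a flat sequence, a string can cross page boundaries
    //    and is no longer capped at a page in size.
    uint64_t start = cols.offsets[idx];
    uint64_t end = cols.offsets[idx + 1];
    // 3. Copy the raw bytes out.
    return std::string(cols.data.begin() + start, cols.data.begin() + end);
}
```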
Updates replacing values are not very efficient at the moment:

- [ ] Out-of-place updates modifying existing elements will leave unused entries in the string dictionary, since the old values are scanned into memory, updates are appended to the end, and then the resulting data is flushed directly to disk. Dictionary de-duplication compression would fix this, but even for uncompressed string data we would probably want to write a custom flushBuffer function which prunes unused values, as sketched below (handled for dictionary compression, which I'm not including in this PR yet).
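As a rough sketch of what that pruning could do, reusing the hypothetical `StringColumns` struct from the lookup sketch above (this is not the actual flushBuffer interface): walk the index column, copy only the dictionary entries that are still referenced, and remap indices to the compacted dictionary.

```cpp
#include <unordered_map>

// Sketch only: drop dictionary entries no row references before flushing.
// Uses the hypothetical StringColumns struct from the lookup sketch above.
StringColumns pruneUnusedEntries(const StringColumns& in) {
    StringColumns out;
    out.offsets.push_back(0); // offsets[i]..offsets[i+1] delimit entry i
    std::unordered_map<uint32_t, uint32_t> remap; // old entry -> compacted entry
    for (uint32_t oldIdx : in.indices) {
        // remap.size() is evaluated before insertion, so a newly seen entry
        // receives the next compacted index.
        auto [it, inserted] = remap.try_emplace(oldIdx,
            static_cast<uint32_t>(remap.size()));
        if (inserted) { // first reference: copy this entry's bytes over
            out.data.insert(out.data.end(), in.data.begin() + in.offsets[oldIdx],
                in.data.begin() + in.offsets[oldIdx + 1]);
            out.offsets.push_back(out.data.size());
        }
        out.indices.push_back(it->second);
    }
    return out;
}
```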