Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the performance of VAR_LIST storage layout #3093

Merged
merged 1 commit into from
Mar 24, 2024

Conversation

hououou
Copy link
Collaborator

@hououou hououou commented Mar 20, 2024

This PR is to improve the performance of var #3060.

First, we now support in-place commits for VAR_LIST. We separately commit the offset column, size column, and data column. Previously, we do not support in-place commit for VAR_LIST. Thus, every time we update an item, it will trigger an out-place commit: scanning the whole column, updating the column, and writing the updated column into the persistence storage.

Second, the design of #3060 is good for writing performance. But it is bad for scan performance since it may cause random access to the data column. If the offset is not in ascending order, we need to scan the list one by one. Worse, the data column will be increased with every update/insert. To balance write and read performance, we rewrite the whole var list column in ascending order at some point. Now we rewrite it when we do an out-place commit and the size of the data column is larger than a threshold(Currently, we use capacity/2).

There are still some spaces we can optimize the storage layout of VAR_LIST.

  1. We now can only do out-place commits for the offset column because of some interfaces. A better way is to store the offset in another offset column but not in the buffer of the VAR_LIST column itself, just like the size column.
  2. We should support scanning the data in non-empty column chunk. Then we can move tmpDataColumnChunk in VarListColumn::scan(Transaction* transaction, node_group_idx_t nodeGroupIdx, kuzu::storage::ColumnChunk* columnChunk, offset_t startOffset, offset_t endOffset) .
  3. If a list is not in ascending order, we now scan the list items one by one. A better way is to scan in groups, trying to scan out those small parts of contiguous data together as much as possible.

I also ran a benchmark to test the performance of the current VAR_LIST storage layout. The dataset only has one person(id int64, age int64[]) table and one knows table(from person to person, length int64) table. The dataset has 10000000 nodes(around 1GB) and 100000000 edges (around 11GB). The results are as follows. The writing performance has improved a lot. The reading performance is similar to the previous implementation except for scanning the whole edge table (maybe caused by the above second bottleneck).

var_list_storage

@hououou hououou changed the base branch from master to remove_var_list_index_column March 20, 2024 04:07
@hououou hououou changed the base branch from remove_var_list_index_column to master March 20, 2024 19:07
@hououou hououou marked this pull request as ready for review March 20, 2024 19:08
Copy link

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 93.42561% with 19 lines in your changes are missing coverage. Please review.

Project coverage is 91.93%. Comparing base (f9bc0c6) to head (6870d28).
Report is 3 commits behind head on master.

Files Patch % Lines
src/storage/store/var_list_column.cpp 90.06% 15 Missing ⚠️
src/storage/store/var_list_column_chunk.cpp 96.66% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3093      +/-   ##
==========================================
+ Coverage   91.89%   91.93%   +0.04%     
==========================================
  Files        1169     1171       +2     
  Lines       43774    43982     +208     
==========================================
+ Hits        40227    40436     +209     
+ Misses       3547     3546       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hououou hououou changed the title VAR_LIST in place commit Improve the performance of VAR_LIST storage layout Mar 20, 2024
@hououou hououou requested a review from ray6080 March 20, 2024 19:45
Copy link
Contributor

@ray6080 ray6080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also bump the last digit of the CMake project version.

Haven't finished reviewing all of these changes, I will continue later.

@@ -792,7 +792,8 @@ void Column::commitColumnChunkOutOfPlace(Transaction* transaction, node_group_id
auto chunkMeta = getMetadata(nodeGroupIdx, transaction->getType());
// TODO(Guodong): Should consider caching the scanned column chunk to avoid redundant
// scans in the same transaction.
auto columnChunk = getEmptyChunkForCommit(chunkMeta.numValues + dstOffsets.size());
auto columnChunk =
getEmptyChunkForCommit(1.5 * std::bit_ceil(chunkMeta.numValues + dstOffsets.size()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 1.5*? Is there performance implications behind this?

src/storage/store/var_list_column.cpp Outdated Show resolved Hide resolved
src/storage/store/var_list_column_chunk.cpp Outdated Show resolved Hide resolved
src/include/storage/store/var_list_column_chunk.h Outdated Show resolved Hide resolved
src/include/storage/store/var_list_column.h Outdated Show resolved Hide resolved
src/include/storage/store/var_list_column.h Show resolved Hide resolved
src/storage/store/var_list_column.cpp Outdated Show resolved Hide resolved
startVarListOffset, endVarListOffset);
varListColumnChunk->resetOffset();
} else {
varListColumnChunk->resizeDataColumnChunk(std::bit_ceil(resizeNumValues));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code path is not well covered by tests. Can you add tests that first perform a bunch of in place updates, then perform reads?

Copy link
Contributor

@ray6080 ray6080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me take another look after you've addressed comments.

src/storage/store/var_list_column.cpp Outdated Show resolved Hide resolved
src/storage/store/var_list_column.cpp Outdated Show resolved Hide resolved
src/storage/store/var_list_column.cpp Outdated Show resolved Hide resolved
src/storage/store/var_list_column.cpp Outdated Show resolved Hide resolved
Comment on lines 356 to 358
for (auto i = 0u; i < appendListNum; i++) {
auto appendLen = varListChunk->getListLen(startSrcOffset + i);
for (auto j = 0u; j < appendLen; j++) {
dstOffsetsInDataColumn.push_back(dataColumnSize + appendListDataNum + j);
}
appendListDataNum += appendLen;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just

        for (auto i = 0u; i < appendListNum; i++) {
            for (auto j = 0u; j < appendLen; j++) {
                dstOffsetsInDataColumn.push_back(dataColumnSize++);
            }
        }

btw, you can reuse this function under rel_table_data.h

    inline void fillSequence(std::span<common::offset_t> offsets, common::offset_t startOffset) {
        for (auto i = 0u; i < offsets.size(); i++) {
            offsets[i] = i + startOffset;
        }
    }

// in the end of list data column we need to plus to original data column size to get the
// new offset
// TODO(Jiamin): A better way is to store the offset in a offset column, just like size
// column. Then we can reuse prepareCommitForChunk interface for offset column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should apply this optimization. You can separate it in another PR, but we should do it soon.

src/storage/store/var_list_column_chunk.cpp Outdated Show resolved Hide resolved
src/storage/store/var_list_column_chunk.cpp Outdated Show resolved Hide resolved
src/storage/store/var_list_column_chunk.cpp Outdated Show resolved Hide resolved
src/storage/store/var_list_column_chunk.cpp Outdated Show resolved Hide resolved
@ray6080
Copy link
Contributor

ray6080 commented Mar 22, 2024

Can we close #3060 and #3019 now?

This was referenced Mar 23, 2024
rework var list

improve performance

trigger ci

solve comments

fix

bump version
@ray6080 ray6080 merged commit 599b80f into master Mar 24, 2024
17 checks passed
@ray6080 ray6080 deleted the var_list_in_place_commit branch March 24, 2024 04:19
@hououou hououou mentioned this pull request Apr 15, 2024
32 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants