Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node group based rel table #2246

Merged
merged 1 commit into from
Nov 2, 2023
Merged

Node group based rel table #2246

merged 1 commit into from
Nov 2, 2023

Conversation

ray6080
Copy link
Contributor

@ray6080 ray6080 commented Oct 23, 2023

This is a quite bulky PR, sorry for that :).
Mainly this PR introduces several major changes:

  • Moved rel table to be stored as node groups.
  • Remove arrow from third party dependencies. Now all reads are done through our customized CSV and Parquet readers.
  • A bunch of clean ups, including removing legacy data structures for rel tables, such as Lists, Columns, etc., cleaning up code related to StorageStructure and a bunch of util functions for column and list files.

Note that this PR breaks updates, creations, deletions for rel tables for a short period of time
We discussed this internally inside the team, though it's not a good practice to break master, we decided to merge this early to save efforts for everyone to sync around monster changes brought by this PR.

Rel Table Storage

A rel table consists of both forward and backward RelTableData, which consists of multiple Columns.
Each Column can be stored in two formats: REGULAR and CSR. Regular columns are used by both NodeTable and RelTable when the multiplicity is ONE/MANY_TO_ONE, thus, each node offset corresponds to one value in the column. CSR columns are used only by RelTable when the multiplicity is ONE/MANY_TO_MANY, where each node offset corresponds to a CSR list in the column.

For a node group of RelTableData stored in CSR format, it contains following column chunks:

  • CSR offsets: stores for each node in the node group its csr offset. The column chunk contains NODE_GROUP_SIZE values, each of which records the end offset of the CSR of the node with the corresponding offset. For example, csr offsets are [3, 5, 5, 9, ... ], meaning the first CSR list's size is 3, and the second is 2, etc.
    When reading data from a node group, we always need to first read out its csr offsets into RelDataReadState, then read actual data columns based on the offsets.
  • Adj column: stores neighbour node ids. we should consider renaming this to nbrColumn later.
  • Properties: rel properties defined in the table schema.

COPY

COPY pipelines for rel tables are as follows:

            __RESULT_COLLECTOR__
           |           |       |
COPY_REL(fwd)   COPY_REL(bwd)  PARTITIONER     
                                    |
                              INDEX_LOOKUP
                                    |
                                 READER

The main part is executed as three pipelines.
The right most PARTITIONER pipeline gets executed first. It performs partitioning of all tuples based on their src and dst node offsets. We copy tuples to each partitioning buffer as for now and organize them into different partitions (i.e., node groups). This could contribute to large memory consumption when the copied dataset is large. The following solution to this is to allow spilling intermediate partitioning result to disk as tmp files, which introduces I/O overheads, but the I/O pattern should be always be sequential reads and writes.
Other two pipelines are both CopyRel for each directions, which consume partitioned buffer from the PARTITIONER pipeline. During execution, each thread will fetch one partition (i.e., node group) at a time, construct necessary column chunks for the node group, and append them to table data file.

Note that this design is not able to handle very skewed dataset well. When one node group is extremely large, the copy rel pipeline could end up being only one thread working during most of the time, we should perhaps address this by further partitioning the extremely large node group and let multiple threads work concurrently on it.

TODOs

  • Throw error messages in the front end for disabled updates/deletions/insertions on rel table.

Tests

  • Add nested data types for rel table: STRUCT, MAP.
  • Add nested data types for rel table: UNION.
  • Fix e2e ddl tests.
  • Fix RDF tests. CopyRDFTest and RdfoxExample.Basic.
  • Fix CopyRelTableMultiplicityViolationTest and TinySnbExceptionTest.DeleteNodeWithEdgeErrorTest.
  • Fix TinySnbCreateNodeTest, TinySnbCreateReadRelTest, TinySnbDeleteRelTest, TinySnbMergeRelTest, TinySnbSetReadRelTest, UpdateRelTest, DeleteRelTest, CreateRelTest, TCK, DemoDBCreateTest, DemoDBDeleteTest.

Refactoring

  • Change RelDataDirection from enum to enum class.
  • Add OFFSET logical data type Add OFFSET logical type #2331.
  • Address this comment on getting rid of the converting binding type of SERIAL to INT64.
  • Address this comment on refactoring copyDataToPartitions function.
  • Revisit VAR_LIST storage layout. Existing one is not write friendly. Consider the alternative way similar to STRING (indices, offsets, data). I will open a separate issue on this.
  • Move offset and adj column to properties.
  • Get rid of leftNumRows inside Reader::readNextDataChunk().
  • Turn on compression on internalID.
  • Move partitioner buffer to be managed by MemoryManager. To do that, we should introduce a column-oriented data chunk collection (or more generally, result set collection).

@ray6080 ray6080 force-pushed the node-group branch 3 times, most recently from 762f950 to e925eea Compare October 27, 2023 02:44
@ray6080 ray6080 marked this pull request as ready for review October 27, 2023 02:44
src/binder/bind/bind_copy.cpp Outdated Show resolved Hide resolved
@@ -104,8 +101,6 @@ void Catalog::addRelProperty(
initCatalogContentForWriteTrxIfNecessary();
catalogContentForWriteTrx->getTableSchema(tableID)->addRelProperty(
propertyName, std::move(dataType));
wal->logAddPropertyRecord(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we log dropProperty in the catalog, but not log addProperty there?

test/test_files/transaction/test.test Outdated Show resolved Hide resolved
test/test_files/tinysnb/var_length_extend/multi_label.test Outdated Show resolved Hide resolved
test/runner/e2e_test.cpp Outdated Show resolved Hide resolved
src/include/storage/store/rel_table_data.h Outdated Show resolved Hide resolved
src/processor/operator/partitioner.cpp Show resolved Hide resolved
src/processor/operator/partitioner.cpp Show resolved Hide resolved
src/processor/operator/partitioner.cpp Show resolved Hide resolved
src/binder/bind/bind_copy.cpp Show resolved Hide resolved
src/binder/bind/bind_copy.cpp Show resolved Hide resolved
src/include/binder/copy/bound_copy_from.h Outdated Show resolved Hide resolved
src/include/planner/operator/logical_partitioner.h Outdated Show resolved Hide resolved
src/planner/plan/plan_copy.cpp Show resolved Hide resolved
src/processor/map/map_copy_from.cpp Show resolved Hide resolved
src/processor/map/map_copy_from.cpp Show resolved Hide resolved
src/processor/map/map_copy_from.cpp Show resolved Hide resolved
src/include/processor/operator/partitioner.h Outdated Show resolved Hide resolved
Copy link

codecov bot commented Nov 2, 2023

Codecov Report

Attention: 125 lines in your changes are missing coverage. Please review.

Comparison is base (94a611c) 89.82% compared to head (b89f443) 88.61%.
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2246      +/-   ##
==========================================
- Coverage   89.82%   88.61%   -1.21%     
==========================================
  Files        1022     1002      -20     
  Lines       35452    32893    -2559     
==========================================
- Hits        31845    29149    -2696     
- Misses       3607     3744     +137     
Files Coverage Δ
src/catalog/catalog.cpp 83.33% <100.00%> (-16.67%) ⬇️
src/catalog/catalog_content.cpp 88.57% <100.00%> (-11.43%) ⬇️
src/common/copier_config/copier_config.cpp 54.54% <ø> (ø)
src/common/types/types.cpp 0.00% <ø> (ø)
src/common/vector/auxiliary_buffer.cpp 97.95% <ø> (-0.05%) ⬇️
src/common/vector/value_vector.cpp 92.70% <100.00%> (+0.74%) ⬆️
src/include/binder/copy/bound_copy_from.h 77.41% <100.00%> (-22.59%) ⬇️
src/include/catalog/catalog.h 100.00% <ø> (ø)
src/include/catalog/catalog_content.h 95.00% <ø> (ø)
src/include/catalog/table_schema.h 86.20% <100.00%> (-13.80%) ⬇️
... and 132 more

... and 69 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ray6080 ray6080 force-pushed the node-group branch 3 times, most recently from ae1547f to 5941d39 Compare November 2, 2023 05:20
@ray6080 ray6080 merged commit 2f30023 into master Nov 2, 2023
12 checks passed
@ray6080 ray6080 deleted the node-group branch November 2, 2023 18:59
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Riolku added a commit that referenced this pull request Nov 3, 2023
The bulk of the warnings issued are because of unused includes, but some
are because of dead files leftover from #2246. Also, `subplans_table.h`
had a `using` directive, which caused symbol pollution.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants