Node group-based node table storage #1802

Merged

merged 1 commit into master from node-group on Aug 2, 2023

Conversation

@ray6080 (Contributor) commented Jul 11, 2023

This is the first major PR towards node group-based storage.

The implementation is detailed as follows:

Column Chunk

ColumnChunk is used during COPY. It buffers all values of one column within a node group.

class ColumnChunk {
    ... ...
    std::unique_ptr<uint8_t[]> buffer;
    std::unique_ptr<NullColumnChunk> nullChunk;
    std::vector<std::unique_ptr<ColumnChunk>> childrenChunks; // for nested data types. (currently, only STRUCT will make use of this.)
    ... ...
}

The basic methods for setting, getting, and copying values into the ColumnChunk are essentially the same as in InMemColumnChunk, which is still used to populate columns in rel tables. Eventually, these two will be merged.

Node Group

NodeGroup wraps the ColumnChunks of all node properties for the same range of node offsets.
The class is also used during COPY.

class NodeGroup {
    ... ...
    uint64_t nodeGroupIdx;
    common::offset_t numNodes;
    std::unordered_map<common::property_id_t, std::unique_ptr<ColumnChunk>> chunks;
    ... ...
}

Copy Node

The COPY node pipeline consists of two operators, ReadFile and CopyNode. The former reads a chunk from the input files and feeds it to CopyNode, which populates the node table accordingly.
The logic of CopyNode is as follows:

  • Each thread holds a local node group, which is reused to buffer data pulled from ReadFile. Once full, it is flushed out and reset (CopyNode::appendNodeGroupToTableAndPopulateIndex). When ReadFile is exhausted, data left in the local node group is merged into a node group shared among all threads (to avoid flushing many non-full node groups). Finally, the last thread guarantees that all data remaining in the shared node group is flushed to disk (CopyNode::finalize).
  • Each node group is assigned a node group idx (CopyNodeSharedState::getNextNodeGroupIdx) when it is flushed out (NodeTable::appendNodeGroup). The start node offset of the node group is calculated from its node group idx (nodeGroupIdx << NODE_GROUP_SIZE_LOG2). The node group idx is also used to access the group's metadata in the meta disk array (columnChunksMetaDA->get(nodeGroupIdx) in NodeColumn). The assignment of node group idxes is coordinated through the shared state of CopyNode; see the sketch at the end of this section.
  • When flushing to disk, each column chunk is assigned a set of sequential pages, and each chunk starts at offset 0 of its first page.

Note that this design is not order preserving, meaning that the ordering of nodes in the internal storage is not guaranteed to match their ordering in the original csv/parquet files.
Order-preserving COPY will be added later along with the fix for SERIAL.

Column Chunk Metadata

Catalog holds the file handle of the metadata file, which consists of disk arrays of ColumnChunkMetadata for each column.

struct ColumnChunkMetadata {
    common::page_idx_t pageIdx = common::INVALID_PAGE_IDX;
    common::page_idx_t numPages = 0;
};

ColumnChunkMetadata keeps track of the starting page idx and the number of pages of a column chunk, so we can correctly read the chunk back from disk.

class Catalog {
    ... ...
    std::unique_ptr<storage::BMFileHandle> nodeGroupsMetaFH;
    ... ...
}

In addition, Property holds a MetaDiskArrayHeaderInfo, which keeps track of the disk arrays related to this property. (There can be multiple disk arrays because we separate data, nulls, and nested children into different column chunks.)

struct MetaDiskArrayHeaderInfo {
    common::page_idx_t mainHeaderPageIdx = common::INVALID_PAGE_IDX;
    common::page_idx_t nullHeaderPageIdx = common::INVALID_PAGE_IDX;
    std::vector<MetaDiskArrayHeaderInfo> childrenMetaDAHeaderInfos;
};

class Property {
    ... ...
    MetaDiskArrayHeaderInfo metaDiskArrayHeaderInfo;
    ... ...
}

Node Column

NodeColumn serves a similar purpose to Column. Eventually these two should be merged into one.
A node column holds a disk array of column chunk metadata.
Given a starting offset, a read first calculates the node group idx (nodeOffset >> common::StorageConstants::NODE_GROUP_SIZE_LOG2), then accesses the column chunk metadata to get the starting page idx of the column chunk in the data file, and finally reads the necessary pages (same logic as before this PR).

Writes are similar: given any node offset, we first get the corresponding column chunk metadata to figure out the starting page idx; the rest of the write logic is unchanged.

(Note: for scans, we can later optimize this to avoid repeatedly calculating and accessing the column chunk metadata. If we change the morsel to a node group, the scan operator should keep a scan state and only fetch the column chunk metadata when scanning the first vector of a node group; subsequent scans within the same node group can reuse the metadata.)

TODOs

Here is a list of things that are broken right now and on the way to being fixed:

  • SERIAL data type.
  • COPY npy.
  • WAL-based transaction mechanism for the initialization of metadataDA.
  • Fix add property, and add transaction tests on add property and copy node (rollback, recovery, etc.). This will require refactoring some existing tests on DDL and copy node.
  • Local-storage based updates on variable-sized values.
  • Fix copy node error messages.
  • Add file truncation logic during replaying CopyNode wal record.

Immediate follow-up work for this PR includes (in no particular order):

  • Benchmark the idea of changing ReadFile's morsel to one node group at a time, so CopyNode can be simplified.
  • LIST layout rework.
  • STRING layout rework.
  • Constant compression.
  • Bit-packing compression.
  • Node group-based rel table storage.

@ray6080 force-pushed the node-group branch 2 times, most recently from f26db56 to 0477f05 on July 24, 2023 05:47
@ray6080 marked this pull request as ready for review July 24, 2023 13:24
@ray6080 changed the title from "[WIP] Node group-based node table storage" to "Node group-based node table storage" on Jul 24, 2023
@acquamarin (Collaborator) left a comment:

I am getting a seg fault while loading the ldbc-100 comment csv on my Mac. It looks like the ReadFile operator is trying to access a ValueVector that has a null dataChunk state.

@semihsalihoglu-uw (Contributor) left a comment:

I'm in the middle of my review. I'll continue tonight but in case you can start on this, here's my initial set of comments.

protected:
std::unique_ptr<CatalogContent> catalogContentForReadOnlyTrx;
std::unique_ptr<CatalogContent> catalogContentForWriteTrx;
storage::WAL* wal;
std::unique_ptr<storage::BMFileHandle> nodeGroupsMetaFH;
Contributor:

Is this the right place to store this FH? It looks like this might belong to a class in StorageManager.

Also the name Meta does not sound correct. Maybe kzMetadataFileFH or metadataFileFH or metadataFH. I have a few suggestions around this (e.g., in constants.h or storagemanager.h). Whatever you decide about these file names, just be consistent in every field/variable/constant etc.

@ray6080 (Contributor, Author) replied on Jul 31, 2023:

Still kept in Catalog for now. Honestly, I'm not sure whether Catalog or StorageManager is the best place. I want to revisit this as we move on.

Contributor:

Why? Catalog is clearly not the place to keep track of disk-related fields such as file handles. In addition, this field already exists (as a raw pointer) in node_table.h, which makes sense.

@ray6080 (Contributor, Author):

My understanding was that, regardless of whether it is logical (schema) or physical (metadata), this information is metadata about the physically stored table data. It depends on whether we should separate logical and physical or not. I was a bit indecisive, but now I think separating them makes more sense, so I will make the change.

I tried to move metadataFH to StorageManager. The part I like is that it avoids involving Catalog in interacting with the metadata file; the annoying part is that during WAL replaying, when it comes to recovery, there is no StorageManager present, so we need to somehow (re)construct a metadataFH, which I don't think is a correct design.

Following this (not related to this PR), I've been quite confused about why we chose to do recovery before we construct the Catalog and StorageManager objects. (I can't quite remember why we have to do that.) Is this still a necessary design now?

auto tableSchema = catalogForCheckpointing->getReadOnlyVersion()->getTableSchema(tableID);
auto property = tableSchema->getProperty(propertyID);
if (tableSchema->isNodeTable) {
    WALReplayerUtils::initPropertyMetaDAsOnDisk(
Contributor:

This looks a bit wrong to me. I was expecting you to go through the regular WAL version of pages mechanisms to checkpoint the necessary pages on disk. And then call something like a "checkpointInMemory()" function on NodeTable to create the NodeColumns (and in the InMemDiskArrays) that are being created as part of the add property DDL statement. I am even curious if we are already doing the WAL version of pages way of checkpointing as well as WALReplayerUtils::initPropertyMetaDAsOnDisk. It's not obvious to me that we are not doing such "double checkpointing" on disk.

@ray6080 (Contributor, Author):

I'm separating this change into another PR.

@semihsalihoglu-uw (Contributor) left a comment:

Here are the rest of my comments. Let's do another iteration and also discuss certain things in person if you need.

@@ -97,6 +132,20 @@ void NodeTable::prepareRollback() {
}
}

void NodeTable::checkpointInMemory() {
Contributor:

I know this code is also used when there is a copy into a node table, so all properties and the PK are being updated. But if I understand correctly, it is also used when a new property is added, right? So in that case as well, we go through this coarse way of checkpointing each component of a NodeTable in memory. In the wal_replayer we should probably have a mechanism to keep track of not just the nodeTables but the individual properties that require in-memory checkpointing.

You don't have to do it here, but I'm adding this comment to record the problem.

@ray6080 (Contributor, Author):

I believe we should keep track of these directly inside Transaction, which seems easier and makes more sense to me.

@ray6080 force-pushed the node-group branch 2 times, most recently from df1bb2c to b7352e4 on August 1, 2023 07:05
codecov bot commented Aug 1, 2023

Codecov Report

Patch coverage: 82.39% and project coverage change: -1.49% ⚠️

Comparison is base (416d392) 91.11% compared to head (cd99fab) 89.62%.

❗ Current head cd99fab differs from pull request most recent head d03b06a. Consider uploading reports for the commit d03b06a to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1802      +/-   ##
==========================================
- Coverage   91.11%   89.62%   -1.49%     
==========================================
  Files         813      821       +8     
  Lines       29317    30160     +843     
==========================================
+ Hits        26711    27032     +321     
- Misses       2606     3128     +522     
Files Changed Coverage Δ
src/common/file_utils.cpp 77.64% <ø> (+3.48%) ⬆️
src/include/catalog/catalog.h 100.00% <ø> (ø)
src/include/common/file_utils.h 100.00% <ø> (ø)
src/include/common/types/types.h 100.00% <ø> (ø)
src/include/common/vector/value_vector.h 100.00% <ø> (ø)
..._plan/logical_operator/logical_create_node_table.h 100.00% <ø> (ø)
src/include/processor/operator/ddl/add_property.h 78.57% <0.00%> (-21.43%) ⬇️
...include/processor/operator/ddl/create_node_table.h 100.00% <ø> (ø)
src/include/storage/copier/npy_reader.h 100.00% <ø> (ø)
src/include/storage/file_handle.h 100.00% <ø> (ø)
... and 67 more

... and 50 files with indirect coverage changes



@ray6080 force-pushed the node-group branch 2 times, most recently from df07db5 to 00eacad on August 2, 2023 07:55
@ray6080 merged commit baf9e56 into master Aug 2, 2023
8 checks passed
@ray6080 deleted the node-group branch August 2, 2023 11:18