Refactor copy node #1590

Merged
merged 1 commit into from
Jun 6, 2023

Conversation

acquamarin
Collaborator

@acquamarin acquamarin commented May 29, 2023

  1. Refactors the copy node pipeline so that the existing processor task pipeline can be reused to execute the copy node task.
  2. CopyNode queries are now compiled into two pipelines: the first pipeline performs the actual copy, and the second pipeline returns the copy message.
First pipeline: read_file (either read_csv or read_parquet) -> copy_node (takes in arrow batches and copies them into the database)
Second pipeline: FactorizedTableScan (scans the ftable, which contains only the copy message) -> ResultCollector (collects the copy message)
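
The two-pipeline layout described above can be sketched roughly as follows. This is only an illustration under assumptions: `Batch`, `MessageTable`, `collectResult`, `runCopy`, and the message text are hypothetical stand-ins, not the classes or output in this PR, and the pipelines run sequentially here rather than through the processor's task scheduler.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an arrow record batch.
struct Batch {
    std::vector<int> rows;
};

// Pipeline 1 source: yields batches, standing in for read_csv / read_parquet.
class ReadFile {
public:
    explicit ReadFile(std::vector<Batch> batches) : batches_(std::move(batches)) {}

    // Returns false once every batch has been handed out.
    bool next(Batch& out) {
        if (cursor_ >= batches_.size()) {
            return false;
        }
        out = batches_[cursor_++];
        return true;
    }

private:
    std::vector<Batch> batches_;
    std::size_t cursor_ = 0;
};

// Hypothetical stand-in for the ftable that holds only the copy message.
struct MessageTable {
    std::vector<std::string> messages;
};

// Pipeline 1 sink: appends each batch to the table, then records one message.
class CopyNode {
public:
    CopyNode(std::vector<int>& table, MessageTable& result)
        : table_(table), result_(result) {}

    void consume(const Batch& batch) {
        table_.insert(table_.end(), batch.rows.begin(), batch.rows.end());
        numCopied_ += batch.rows.size();
    }

    // Called once the source is exhausted; the message text is made up.
    void finalize() {
        result_.messages.push_back(std::to_string(numCopied_) + " tuples copied");
    }

private:
    std::vector<int>& table_;
    MessageTable& result_;
    std::size_t numCopied_ = 0;
};

// Pipeline 2: scan the message table and collect its content.
std::vector<std::string> collectResult(const MessageTable& ftable) {
    return ftable.messages;
}

// Runs pipeline 1 to completion, then pipeline 2, and returns the message.
std::string runCopy(std::vector<Batch> input, std::vector<int>& table) {
    MessageTable ftable;
    ReadFile reader(std::move(input));
    CopyNode copyNode(table, ftable);
    Batch batch;
    while (reader.next(batch)) {
        copyNode.consume(batch);
    }
    copyNode.finalize();
    return collectResult(ftable).front();
}
```

The point of the split is that the second pipeline depends only on the small message table, so it can reuse the same scan-and-collect operators that ordinary queries use.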

@ray6080 ray6080 self-requested a review May 29, 2023 22:22
@codecov

codecov bot commented May 29, 2023

Codecov Report

Patch coverage: 97.47% and project coverage change: +0.10 🎉

Comparison is base (0248f05) 91.66% compared to head (011255c) 91.76%.

❗ Current head 011255c differs from pull request most recent head e524ab5. Consider uploading reports for the commit e524ab5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1590      +/-   ##
==========================================
+ Coverage   91.66%   91.76%   +0.10%     
==========================================
  Files         716      714       -2     
  Lines       26042    25907     -135     
==========================================
- Hits        23871    23774      -97     
+ Misses       2171     2133      -38     
Impacted Files Coverage Δ
src/include/common/types/types.h 100.00% <ø> (ø)
src/include/common/vector/value_vector.h 100.00% <ø> (ø)
src/include/processor/mapper/plan_mapper.h 100.00% <ø> (ø)
src/include/processor/operator/physical_operator.h 100.00% <ø> (ø)
src/include/processor/operator/result_collector.h 100.00% <ø> (ø)
src/include/processor/processor.h 100.00% <ø> (ø)
src/include/processor/result/factorized_table.h 96.77% <ø> (ø)
src/include/processor/operator/copy/copy_npy.h 57.14% <57.14%> (ø)
src/processor/operator/copy/copy_npy.cpp 75.00% <75.00%> (ø)
src/processor/mapper/map_ddl.cpp 97.67% <94.73%> (-2.33%) ⬇️
... and 24 more

... and 78 files with indirect coverage changes


Contributor

@ray6080 ray6080 left a comment

I still think we should separate the changes to ValueVector (arrow aux) from the changes to the Copy operators. As far as I can see, both need more work. We can get the ValueVector part done first.

@@ -249,6 +249,9 @@ void LogicalType::setPhysicalType() {
case LogicalTypeID::STRUCT: {
physicalType = PhysicalTypeID::STRUCT;
} break;
case LogicalTypeID::ARROW_DATA: {
Contributor

This doesn't look like a good way to differentiate an arrow array from others. Can we introduce something like an AuxiliaryDataType?
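
One possible reading of this suggestion, sketched under assumptions: keep the logical type unchanged and carry a separate tag that describes how the auxiliary buffer is backed. `AuxiliaryDataType` is only the name floated in the comment; the enum values, `VectorDescriptor`, `defaultAuxFor`, and `makeArrowBacked` below are hypothetical and not part of the codebase.

```cpp
#include <cstdint>

// Simplified, hypothetical subset of logical types (no arrow entry here).
enum class LogicalTypeID : uint8_t { INT64, STRING, VAR_LIST, STRUCT };

// Hypothetical tag describing what kind of auxiliary buffer a vector owns.
enum class AuxiliaryDataType : uint8_t { NONE, STRING_OVERFLOW, LIST, STRUCT, ARROW_COLUMN };

// A vector is described by both tags, so arrow backing is orthogonal
// to the logical type of the values it holds.
struct VectorDescriptor {
    LogicalTypeID logicalType;
    AuxiliaryDataType auxType;
};

// Default auxiliary kind implied by a logical type when no arrow data is involved.
AuxiliaryDataType defaultAuxFor(LogicalTypeID type) {
    switch (type) {
    case LogicalTypeID::STRING:
        return AuxiliaryDataType::STRING_OVERFLOW;
    case LogicalTypeID::VAR_LIST:
        return AuxiliaryDataType::LIST;
    case LogicalTypeID::STRUCT:
        return AuxiliaryDataType::STRUCT;
    default:
        return AuxiliaryDataType::NONE;
    }
}

// An arrow-backed vector keeps its real logical type and only changes the aux tag.
VectorDescriptor makeArrowBacked(LogicalTypeID type) {
    return VectorDescriptor{type, AuxiliaryDataType::ARROW_COLUMN};
}
```

With this split, a switch over `LogicalTypeID` such as the one in `setPhysicalType()` would not need an arrow case at all.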

@@ -87,6 +87,9 @@ uint32_t ValueVector::getDataTypeSize(const LogicalType& type) {
case LogicalTypeID::VAR_LIST: {
return sizeof(list_entry_t);
}
case LogicalTypeID::ARROW_COLUMN: {
Contributor

I still don't think this is a good idea. It doesn't make sense to me to have ARROW column/array as a logical type.

@@ -1,5 +1,6 @@
#pragma once

#include "arrow/array.h"
Contributor

Let's think of a way to get rid of this include. I hope we can go through common/arrow. One way is to have a zero-copy conversion in copier from arrow's arrow_array to common's arrow_array.

@@ -43,6 +44,14 @@ class StructAuxiliaryBuffer : public AuxiliaryBuffer {
std::vector<std::shared_ptr<ValueVector>> childrenVectors;
};

class ArrowColumnAuxiliaryBuffer : public AuxiliaryBuffer {
Contributor

I think the best way to implement the arrow auxiliary is that we store the arrow array in the aux to keep its lifetime, and point to the aux array when possible from ValueVector. Let's discuss this a bit more offline.
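
A minimal sketch of that ownership idea, assuming a simplified column type: the auxiliary buffer holds a shared pointer so the array stays alive, and the vector reads through a raw pointer into it without copying. `FakeArrowArray` is a hypothetical stand-in for the real arrow array, and the class bodies and `setArrowColumn` are illustrative only, not the design that was eventually agreed on.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an arrow array of 64-bit integers.
struct FakeArrowArray {
    std::vector<int64_t> values;
};

class AuxiliaryBuffer {
public:
    virtual ~AuxiliaryBuffer() = default;
};

// Owns a reference to the column so its memory outlives the producer's handle.
class ArrowColumnAuxiliaryBuffer : public AuxiliaryBuffer {
public:
    explicit ArrowColumnAuxiliaryBuffer(std::shared_ptr<FakeArrowArray> column)
        : column_(std::move(column)) {}

    const int64_t* data() const { return column_->values.data(); }
    std::size_t size() const { return column_->values.size(); }

private:
    std::shared_ptr<FakeArrowArray> column_;
};

class ValueVector {
public:
    // Stores the column in the aux buffer and points at its data (no copy).
    void setArrowColumn(std::shared_ptr<FakeArrowArray> column) {
        auto aux = std::make_unique<ArrowColumnAuxiliaryBuffer>(std::move(column));
        data_ = aux->data();
        size_ = aux->size();
        auxBuffer_ = std::move(aux);
    }

    int64_t getValue(std::size_t pos) const { return data_[pos]; }
    std::size_t size() const { return size_; }

private:
    std::unique_ptr<AuxiliaryBuffer> auxBuffer_; // keeps the pointed-to memory alive
    const int64_t* data_ = nullptr;              // borrowed view into the aux buffer
    std::size_t size_ = 0;
};
```

Because the auxiliary buffer shares ownership, the reader can drop its own handle to the column as soon as the vector has been populated.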

@acquamarin acquamarin merged commit 4c5e1be into master Jun 6, 2023
5 of 6 checks passed
@acquamarin acquamarin deleted the copy-refactor branch June 6, 2023 17:32
3 participants