Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow rel copier #1154

Merged
merged 1 commit into from
Jan 15, 2023
Merged

Arrow rel copier #1154

merged 1 commit into from
Jan 15, 2023

Conversation

weipang142857
Copy link
Contributor

Implemented arrow rel copier
Refactored code of both arrow rel copier and arrow node copier
Deleted old rel copier
Changed some test files to accommodate CSV standard

Copy link
Contributor

@semihsalihoglu-uw semihsalihoglu-uw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start. I have a few medium-level suggestions, which you should address. Let's also discuss the one about the "list delimiter".

Also: why are there no tests using parquet and arrow in this PR? I only see tests using csv. Those should also be added.

dataset/long-string-pk-tests/eKnows.csv Show resolved Hide resolved
dataset/long-string-pk-tests/eKnows.csv Show resolved Hide resolved
src/include/storage/in_mem_csv_copier/copy_csv_task.h Outdated Show resolved Hide resolved
src/processor/operator/copy_csv/copy_rel_csv.cpp Outdated Show resolved Hide resolved
src/storage/in_mem_csv_copier/in_mem_structures_copier.cpp Outdated Show resolved Hide resolved
src/storage/in_mem_csv_copier/in_mem_arrow_rel_copier.cpp Outdated Show resolved Hide resolved
src/storage/in_mem_csv_copier/in_mem_arrow_rel_copier.cpp Outdated Show resolved Hide resolved
src/storage/in_mem_csv_copier/in_mem_arrow_rel_copier.cpp Outdated Show resolved Hide resolved
src/storage/in_mem_csv_copier/in_mem_arrow_rel_copier.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@semihsalihoglu-uw semihsalihoglu-uw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a small list of comments but one of them (the possible bug about unique_pointer. I want to see that fixed and also look at the code that fixes that bug. Other than that, there are a few minor comments.

src/include/common/csv_reader/csv_reader.h Outdated Show resolved Hide resolved
src/include/common/csv_reader/csv_reader.h Outdated Show resolved Hide resolved
const string filePath;
const CSVReaderConfig csvReaderConfig;
shared_ptr<ReaderConfig> readerConfig;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a shared_ptr? I think it should be a unique_ptr and you can share the raw pointer if you have to. I see you say this:

Using unique_ptr leads to some errors, which indicates that it relates to calling the destructor of CopyDescription, I think it's because of using std::move on CopyDescription in the constructor of some other class when initializing. Using shared_ptr avoids the error.

Whatever it is, we need to first understand why unique_ptr does not work. It could be a bug. We can't fix a problem without understanding it. So whatever the problem, first let's understand it and find a fix (I don't think there is any good reason for this to be shared_ptr.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem to be a bug. There are a few places in the code base that copies the CopyDescription object, which leads to copying the unique_ptr. Since unique_ptr does not allow having multiple pointers pointing to the same object, it does not have a copy constructor. Therefore, an error about calling an implicitly-deleted copy constructor is triggered. Using shared_ptr avoids this issue.

src/include/common/csv_reader/csv_reader.h Outdated Show resolved Hide resolved
src/include/common/configs.h Outdated Show resolved Hide resolved
src/storage/copy_arrow/copy_rel_arrow.cpp Outdated Show resolved Hide resolved
void InMemArrowRelCopier::putPropsOfLineIntoLists(InMemArrowRelCopier* copier,
vector<PageByteCursor>& inMemOverflowFileCursors, const vector<shared_ptr<T>>& batchColumns,
const vector<nodeID_t>& nodeIDs, const vector<uint64_t>& reversePos, int64_t blockOffset,
int64_t& colIndex, char delimiter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me clarify what I was referring to when saying it was a hack. In initCSVReader there is this line:

arrowParse.delimiter = copyDescription.readerConfig->tokenSeparator;

Why is delimter being set to tokenSeparator? This looks like a hack. tokenSeparator is a CSV-specific option we are giving. It has nothing to do with the List data type. So again, let's discuss this.

Update: I think I now understand why I was confused. The reason is ReaderConfig::tokenSeparator is a confusing name because it should be called ReaderConfig::delimiter. I left a comment on this in csv_reader.h.

src/include/common/csv_reader/csv_reader.h Outdated Show resolved Hide resolved
Copy link
Contributor

@semihsalihoglu-uw semihsalihoglu-uw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job.

CopyDescription::CopyDescription(const CopyDescription& copyDescription)
: filePath{copyDescription.filePath}, readerConfig{nullptr}, fileType{
copyDescription.fileType} {
if (fileType == FileType::CSV) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check for this null or not condition directly on the readerConfig field. So something like:

if (copyDescription.readerConfig) {
  this->readerConfig = make_unique<CSVReaderConfig>(*copyDescription.readerConfig);
}

@weipang142857 weipang142857 merged commit 4cf8afc into master Jan 15, 2023
@weipang142857 weipang142857 deleted the arrow_rel branch January 15, 2023 11:20
@ray6080 ray6080 changed the title Implementation of arrow rel copier Arrow rel copier Jan 29, 2023
@ray6080 ray6080 added the feature New features or missing components of existing features label Jan 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New features or missing components of existing features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants