Validate file header for LOAD and COPY #2210

Merged: 1 commit, Oct 14, 2023

andyfengHKU
Contributor

This PR partially solves issue #2139.

We add several validations to COPY and LOAD FROM:

  • The number of columns must match the DDL or the specified header.
  • For Parquet, each column type must match the DDL or the specified header. Note that this is not the final solution; eventually we should allow casting inside the Parquet reader.
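
As an illustration of the two checks listed above, here is a minimal sketch. It is not the code from this PR: the function names mirror the `validateNumColumns` and `validateColumnTypes` calls visible in the review snippet, but their bodies, the `sketch` namespace, the error messages, and the use of plain strings in place of `LogicalType` are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

namespace sketch { // hypothetical namespace, not part of the project

// Throws if the detected column count differs from the expected one.
inline void validateNumColumns(std::size_t expected, std::size_t detected) {
    if (expected != detected) {
        throw std::runtime_error("Number of columns mismatch. Expected " +
                                 std::to_string(expected) + " but got " +
                                 std::to_string(detected) + ".");
    }
}

// Throws if any detected type differs from the expected type at the same
// position. Types are plain strings here (an assumption for illustration).
inline void validateColumnTypes(const std::vector<std::string>& names,
    const std::vector<std::string>& expected,
    const std::vector<std::string>& detected) {
    assert(names.size() == expected.size());
    validateNumColumns(expected.size(), detected.size());
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (expected[i] != detected[i]) {
            throw std::runtime_error("Column `" + names[i] +
                                     "` type mismatch. Expected " + expected[i] +
                                     " but got " + detected[i] + ".");
        }
    }
}

} // namespace sketch
```

Under these assumptions, a Parquet file whose second column is a string while the DDL declares an integer would be rejected at bind time instead of failing later inside the reader.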


codecov bot commented Oct 13, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (3280465) 89.46% compared to head (80e9905) 89.58%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2210      +/-   ##
==========================================
+ Coverage   89.46%   89.58%   +0.12%     
==========================================
  Files        1007     1007              
  Lines       36241    36249       +8     
==========================================
+ Hits        32422    32475      +53     
+ Misses       3819     3774      -45     
Files Coverage Δ
src/binder/bind/bind_copy.cpp 94.83% <100.00%> (+2.94%) ⬆️
src/include/binder/binder.h 100.00% <ø> (ø)
...r/operator/persistent/reader/csv/base_csv_reader.h 100.00% <ø> (ø)
.../processor/operator/persistent/reader/csv/driver.h 100.00% <ø> (ø)
...operator/persistent/reader/csv/base_csv_reader.cpp 100.00% <ø> (ø)
...rocessor/operator/persistent/reader/csv/driver.cpp 97.12% <100.00%> (+0.68%) ⬆️
src/binder/bind/bind_reading_clause.cpp 95.65% <95.91%> (+15.99%) ⬆️

... and 8 files with indirect coverage changes


Comment on lines 211 to 240
void Binder::sniffFiles(const common::ReaderConfig& readerConfig,
    std::vector<std::string>& columnNames,
    std::vector<std::unique_ptr<common::LogicalType>>& columnTypes) {
    assert(readerConfig.getNumFiles() > 0);
    // The first file defines the schema that the remaining files are checked against.
    sniffFile(readerConfig, 0, columnNames, columnTypes);
    for (auto i = 1u; i < readerConfig.getNumFiles(); ++i) {
        std::vector<std::string> tmpColumnNames;
        std::vector<std::unique_ptr<LogicalType>> tmpColumnTypes;
        sniffFile(readerConfig, i, tmpColumnNames, tmpColumnTypes);
        switch (readerConfig.fileType) {
        case FileType::CSV: {
            // Only the column count is validated for CSV.
            validateNumColumns(columnTypes.size(), tmpColumnTypes.size());
        } break;
        case FileType::PARQUET: {
            validateNumColumns(columnTypes.size(), tmpColumnTypes.size());
            validateColumnTypes(columnNames, columnTypes, tmpColumnTypes);
        } break;
        case FileType::NPY: {
            // Each NPY file contributes exactly one column.
            validateNumColumns(1, tmpColumnTypes.size());
            columnNames.push_back(tmpColumnNames[0]);
            columnTypes.push_back(tmpColumnTypes[0]->copy());
        } break;
        default:
            break;
        }
    }
}
Contributor

We have some duplication here, right? Either we should validate the number of columns here, or in each bind step, but not both.

I think it'd be best to move the validation entirely to a separate function to consolidate the shared logic.
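
One possible shape for that consolidation is sketched below. It is only an illustration of the suggestion, not what was merged: `validateSchema`, the `sketch` namespace, the local `FileType` enum, and the string-based types are all hypothetical, and the NPY rule of one column per file is left out for brevity.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

namespace sketch { // hypothetical namespace, not part of the project

enum class FileType { CSV, PARQUET, NPY };

// Single entry point for schema checks. Both the multi-file sniffing loop and
// each bind step would call this instead of repeating the same comparisons.
inline void validateSchema(FileType fileType,
    const std::vector<std::string>& expectedTypes,
    const std::vector<std::string>& detectedTypes) {
    if (expectedTypes.size() != detectedTypes.size()) {
        throw std::runtime_error("Number of columns mismatch. Expected " +
                                 std::to_string(expectedTypes.size()) + " but got " +
                                 std::to_string(detectedTypes.size()) + ".");
    }
    // In this sketch only Parquet files are assumed to need a type comparison.
    if (fileType != FileType::PARQUET) {
        return;
    }
    for (std::size_t i = 0; i < expectedTypes.size(); ++i) {
        if (expectedTypes[i] != detectedTypes[i]) {
            throw std::runtime_error("Type mismatch at column " + std::to_string(i) +
                                     ". Expected " + expectedTypes[i] + " but got " +
                                     detectedTypes[i] + ".");
        }
    }
}

} // namespace sketch
```

With a helper like this, the per-file-type `switch` lives in one place, so a caller cannot validate the column count twice or skip it by accident.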

reader/csv: skip empty lines when sniffing

On CSVs without headers, we should skip any leading empty lines, and
return zero if all lines are empty.

Co-authored-by: Keenan G <41458184+Riolku@users.noreply.github.com>
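
The behaviour described in the commit message above can be sketched as follows. This is a simplified illustration rather than the project's CSV driver: `sniffNumColumns` and `sniffNumColumnsFromText` are hypothetical names, and the column count is taken by counting delimiters, which ignores quoting and escaping.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Returns the number of columns in the first non-empty line of a headerless
// CSV stream, or 0 if every line is empty. Naive: quoted delimiters are not
// handled (an assumption made to keep the sketch short).
inline std::size_t sniffNumColumns(std::istream& in, char delimiter = ',') {
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        if (line.empty()) {
            continue; // skip leading empty lines
        }
        return static_cast<std::size_t>(
                   std::count(line.begin(), line.end(), delimiter)) + 1;
    }
    return 0; // no non-empty line found
}

// Convenience wrapper for in-memory text.
inline std::size_t sniffNumColumnsFromText(const std::string& text, char delimiter = ',') {
    std::istringstream in(text);
    return sniffNumColumns(in, delimiter);
}
```

For example, `sniffNumColumnsFromText("\n\n1,2,3\n")` skips the two leading empty lines and counts three columns, while input made up only of newlines yields zero.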
@ray6080 ray6080 merged commit 0e3f995 into master Oct 14, 2023
11 checks passed
@ray6080 ray6080 deleted the issue-2139 branch October 14, 2023 03:01
4 participants