Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Throw copy exception on invalid utf8 string #2208

Merged
merged 1 commit into from
Oct 13, 2023
Merged

Throw copy exception on invalid utf8 string #2208

merged 1 commit into from
Oct 13, 2023

Conversation

acquamarin
Copy link
Collaborator

@acquamarin acquamarin commented Oct 13, 2023

Closes #2148

@@ -340,6 +341,9 @@ void copyStringToVector(ValueVector* vector, uint64_t rowToAdd, std::string_view
vector, rowToAdd, reinterpret_cast<char*>(blobBuffer.get()), blobLen);
} break;
case LogicalTypeID::STRING: {
if (!utf8proc::Utf8Proc::isValid(strVal.data(), strVal.length())) {
throw common::CopyException{"Invalid UTF8-encoded string."};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also print the strVal in error message?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think printing the invalid utf8 characters is undefined behaviour and causes problems when writing the exception test for utf8 strings.

@Riolku
Copy link
Contributor

Riolku commented Oct 13, 2023

If you add Closes #2148 to your commit body, you needn't manually close the linked issue on merge.

@Riolku
Copy link
Contributor

Riolku commented Oct 13, 2023

I'm fairly certain CSV reading isn't the only place invalid UTF-8 can enter a string column. We need to make sure that it every cast function fails on invalid UTF-8.

This is yet another reason to create a harmonized type system (besides better support across all parts of the system).

@codecov
Copy link

codecov bot commented Oct 13, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (faa5f98) 89.43% compared to head (11f3a52) 89.42%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2208      +/-   ##
==========================================
- Coverage   89.43%   89.42%   -0.01%     
==========================================
  Files        1007     1007              
  Lines       36317    36319       +2     
==========================================
  Hits        32479    32479              
- Misses       3838     3840       +2     
Files Coverage Δ
...rocessor/operator/persistent/reader/csv/driver.cpp 95.66% <100.00%> (+0.03%) ⬆️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Riolku
Copy link
Contributor

Riolku commented Oct 13, 2023

See #2148 for why this PR is not sufficient. We should probably aim to fix this before the next major release, it's not a small task in my opinion.

@acquamarin acquamarin merged commit 3280465 into master Oct 13, 2023
11 checks passed
@acquamarin acquamarin deleted the utf8-check branch October 13, 2023 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upper and Lower do not handle invalid UTF-8
3 participants