-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Throw copy exception on invalid utf8 string #2208
Conversation
da30e53
to
3f21e44
Compare
3f21e44
to
11f3a52
Compare
@@ -340,6 +341,9 @@ void copyStringToVector(ValueVector* vector, uint64_t rowToAdd, std::string_view | |||
vector, rowToAdd, reinterpret_cast<char*>(blobBuffer.get()), blobLen); | |||
} break; | |||
case LogicalTypeID::STRING: { | |||
if (!utf8proc::Utf8Proc::isValid(strVal.data(), strVal.length())) { | |||
throw common::CopyException{"Invalid UTF8-encoded string."}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also print the strVal in error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think printing the invalid utf8 characters is undefined behaviour and causes problems when writing the exception test for utf8 strings.
If you add |
I'm fairly certain CSV reading isn't the only place invalid UTF-8 can enter a string column. We need to make sure that it every cast function fails on invalid UTF-8. This is yet another reason to create a harmonized type system (besides better support across all parts of the system). |
Codecov ReportAll modified lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2208 +/- ##
==========================================
- Coverage 89.43% 89.42% -0.01%
==========================================
Files 1007 1007
Lines 36317 36319 +2
==========================================
Hits 32479 32479
- Misses 3838 3840 +2
☔ View full report in Codecov by Sentry. |
See #2148 for why this PR is not sufficient. We should probably aim to fix this before the next major release, it's not a small task in my opinion. |
Closes #2148