unchecked::utf16to8 reads out of bounds if provided only leading surrogate #78
Comments
Well, unchecked means exactly that - no checks and segfaults.
The function is given the start and end iterators, and it walks past the end iterator. Sure, you can generate whatever output you like because, as you say, it's unchecked, but failing to check for the end iterator is poor engineering. For example something like wide char strlen
The purpose of the unchecked namespace is to avoid paying the price for safety checks if and only if you are 100% sure there is no invalid UTF-8 of any kind. In practice, that means text you produce yourself one way or another (i.e. string literals or text resources) or text that was checked by something like the is_valid() function. What you are asking for is something between checked and unchecked behavior. Nothing wrong with that - I understand some checks may be more important in some scenarios than others, but I am not penalizing users of the unchecked namespace who are using it as it was intended. If you want to fork utfcpp and make only the checks you want, that's fine; my suggestion would be to start from the checked version and remove the checks you don't care about, rather than adding checks to unchecked functions.
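For UTF-16 input, the "check first, then call the unchecked function" approach described above might look roughly like the sketch below. The helper `is_well_formed_utf16` is a hypothetical name, not part of utfcpp; it only checks that every surrogate belongs to a lead/trail pair, which is the property the unchecked conversion relies on here.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, NOT part of utfcpp: returns true when every surrogate
// code unit in [first, last) is part of a well-formed lead/trail pair.
template <typename It>
bool is_well_formed_utf16(It first, It last) {
    while (first != last) {
        const std::uint32_t unit = static_cast<std::uint16_t>(*first++);
        if (unit >= 0xD800u && unit <= 0xDBFFu) {
            // A lead surrogate must be followed by a trail surrogate.
            if (first == last) return false;  // lone lead at the end of input
            const std::uint32_t next = static_cast<std::uint16_t>(*first++);
            if (next < 0xDC00u || next > 0xDFFFu) return false;
        } else if (unit >= 0xDC00u && unit <= 0xDFFFu) {
            return false;  // trail surrogate without a preceding lead
        }
    }
    return true;
}
```

A caller would run this once over untrusted input and only hand the range to the unchecked conversion when it returns true.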
I stumbled upon this as well. For lurkers, I coded a
I've just hit this too, and here are my thoughts... FYI, my use of utfcpp is very limited. I literally only need to convert to and from utf8, ideally without having to deal with exceptions (it's for a cross-platform SDK), so I haven't gone through the entire API yet to know of any workarounds.
I think a simple check after the is_lead_surrogate (can't be before) would fix this. E.g:
I'm happy to use utfcpp as-is and make any changes I require. These are just my thoughts.
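A minimal sketch of the kind of single end-of-range check suggested above, placed right after the lead surrogate is detected. This is not utfcpp's actual code: `utf16to8_bounded` is a hypothetical name, and silently dropping a lone trailing lead surrogate is an assumption made for illustration. Everything else stays unchecked (the trail value is not validated).

```cpp
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for an unchecked UTF-16 to UTF-8 loop, NOT utfcpp's code.
template <typename In, typename Out>
Out utf16to8_bounded(In start, In end, Out out) {
    while (start != end) {
        std::uint32_t cp = static_cast<std::uint16_t>(*start++);
        if (cp >= 0xD800u && cp <= 0xDBFFu) {      // lead surrogate
            if (start == end) break;               // the one added check: no trail to read
            const std::uint32_t trail = static_cast<std::uint16_t>(*start++);
            cp = 0x10000u + ((cp - 0xD800u) << 10) + (trail - 0xDC00u);
        }
        // Encode the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80u) {
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800u) {
            *out++ = static_cast<char>(0xC0u | (cp >> 6));
            *out++ = static_cast<char>(0x80u | (cp & 0x3Fu));
        } else if (cp < 0x10000u) {
            *out++ = static_cast<char>(0xE0u | (cp >> 12));
            *out++ = static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
            *out++ = static_cast<char>(0x80u | (cp & 0x3Fu));
        } else {
            *out++ = static_cast<char>(0xF0u | ((cp >> 18) & 0x07u));
            *out++ = static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu));
            *out++ = static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
            *out++ = static_cast<char>(0x80u | (cp & 0x3Fu));
        }
    }
    return out;
}
```

The check sits inside the surrogate branch, so input without surrogates pays nothing extra, and the final character of ordinary input is still converted rather than dropped.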
Fixing regression caused by the fix for #78, which leads to utf8::unchecked::utf16to8() chopping off the last character in many cases.
If given an array of the form [..., 0xd800u], utf16to8 will try to read the trailing surrogate without checking whether it has gone past the end of the array. As a result, the loop `while (start != end)` never stops until the code tries to read unmapped memory and segfaults.
I know that it's `unchecked`, but segfaulting on invalid input is a pretty grim failure mode :)