Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unchecked::utf16to8 reads out of bounds if provided only leading surrogate #78

Closed
jimon opened this issue Feb 24, 2021 · 5 comments
Closed

Comments

@jimon
Copy link

jimon commented Feb 24, 2021

In case if provided an array of a form [..., 0xd800u], then utf16to8 will try to read trailing surrogate without even checking if we went outside of the array. This leads the following while (start != end) to never stop until the code tries to read unmapped memory and segfaults.

I know that it's unchecked, but segfaulting on invalid input is a pretty grim failure mode :)

@nemtrif
Copy link
Owner

nemtrif commented Feb 25, 2021

Well, unchecked means exactly that - no checks and segfaults.

@nemtrif nemtrif closed this as completed Feb 25, 2021
@jimon
Copy link
Author

jimon commented Feb 25, 2021

The function is given the start and the end iterators, and the function walks over the end iterator. Sure you can generate whatever output because as you say it's unchecked, but failing to check for end iterator is poor engineering. For example something like wide char strlen wcslen would not go over NULL symbol even if data contains a broken UTF-16 sequence. We're using utfcpp at Unity, so it would be great to be able to use the upstream version, otherwise I'll have to fork and patch it. I can make a PR with a fix if you would like to merge it.

@nemtrif
Copy link
Owner

nemtrif commented Feb 27, 2021

The purpose of the unchecked namespace is to avoid paying price for safety checks if and only if you are 100% sure there is no invalid UTF-8 of any kind. In practice, that means the text you produce one way or another (i.e. string literals or text resources) or text that was checked by something like is_valid() function.

What you are asking for is something between checked and unchecked behavior. Nothing wrong with that - I understand some checks may be more important in some scenarios than others, but I am not penalizing users of unchecked namespace who are using it as it was intended.

If you want to fork utfcpp and make only the checks you want, that's fine; my suggestion would be to start from the checked version and remove checks you don't care about, rather than adding checks to unchecked functions.

@ceztko
Copy link

ceztko commented Feb 13, 2023

I stumbled upon this as well. For lurkers, I coded a utf8::utf16to8_lenient function that won't crash when going out of bounds, and just stop on first invalid codepoint read.

@ruifig
Copy link

ruifig commented Sep 19, 2023

I've just hit this too, and here are my thoughts...

FYI, my use of utfcpp is very limited. I literally only need to convert to and from utf8, ideally without having to deal with exceptions (it's for a cross-platform SDK), so I haven't gone through the entire API yet to know of any workarounds.

  • I don't think there is a "is_valid" function to test utf16 (might be wrong). If that's the case, then it's problematic for someone who doesn't want to use any checked function due to the risk of exceptions.
  • unchecked::utf8to16 doesn't crash given invalid inputs, so I wasn't expecting unchecked::utf16to8 to crash either. The problem seems to be that utf16to8 increments the iterator in two places inside the loop, but only checks against "end" once.

I think a simple check after the is_lead_surrogate (can't be before) would fix this. E.g:

            // Take care of surrogate pairs first
                if (utf8::internal::is_lead_surrogate(cp)) {
                    if (start >= end)
                        return result;

I'm happy to use utfcpp as-is and make any changes I required. These are just my thoughts.

@nemtrif nemtrif reopened this Sep 24, 2023
nemtrif added a commit that referenced this issue Sep 24, 2023
@nemtrif nemtrif closed this as completed Sep 24, 2023
nemtrif added a commit that referenced this issue Oct 28, 2023
Fixing regression caused by the fix for #78, which leads to utf8::unchecked::utf16to8() chopping off the last character in many cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants