unchecked::utf16to8 reads out of bounds if provided only leading surrogate #78

jimon · 2021-02-24T17:57:01Z

In case if provided an array of a form [..., 0xd800u], then utf16to8 will try to read trailing surrogate without even checking if we went outside of the array. This leads the following while (start != end) to never stop until the code tries to read unmapped memory and segfaults.

I know that it's unchecked, but segfaulting on invalid input is a pretty grim failure mode :)

The text was updated successfully, but these errors were encountered:

nemtrif · 2021-02-25T22:30:04Z

Well, unchecked means exactly that - no checks and segfaults.

jimon · 2021-02-25T23:13:57Z

The function is given the start and the end iterators, and the function walks over the end iterator. Sure you can generate whatever output because as you say it's unchecked, but failing to check for end iterator is poor engineering. For example something like wide char strlen wcslen would not go over NULL symbol even if data contains a broken UTF-16 sequence. We're using utfcpp at Unity, so it would be great to be able to use the upstream version, otherwise I'll have to fork and patch it. I can make a PR with a fix if you would like to merge it.

nemtrif · 2021-02-27T00:12:28Z

The purpose of the unchecked namespace is to avoid paying price for safety checks if and only if you are 100% sure there is no invalid UTF-8 of any kind. In practice, that means the text you produce one way or another (i.e. string literals or text resources) or text that was checked by something like is_valid() function.

What you are asking for is something between checked and unchecked behavior. Nothing wrong with that - I understand some checks may be more important in some scenarios than others, but I am not penalizing users of unchecked namespace who are using it as it was intended.

If you want to fork utfcpp and make only the checks you want, that's fine; my suggestion would be to start from the checked version and remove checks you don't care about, rather than adding checks to unchecked functions.

ceztko · 2023-02-13T22:09:39Z

I stumbled upon this as well. For lurkers, I coded a utf8::utf16to8_lenient function that won't crash when going out of bounds, and just stop on first invalid codepoint read.

ruifig · 2023-09-19T14:38:31Z

I've just hit this too, and here are my thoughts...

FYI, my use of utfcpp is very limited. I literally only need to convert to and from utf8, ideally without having to deal with exceptions (it's for a cross-platform SDK), so I haven't gone through the entire API yet to know of any workarounds.

I don't think there is a "is_valid" function to test utf16 (might be wrong). If that's the case, then it's problematic for someone who doesn't want to use any checked function due to the risk of exceptions.
unchecked::utf8to16 doesn't crash given invalid inputs, so I wasn't expecting unchecked::utf16to8 to crash either. The problem seems to be that utf16to8 increments the iterator in two places inside the loop, but only checks against "end" once.

I think a simple check after the is_lead_surrogate (can't be before) would fix this. E.g:

            // Take care of surrogate pairs first
                if (utf8::internal::is_lead_surrogate(cp)) {
                    if (start >= end)
                        return result;

I'm happy to use utfcpp as-is and make any changes I required. These are just my thoughts.

…d only leading surrogate

Fixing regression caused by the fix for #78, which leads to utf8::unchecked::utf16to8() chopping off the last character in many cases.

nemtrif closed this as completed Feb 25, 2021

nemtrif reopened this Sep 24, 2023

nemtrif added a commit that referenced this issue Sep 24, 2023

Fix for issue #78: unchecked::utf16to8 reads out of bounds if provide…

0ee84da

…d only leading surrogate

nemtrif closed this as completed Sep 24, 2023

nemtrif mentioned this issue Oct 28, 2023

utf8::utf16to8 and utf8::unchecked::utf16to8 produce different output #111

Closed

nemtrif added a commit that referenced this issue Oct 28, 2023

Fix for issue #111

972c550

Fixing regression caused by the fix for #78, which leads to utf8::unchecked::utf16to8() chopping off the last character in many cases.

afh mentioned this issue Dec 7, 2023

6 python tests failing in 3.3.0 ledger/ledger#2302

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unchecked::utf16to8 reads out of bounds if provided only leading surrogate #78

unchecked::utf16to8 reads out of bounds if provided only leading surrogate #78

jimon commented Feb 24, 2021

nemtrif commented Feb 25, 2021

jimon commented Feb 25, 2021 •

edited

Loading

nemtrif commented Feb 27, 2021

ceztko commented Feb 13, 2023

ruifig commented Sep 19, 2023

unchecked::utf16to8 reads out of bounds if provided only leading surrogate #78

unchecked::utf16to8 reads out of bounds if provided only leading surrogate #78

Comments

jimon commented Feb 24, 2021

nemtrif commented Feb 25, 2021

jimon commented Feb 25, 2021 • edited Loading

nemtrif commented Feb 27, 2021

ceztko commented Feb 13, 2023

ruifig commented Sep 19, 2023

jimon commented Feb 25, 2021 •

edited

Loading