
bpo-36819: Fix out-of-bounds writes in encoders #13134

Closed · 3 commits

Conversation

@ATalaba commented May 6, 2019

The utf_16 and utf_32 encoders (found in _PyUnicode_EncodeUTF16 and
_PyUnicode_EncodeUTF32, respectively) don't properly resize the output
buffer. This leads to out-of-bounds writes and segfaults.

This change ensures that the encoder reallocates the buffer even when
the general error handler writes only one code point, and allocates
enough extra memory to fit the full result.

https://bugs.python.org/issue36819
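
For context, a minimal sketch of the kind of reproducer described in bpo-36819 (the handler name, replacement size, and exact call are illustrative, not the script from the report): an error handler returns a replacement far larger than the space the encoder reserved for it.

import codecs

# Hypothetical reproducer in the spirit of bpo-36819: the replacement is far
# larger than the space reserved for the single unencodable code point, so an
# unpatched encoder can write past the end of its output buffer.
def oversized(exc):
    return ("x" * 2**16, exc.end)

codecs.register_error("oversized", oversized)
"\udc80".encode("utf-16", errors="oversized")  # may segfault on unpatched builds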

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please allow at least one business day
for our records to be updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@serhiy-storchaka (Member)

Could you please add tests?

@fried (Contributor) commented May 6, 2019

The repro in the bpo no longer aborts Python with these fixes, but the examples now cause infinite loops.

Allowing the error handler to return a continue position at the location it just fixed is probably a bug that we should detect.
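
To illustrate the loop described above (the handler name is hypothetical): a handler that returns a continue position that never advances keeps the encoder revisiting the same error forever.

import codecs

# Sketch of the infinite-loop scenario: the handler supplies a replacement
# but asks the codec to resume at exc.start, so the same lone surrogate is
# encountered again on every iteration.
def stuck(exc):
    return ("a", exc.start)

codecs.register_error("stuck", stuck)
# "\udc80".encode("utf-16", errors="stuck")  # would loop forever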

@ATalaba (Author) commented May 6, 2019

All built-in encoders/decoders (or at least all that I've checked) trust the error handlers they dispatch to enough that they simply set their continue position to whatever the handler returns (this behavior is documented on the codecs doc page). Decoders trust their error handlers enough that handlers are even allowed to change the bytes object being decoded.

I figured this was the desired behavior and made the PR to match it. A fix that tries to prevent potential OOMs throughout codecs would require a change to the full functionality of error handlers.

I personally don't see a reason to give user-defined error handlers the amount of power they currently have, but I thought a change of that scope would require a lot more discussion, and that it was more important to first stop the interpreter from writing to unallocated memory.
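
For reference, a well-behaved handler per the codecs documentation (the names here are illustrative): it returns a replacement plus the position at which to continue, and the codec trusts that position unconditionally.

import codecs

def replace_with_marker(exc):
    if isinstance(exc, UnicodeEncodeError):
        # Replacement text plus the position at which encoding resumes;
        # the codec uses this position without further validation.
        return ("?", exc.end)
    raise exc

codecs.register_error("marker", replace_with_marker)
print("a\udc80b".encode("utf-16-le", errors="marker"))  # b'a\x00?\x00b\x00'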

@serhiy-storchaka (Member)

Thank you. I am not sure the fix is correct. It overallocates more memory than necessary with "good" error handlers, and it leads to OOM with a "bad" error handler.

I think the root bug is that the encoders allow an infinite loop. After fixing that, the out-of-bounds writes may be fixed too.
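
A hedged sketch of the overallocation concern (names and sizes are hypothetical): even a well-behaved one-character handler is invoked once per error, so an encoder that grows its buffer generously on every invocation can allocate far more memory than the final output needs.

import codecs

# Hypothetical stress case: many isolated errors mean many handler calls.
# If each call triggers a large buffer overallocation, peak memory use can
# grow much faster than the encoded result itself.
def one_char(exc):
    return ("a", exc.end)

codecs.register_error("one_char", one_char)
("x\udc80" * 100_000).encode("utf-16", errors="one_char")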

@csabella (Contributor)

@ATalaba, please address Serhiy's last comments. Thank you.

@vstinner (Member) left a comment

LGTM, but I would prefer if @serhiy-storchaka, our UTF-16 and UTF-32 codec expert :-D, could also review the change.

I also proposed a minor enhancement.

return ("a", exc.end if state.check() == 50 else exc.start)

codecs.register_error("err", err)
codecs.utf_32_encode("\udc80", "err")
@vstinner (Member)

You can maybe add a test on len(State.num). I expect an exact number of calls.

Same remark for the utf_16_encode() test.
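
A hedged sketch of the kind of call-count check being suggested; this State is a hypothetical stand-in for the helper in the PR's test.

import codecs

class State:
    # Hypothetical stand-in for the test's State helper: one entry is
    # appended to State.num per error-handler invocation.
    num = []

def counted(exc):
    State.num.append(exc.start)
    return ("a", exc.end)

codecs.register_error("counted", counted)
"x\udc80y\udc80z".encode("utf-32", errors="counted")
assert len(State.num) == 2  # two isolated surrogates -> exactly two calls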

@zooba (Member) left a comment

I'm not Serhiy, but it looks good to me (it's a bit obscure how we handle incorrectly sized bytes results from the error handler, but it checks out).

@vstinner (Member)

@ATalaba: Would you mind addressing my remark and rebasing your PR to fix the merge conflict?

@serhiy-storchaka (Member)

See #28593 for a more complete solution.

@serhiy-storchaka (Member)

Fixed in #28593.
