
bpo-36819: Fix out-of-bounds writes in encoders #13134

Closed · 3 commits

Conversation

@ATalaba commented May 6, 2019

The utf_16 and utf_32 encoders (found in _PyUnicode_EncodeUTF16 and
_PyUnicode_EncodeUTF32, respectively) don't properly resize the output
buffer. This leads to out-of-bounds writes and segfaults.

This change ensures that the encoder reallocates the buffer even when
the general error handler writes only one code point, and allocates
enough extra memory to fit the full result.

https://bugs.python.org/issue36819
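
For context, a minimal sketch of the kind of reproducer described in bpo-36819 (the handler name, replacement size, and exact call are illustrative, not the script from the report): an error handler returns a replacement far larger than the space the encoder reserved for it.

import codecs

# Hypothetical reproducer in the spirit of bpo-36819: the replacement is far
# larger than the space reserved for the single unencodable code point, so an
# unpatched encoder can write past the end of its output buffer.
def oversized(exc):
    return ("x" * 2**16, exc.end)

codecs.register_error("oversized", oversized)
"\udc80".encode("utf-16", errors="oversized")  # may segfault on unpatched builds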

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please allow at least one business day
for our records to be updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@serhiy-storchaka (Member)

Could you please add tests?

@fried (Contributor) commented May 6, 2019

The repro in the bpo no longer aborts Python with these fixes, but the examples now cause infinite loops.

Allowing the error handler to return a continue position at the location it just fixed is probably a bug that we should detect.
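
To illustrate the loop described above (the handler name is hypothetical): a handler that returns a continue position that never advances keeps the encoder revisiting the same error forever.

import codecs

# Sketch of the infinite-loop scenario: the handler supplies a replacement
# but asks the codec to resume at exc.start, so the same lone surrogate is
# encountered again on every iteration.
def stuck(exc):
    return ("a", exc.start)

codecs.register_error("stuck", stuck)
# "\udc80".encode("utf-16", errors="stuck")  # would loop forever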

@ATalaba (Author) commented May 6, 2019

All built-in encoders/decoders (or at least all that I've checked) trust the error handlers they dispatch to enough that they simply set their continue position to whatever the handler returns (this behavior is documented on the codecs doc page). Decoders trust their error handlers enough that handlers are even allowed to change the bytes object being decoded.

I figured this was the desired behavior and made the PR to match it. A fix that tries to prevent potential OOMs throughout codecs would require a change to the full functionality of error handlers.

I personally don't see a reason to give user-defined error handlers the amount of power they currently have, but I thought a change of that scope would require a lot more discussion, and that it was more important to first stop the interpreter from writing to unallocated memory.
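
For reference, a well-behaved handler per the codecs documentation (the names here are illustrative): it returns a replacement plus the position at which to continue, and the codec trusts that position unconditionally.

import codecs

def replace_with_marker(exc):
    if isinstance(exc, UnicodeEncodeError):
        # Replacement text plus the position at which encoding resumes;
        # the codec uses this position without further validation.
        return ("?", exc.end)
    raise exc

codecs.register_error("marker", replace_with_marker)
print("a\udc80b".encode("utf-16-le", errors="marker"))  # b'a\x00?\x00b\x00'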

@serhiy-storchaka (Member)

Thank you. I am not sure the fix is correct. It overallocates more memory than necessary with "good" error handlers, and it leads to OOM with a "bad" error handler.

I think the root bug is that the encoders allow an infinite loop. After fixing that, the out-of-bounds writes may be fixed too.
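
A hedged sketch of the overallocation concern (names and sizes are hypothetical): even a well-behaved one-character handler is invoked once per error, so an encoder that grows its buffer generously on every invocation can allocate far more memory than the final output needs.

import codecs

# Hypothetical stress case: many isolated errors mean many handler calls.
# If each call triggers a large buffer overallocation, peak memory use can
# grow much faster than the encoded result itself.
def one_char(exc):
    return ("a", exc.end)

codecs.register_error("one_char", one_char)
("x\udc80" * 100_000).encode("utf-16", errors="one_char")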

@csabella (Contributor)

@ATalaba, please address Serhiy's last comments. Thank you.

@vstinner (Member) left a comment

LGTM, but I would prefer if @serhiy-storchaka, our UTF-16 and UTF-32 codec expert :-D, could also review the change.

I also proposed a minor enhancement.

return ("a", exc.end if state.check() == 50 else exc.start)

codecs.register_error("err", err)
codecs.utf_32_encode("\udc80", "err")
@vstinner (Member)

You can maybe add a test on len(State.num). I expect an exact number of calls.

Same remark for the utf_16_encode() test.
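
A hedged sketch of the kind of call-count check being suggested; this State is a hypothetical stand-in for the helper in the PR's test.

import codecs

class State:
    # Hypothetical stand-in for the test's State helper: one entry is
    # appended to State.num per error-handler invocation.
    num = []

def counted(exc):
    State.num.append(exc.start)
    return ("a", exc.end)

codecs.register_error("counted", counted)
"x\udc80y\udc80z".encode("utf-32", errors="counted")
assert len(State.num) == 2  # two isolated surrogates -> exactly two calls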

@zooba (Member) left a comment

I'm not Serhiy, but it looks good to me (it's a bit obscure how we handle incorrectly sized bytes results from the error handler, but it checks out).

@vstinner (Member)

@ATalaba: Would you mind addressing my remark and rebasing your PR to fix the merge conflict?

@serhiy-storchaka (Member)

See #28593 for a more complete solution.

@serhiy-storchaka (Member)

Fixed in #28593.
