bpo-36819: Fix out-of-bounds writes in encoders #13134
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. If you have recently signed the CLA, please wait at least one business day. You can check yourself to see if the CLA has been received. Thanks again for your contribution; we look forward to reviewing it!
Could you please add tests?
The repro in the bpo no longer aborts Python with these fixes, but the examples cause infinite loops. Allowing the error handler to return a continue position at the location it supposedly just fixed is probably a bug that we should detect.
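To make the infinite-loop concern concrete, here is a minimal sketch (the handler name and the bail-out threshold are illustrative, not from the PR): a handler that keeps returning exc.start makes the encoder retry the same position forever, so this sketch bails out after a few calls.

```python
import codecs

calls = 0

def retry_in_place(exc):
    """Hypothetical handler: returning exc.start makes the encoder
    retry the same offending position, looping forever unless the
    handler eventually moves past it."""
    global calls
    calls += 1
    if calls > 3:
        return ("", exc.end)  # bail out so the demo terminates
    return ("", exc.start)    # resume exactly where the error occurred

codecs.register_error("retry-in-place", retry_in_place)
"\udc80".encode("utf-8", "retry-in-place")
print(calls)  # the handler ran 4 times for a single bad character
```

Without the `calls > 3` escape hatch this loops indefinitely, which is the behavior being reported above.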
All (or at least all that I've checked) builtin encoders/decoders trust the error handlers they dispatch to enough that they just set their continue position to whatever the handler returns (which is documented on the codecs doc page). Decoders trust their error handlers enough that handlers are even allowed to change the bytes object being decoded. I figured this was the desired behavior and made the PR to match it. A fix that tries to prevent potential OOMs throughout codecs would require changing the full functionality of error handlers. I personally don't see a reason to give user-defined error handlers the amount of power they currently have, but I thought a change of that scope would require a lot more discussion, and that it was more important to first stop the interpreter from writing to unallocated memory.
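As a small illustration of that contract (the handler name here is made up for the sketch): a handler registered with codecs.register_error returns a (replacement, resume_position) tuple, and the codec continues from whatever position the handler hands back.

```python
import codecs

def replace_and_skip(exc):
    # The (replacement, resume_position) tuple is taken at face
    # value by the codec; nothing validates resume_position.
    return ("?", exc.end)

codecs.register_error("replace-and-skip", replace_and_skip)
print("ab\udc80cd".encode("utf-8", "replace-and-skip"))  # b'ab?cd'
```

Returning exc.end advances past the bad character; a handler is equally free to return an earlier position, which is the power being discussed above.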
The utf_16 and utf_32 encoders (found in _PyUnicode_EncodeUTF16 and _PyUnicode_EncodeUTF32, respectively) don't properly resize the output buffer, leading to out-of-bounds writes and segfaults. This change ensures that the encoder reallocates the buffer even when the general error handler writes only one codepoint, and allocates enough extra memory to fit the full result.
Force-pushed from 26b1445 to 342643b
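The pattern the fix targets can be sketched like this (the handler name is invented for this sketch): an error handler whose replacement is much longer than the single character it stands in for, which the unpatched utf-32 encoder did not reserve output space for.

```python
import codecs

def inflate(exc):
    # A 100-character replacement for a single lone surrogate; on
    # unpatched interpreters the utf-32 encoder under-allocated its
    # output buffer for replacements like this.
    return ("x" * 100, exc.end)

codecs.register_error("inflate", inflate)
data, consumed = codecs.utf_32_encode("\udc80", "inflate")
print(consumed)               # 1 character of input was consumed
print(data.decode("utf-32"))  # 100 'x' characters
```

On a fixed interpreter this simply produces a large output; on an unpatched one it could write past the end of the allocated buffer.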
Thank you. I am not sure the fix is correct. It overallocates more memory than necessary with "good" error handlers, which leads to OOM with a "bad" error handler. I think the root bug is that encoders allow an infinite loop. After fixing this, the out-of-bounds writes may be fixed too.
@ATalaba, please address Serhiy's last comments. Thank you.
LGTM, but I would prefer if @serhiy-storchaka, our UTF-16 and UTF-32 codec expert :-D, could also review the change.
I also proposed a minor enhancement.
    return ("a", exc.end if state.check() == 50 else exc.start)

codecs.register_error("err", err)
codecs.utf_32_encode("\udc80", "err")
You can maybe add a test on len(State.num). I expect an exact number of calls.
Same remark for the utf_16_encode() test.
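A call-count assertion along the lines suggested here could look like this sketch (the threshold of 3 is arbitrary for brevity; the reviewed test uses 50):

```python
import codecs

class State:
    num = []  # one entry appended per handler invocation

    def check(self):
        self.num.append(None)
        return len(self.num)

state = State()

def err(exc):
    # Retry at exc.start until check() reaches the threshold, then
    # move past the error; len(State.num) is the exact call count.
    return ("a", exc.end if state.check() >= 3 else exc.start)

codecs.register_error("err-counted", err)
codecs.utf_32_encode("\udc80", "err-counted")
print(len(State.num))  # exactly 3 handler calls
```

Asserting on len(State.num) pins down how many times the encoder re-enters the handler, which would catch a regression into the infinite-loop behavior.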
I'm not Serhiy, but it looks good to me (it's a bit obscure how we handle incorrectly-sized bytes results from the error handler, but it checks out).
@ATalaba: Would you mind addressing my remark and rebasing your PR to fix the merge conflict?
See #28593 for a more complete solution.
Fixed in #28593. |
https://bugs.python.org/issue36819