Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discourage use of str() type strings in python API #1582

Merged
merged 4 commits into from
Feb 12, 2018
Merged

Conversation

drzaeus77
Copy link
Collaborator

This PR attempts to remove use of the python2 str() type object in favor of the more explicit bytes(), which has a consistent behavior between python2/3. Any unknown or command line arguments which are meant to be compared against kernel strings (e.g. things on the ring buffer) should be of this object type. There should be no more use of encode/decode to mangle data coming from an untrusted source, as this has been shown to confuse the utf encoder.

The approach that this change takes does put a burden on tools authors, since it requires use of b"" type strings for any arguments that are fed to a c API. This includes the text of the program to be compiled itself.

As an aid to conversion, all python APIs that wrap a C api have added a sanity check to the arguments that are expected to be c-strings, _assert_is_bytes(). For now, this helper will check and silently convert, in the safest way feasible, a utf string to a ascii string. The assertion internally has a warning that will report incorrect API usage with a command-line configurable check. For instance, running python -W default ./killsnoop.py will report 2 incorrect uses:

/usr/lib/python3.6/site-packages/bcc/__init__.py:265: DeprecationWarning: not a bytes object: '\n#include <uapi/linux/ptrace.h>\n#inclu ...'
  text = _assert_is_bytes(text)
TIME      PID    COMM             SIG  TPID   RESULT
/usr/lib/python3.6/site-packages/bcc/__init__.py:430: DeprecationWarning: not a bytes object: 'events'

Follow-on commits should attempt to clean up these warnings on a tool-by-tool basis.

Signed-off-by: Brenden Blanco <bblanco@gmail.com>
Introduce some helpers for managing bytes/unicode objects in a way that
bridges the gap from python2 to 3.

1. Add printb() helper for writing bytes output directly to stdout. This
avoids complaints from print() in python3, which expects a unicode
str(). Since python 3.5, `b"" % bytes()` style format strings should
work and we can write tools with common code, once we convert format
strings to bytes.
http://legacy.python.org/dev/peps/pep-0461/

2. Add a class for wrapping command line arguments that are intended for
comparing to debugged memory, for instance running process COMM or
kernel pathname data. The approach takes some of the discussion from
http://legacy.python.org/dev/peps/pep-0383/ into account, though
unfortunately the python2-future implementation of "surrogateescape" is
buggy, therefore this iteration is partial.

The object instance should only ever be coerced into a bytes object.
This silently invokes encode(sys.getfilesystemencoding()), which if it
fails implies that the tool was passed junk characters on the command
line. Thereafter the tool should implement only bytes-bytes comparisons
(for instance re.search(b"", b"")) and bytes stdout printing (see
printb).

3. Add an _assert_is_bytes helper to check for improper usage of str
objects in python arguments. The behavior of the assertion can be
tweaked by changing the bcc.utils._strict_bytes bool.

Going forward, one should never invoke decode() on a bpf data stream,
e.g. the result of a table lookup or perf ring output. Leave that data
in the native bytes() representation.

Signed-off-by: Brenden Blanco <bblanco@gmail.com>
Move bcc internals to sanitize all arguments as bytes instead of str
type. For now, disable warnings or actual assertion, to be turned on
incrementally.
Conform to bytes encoding for some portion of the tools/tests, such that
smoke tests pass on python3. More conversions are surely required.

Signed-off-by: Brenden Blanco <bblanco@gmail.com>
Copy link
Collaborator

@yonghong-song yonghong-song left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think this is a reasonable approach to ease the pain of python 2/3 compability and avoid a lot of encode('ascii') in the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants