Discourage use of str() type strings in python API #1582

drzaeus77 · 2018-02-08T23:33:56Z

This PR attempts to remove use of the python2 str() type object in favor of the more explicit bytes(), which has a consistent behavior between python2/3. Any unknown or command line arguments which are meant to be compared against kernel strings (e.g. things on the ring buffer) should be of this object type. There should be no more use of encode/decode to mangle data coming from an untrusted source, as this has been shown to confuse the utf encoder.

The approach that this change takes does put a burden on tools authors, since it requires use of b"" type strings for any arguments that are fed to a c API. This includes the text of the program to be compiled itself.

As an aid to conversion, all python APIs that wrap a C api have added a sanity check to the arguments that are expected to be c-strings, _assert_is_bytes(). For now, this helper will check and silently convert, in the safest way feasible, a utf string to a ascii string. The assertion internally has a warning that will report incorrect API usage with a command-line configurable check. For instance, running python -W default ./killsnoop.py will report 2 incorrect uses:

/usr/lib/python3.6/site-packages/bcc/__init__.py:265: DeprecationWarning: not a bytes object: '\n#include <uapi/linux/ptrace.h>\n#inclu ...'
  text = _assert_is_bytes(text)
TIME      PID    COMM             SIG  TPID   RESULT
/usr/lib/python3.6/site-packages/bcc/__init__.py:430: DeprecationWarning: not a bytes object: 'events'

Follow-on commits should attempt to clean up these warnings on a tool-by-tool basis.

Signed-off-by: Brenden Blanco <bblanco@gmail.com>

Introduce some helpers for managing bytes/unicode objects in a way that bridges the gap from python2 to 3. 1. Add printb() helper for writing bytes output directly to stdout. This avoids complaints from print() in python3, which expects a unicode str(). Since python 3.5, `b"" % bytes()` style format strings should work and we can write tools with common code, once we convert format strings to bytes. http://legacy.python.org/dev/peps/pep-0461/ 2. Add a class for wrapping command line arguments that are intended for comparing to debugged memory, for instance running process COMM or kernel pathname data. The approach takes some of the discussion from http://legacy.python.org/dev/peps/pep-0383/ into account, though unfortunately the python2-future implementation of "surrogateescape" is buggy, therefore this iteration is partial. The object instance should only ever be coerced into a bytes object. This silently invokes encode(sys.getfilesystemencoding()), which if it fails implies that the tool was passed junk characters on the command line. Thereafter the tool should implement only bytes-bytes comparisons (for instance re.search(b"", b"")) and bytes stdout printing (see printb). 3. Add an _assert_is_bytes helper to check for improper usage of str objects in python arguments. The behavior of the assertion can be tweaked by changing the bcc.utils._strict_bytes bool. Going forward, one should never invoke decode() on a bpf data stream, e.g. the result of a table lookup or perf ring output. Leave that data in the native bytes() representation. Signed-off-by: Brenden Blanco <bblanco@gmail.com>

Move bcc internals to sanitize all arguments as bytes instead of str type. For now, disable warnings or actual assertion, to be turned on incrementally.

Conform to bytes encoding for some portion of the tools/tests, such that smoke tests pass on python3. More conversions are surely required. Signed-off-by: Brenden Blanco <bblanco@gmail.com>

yonghong-song

LGTM. I think this is a reasonable approach to ease the pain of python 2/3 compability and avoid a lot of encode('ascii') in the code.

drzaeus77 added 4 commits February 8, 2018 15:22

Fix file desc leak in test_tools_smoke

c28f6e8

Signed-off-by: Brenden Blanco <bblanco@gmail.com>

python/bcc: add internal _assert_is_bytes usage

c0ca99a

Move bcc internals to sanitize all arguments as bytes instead of str type. For now, disable warnings or actual assertion, to be turned on incrementally.

tools: switch to v2/3 independent bytes usage

42d6098

Conform to bytes encoding for some portion of the tools/tests, such that smoke tests pass on python3. More conversions are surely required. Signed-off-by: Brenden Blanco <bblanco@gmail.com>

drzaeus77 requested review from brendangregg, goldshtn and yonghong-song as code owners February 8, 2018 23:33

yonghong-song approved these changes Feb 11, 2018

View reviewed changes

yonghong-song merged commit 5ec4cc4 into master Feb 12, 2018

drzaeus77 deleted the python_bytes branch February 21, 2018 19:23

drzaeus77 mentioned this pull request Apr 19, 2018

Add support for remote target devices via BPFd #1675

Closed

yonghong-song mentioned this pull request May 7, 2018

Fix USDT probes arguments' encoding in Python3 #1736

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discourage use of str() type strings in python API #1582

Discourage use of str() type strings in python API #1582

drzaeus77 commented Feb 8, 2018

yonghong-song left a comment

Discourage use of str() type strings in python API #1582

Discourage use of str() type strings in python API #1582

Conversation

drzaeus77 commented Feb 8, 2018

yonghong-song left a comment

Choose a reason for hiding this comment