
GH-96421: Introduce cleanup contexts for inlined frames #98795

Closed
wants to merge 2 commits

Conversation


@brandtbucher (Member) commented on Oct 28, 2022:

This is intended as an alternative to #96319. I personally find that this is a bit simpler, more scalable, and easier to reason about.

Rather than pushing "shim" frames, give _PyInterpreterFrame a cleanup member that tells the interpreter how to handle its two exit paths (the old exit_unwind label on error, and the new cleanup label on success).

This proof-of-concept sketches out how inlining generators and class constructions would work under this new scheme by implementing their cleanup code in exit_unwind and cleanup. However, the specializations themselves are still unimplemented.

@brandtbucher added the performance (Performance or resource usage) and interpreter-core (Objects, Python, Grammar, and Parser dirs) labels on Oct 28, 2022
@brandtbucher (Member Author) commented:

Performance impact is neutral:

Slower (23):
- generators: 73.0 ms +- 0.4 ms -> 76.4 ms +- 0.6 ms: 1.05x slower
- pickle_list: 4.03 us +- 0.05 us -> 4.21 us +- 0.04 us: 1.04x slower
- scimark_monte_carlo: 65.8 ms +- 0.6 ms -> 67.7 ms +- 2.7 ms: 1.03x slower
- coroutines: 24.9 ms +- 0.2 ms -> 25.5 ms +- 0.3 ms: 1.03x slower
- unpickle: 13.3 us +- 0.2 us -> 13.6 us +- 1.1 us: 1.02x slower
- pidigits: 190 ms +- 0 ms -> 194 ms +- 0 ms: 1.02x slower
- crypto_pyaes: 72.4 ms +- 0.6 ms -> 73.9 ms +- 0.6 ms: 1.02x slower
- deepcopy_reduce: 2.89 us +- 0.03 us -> 2.95 us +- 0.07 us: 1.02x slower
- go: 132 ms +- 1 ms -> 135 ms +- 3 ms: 1.02x slower
- nqueens: 79.6 ms +- 0.7 ms -> 81.1 ms +- 3.1 ms: 1.02x slower
- deepcopy: 326 us +- 2 us -> 330 us +- 3 us: 1.01x slower
- sympy_sum: 162 ms +- 2 ms -> 165 ms +- 2 ms: 1.01x slower
- spectral_norm: 95.1 ms +- 1.8 ms -> 96.2 ms +- 1.3 ms: 1.01x slower
- deltablue: 3.25 ms +- 0.04 ms -> 3.29 ms +- 0.03 ms: 1.01x slower
- chameleon: 6.53 ms +- 0.05 ms -> 6.60 ms +- 0.10 ms: 1.01x slower
- scimark_lu: 108 ms +- 2 ms -> 109 ms +- 2 ms: 1.01x slower
- hexiom: 6.01 ms +- 0.05 ms -> 6.06 ms +- 0.06 ms: 1.01x slower
- fannkuch: 372 ms +- 3 ms -> 374 ms +- 7 ms: 1.01x slower
- logging_simple: 5.75 us +- 0.08 us -> 5.78 us +- 0.09 us: 1.01x slower
- 2to3: 248 ms +- 1 ms -> 249 ms +- 1 ms: 1.01x slower
- sympy_integrate: 20.4 ms +- 0.1 ms -> 20.5 ms +- 0.1 ms: 1.00x slower
- python_startup_no_site: 6.08 ms +- 0.01 ms -> 6.09 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.44 ms +- 0.01 ms -> 8.45 ms +- 0.01 ms: 1.00x slower

Faster (23):
- unpack_sequence: 48.5 ns +- 4.3 ns -> 43.6 ns +- 0.9 ns: 1.11x faster
- mdp: 2.71 sec +- 0.01 sec -> 2.52 sec +- 0.02 sec: 1.08x faster
- regex_v8: 22.8 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.07x faster
- regex_effbot: 3.57 ms +- 0.01 ms -> 3.44 ms +- 0.02 ms: 1.04x faster
- float: 72.7 ms +- 1.0 ms -> 70.7 ms +- 1.0 ms: 1.03x faster
- chaos: 67.6 ms +- 0.8 ms -> 66.4 ms +- 0.7 ms: 1.02x faster
- mako: 9.71 ms +- 0.08 ms -> 9.56 ms +- 0.05 ms: 1.02x faster
- regex_dna: 204 ms +- 1 ms -> 201 ms +- 1 ms: 1.01x faster
- pickle_dict: 31.5 us +- 0.1 us -> 31.1 us +- 0.1 us: 1.01x faster
- pyflate: 399 ms +- 3 ms -> 393 ms +- 2 ms: 1.01x faster
- thrift: 742 us +- 22 us -> 732 us +- 10 us: 1.01x faster
- richards: 43.4 ms +- 1.2 ms -> 42.8 ms +- 1.2 ms: 1.01x faster
- scimark_fft: 318 ms +- 11 ms -> 314 ms +- 4 ms: 1.01x faster
- unpickle_pure_python: 205 us +- 3 us -> 203 us +- 2 us: 1.01x faster
- logging_silent: 92.5 ns +- 0.4 ns -> 91.7 ns +- 0.4 ns: 1.01x faster
- dulwich_log: 61.2 ms +- 0.4 ms -> 60.7 ms +- 0.4 ms: 1.01x faster
- gunicorn: 1.08 ms +- 0.01 ms -> 1.07 ms +- 0.00 ms: 1.01x faster
- sqlglot_transpile: 1.63 ms +- 0.02 ms -> 1.62 ms +- 0.02 ms: 1.01x faster
- sqlglot_optimize: 51.2 ms +- 0.4 ms -> 50.9 ms +- 0.3 ms: 1.01x faster
- deepcopy_memo: 34.1 us +- 0.6 us -> 33.9 us +- 0.5 us: 1.01x faster
- sqlglot_normalize: 106 ms +- 1 ms -> 105 ms +- 1 ms: 1.01x faster
- pycparser: 1.08 sec +- 0.02 sec -> 1.08 sec +- 0.02 sec: 1.01x faster
- sqlglot_parse: 1.34 ms +- 0.01 ms -> 1.33 ms +- 0.01 ms: 1.00x faster

Benchmark hidden because not significant (37): aiohttp, async_tree_none, async_tree_cpu_io_mixed, async_tree_io, async_tree_memoization, coverage, django_template, genshi_text, genshi_xml, html5lib, json, json_dumps, json_loads, logging_format, meteor_contest, mypy, nbody, pathlib, pickle, pickle_pure_python, pprint_safe_repr, pprint_pformat, pylint, raytrace, regex_compile, scimark_sor, scimark_sparse_mat_mult, sqlite_synth, sympy_expand, sympy_str, telco, tornado_http, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00x faster

@brandtbucher (Member Author) commented:

CC @markshannon. Let me know what you think!

@brandtbucher (Member Author) commented:

(Sorry, I thought you were already requested for review, but I forgot that draft PRs don't do that.)

@markshannon (Member) commented:

I don't see how this is simpler.

Once you strip out the change note and documentation, the PRs are about the same size.
Half of the code added in #96319 is for creating these sorts of trampoline functions. Even if we use the approach in this PR, it just isn't as general as inserting shim functions, so we will need shim functions anyway.
In terms of changes to the interpreter, this PR is much more complex.

How do the cleanup contexts compose? How do you do two things in a cleanup context?
For example, since having to pop a chunk when popping a frame is so rare, it might make sense to insert a shim frame to handle that (it probably wouldn't, it's just an example).
How would cleanup contexts handle that use case and yet another special case?

@brandtbucher (Member Author) commented:

I don't see how this is simpler.

What I find simpler is that all of the new complexity is limited to a very specific part of _PyEval_EvalFrameDefault, rather than everywhere frames are used. It also maintains a simple, intuitive rule that has served us well thus far: one frame per Python call in the source program. It also avoids the overhead of pushing and initializing additional frames in lots of places, which seems like a real cost (reflected in the benchmark results), not to mention the burden of maintaining hand-written bytecode, location tables, etc.

Once you strip out the change note and documentation, the PRs are about the same size. Half of the code added in #96319 is for creating these sorts of trampoline functions.

Well, that's a bit unfair. The majority of this PR is the PoC implementations of generator and class inlining, which I thought you might like to see. I've removed those parts since the point's been made, and the PR is downright tiny now. ;)

Also, this PR doesn't need as much documentation, since it doesn't break out-of-process debuggers and profilers like the other PR (which I see as a strong point in favor of this approach).

Even if we use the approach in this PR, it just isn't as general as inserting shim functions, so we will need shim functions anyway.

For what? I've demonstrated that inlining generators and class construction is quite straightforward with this approach. To me, it seems general enough for other places where we might want to insert arbitrary cleanup code between frames.

I could be missing something, though. What else do we "need" shim functions for?

In terms of changes to the interpreter, this PR is much more complex.

If by "interpreter" you mean "the eval loop", then maybe. But if by "interpreter" you mean "the entire runtime", then I disagree (strongly).

How do the cleanup contexts compose? How do you do two things in a cleanup context? For example, since having to pop a chunk when popping a frame is so rare, it might make sense to insert a shim frame to handle that (it probably wouldn't, it's just an example). How would cleanup contexts handle that use case and yet another special case?

If we really needed something like this (it sounds like we might not, though), I imagine it could be quite straightforward to do using flags for composable handlers rather than enumerated values, as I've done here.

Coming out of the 3.11 prereleases, I feel quite strongly that the last things we need in 3.12 are gratuitous third-party breakage and additional (delicate) rules about how frames should be handled. Why not start simple by leaving shim frames out entirely, then add all of that only when we have a justified need? I'm not 100% opposed to shim frames, I just think that an approach like this one is a better way of solving the same set of problems right now.
