
GH-96421: Introduce cleanup contexts for inlined frames #98795

Closed
wants to merge 2 commits

Conversation


@brandtbucher (Member) commented on Oct 28, 2022:

This is intended as an alternative to #96319. I personally find that this is a bit simpler, more scalable, and easier to reason about.

Rather than pushing "shim" frames, give _PyInterpreterFrame a cleanup member that tells the interpreter how to handle its two exit paths (the old exit_unwind label on error, and the new cleanup label on success).

This proof-of-concept sketches out how inlining generators and class constructions would work under this new scheme by implementing their cleanup code in exit_unwind and cleanup. However, the specializations themselves are still unimplemented.

@brandtbucher added the performance (Performance or resource usage) and interpreter-core (Objects, Python, Grammar, and Parser dirs) labels on Oct 28, 2022
@brandtbucher (Member Author) commented:

Performance impact is neutral:

Slower (23):
- generators: 73.0 ms +- 0.4 ms -> 76.4 ms +- 0.6 ms: 1.05x slower
- pickle_list: 4.03 us +- 0.05 us -> 4.21 us +- 0.04 us: 1.04x slower
- scimark_monte_carlo: 65.8 ms +- 0.6 ms -> 67.7 ms +- 2.7 ms: 1.03x slower
- coroutines: 24.9 ms +- 0.2 ms -> 25.5 ms +- 0.3 ms: 1.03x slower
- unpickle: 13.3 us +- 0.2 us -> 13.6 us +- 1.1 us: 1.02x slower
- pidigits: 190 ms +- 0 ms -> 194 ms +- 0 ms: 1.02x slower
- crypto_pyaes: 72.4 ms +- 0.6 ms -> 73.9 ms +- 0.6 ms: 1.02x slower
- deepcopy_reduce: 2.89 us +- 0.03 us -> 2.95 us +- 0.07 us: 1.02x slower
- go: 132 ms +- 1 ms -> 135 ms +- 3 ms: 1.02x slower
- nqueens: 79.6 ms +- 0.7 ms -> 81.1 ms +- 3.1 ms: 1.02x slower
- deepcopy: 326 us +- 2 us -> 330 us +- 3 us: 1.01x slower
- sympy_sum: 162 ms +- 2 ms -> 165 ms +- 2 ms: 1.01x slower
- spectral_norm: 95.1 ms +- 1.8 ms -> 96.2 ms +- 1.3 ms: 1.01x slower
- deltablue: 3.25 ms +- 0.04 ms -> 3.29 ms +- 0.03 ms: 1.01x slower
- chameleon: 6.53 ms +- 0.05 ms -> 6.60 ms +- 0.10 ms: 1.01x slower
- scimark_lu: 108 ms +- 2 ms -> 109 ms +- 2 ms: 1.01x slower
- hexiom: 6.01 ms +- 0.05 ms -> 6.06 ms +- 0.06 ms: 1.01x slower
- fannkuch: 372 ms +- 3 ms -> 374 ms +- 7 ms: 1.01x slower
- logging_simple: 5.75 us +- 0.08 us -> 5.78 us +- 0.09 us: 1.01x slower
- 2to3: 248 ms +- 1 ms -> 249 ms +- 1 ms: 1.01x slower
- sympy_integrate: 20.4 ms +- 0.1 ms -> 20.5 ms +- 0.1 ms: 1.00x slower
- python_startup_no_site: 6.08 ms +- 0.01 ms -> 6.09 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.44 ms +- 0.01 ms -> 8.45 ms +- 0.01 ms: 1.00x slower

Faster (23):
- unpack_sequence: 48.5 ns +- 4.3 ns -> 43.6 ns +- 0.9 ns: 1.11x faster
- mdp: 2.71 sec +- 0.01 sec -> 2.52 sec +- 0.02 sec: 1.08x faster
- regex_v8: 22.8 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.07x faster
- regex_effbot: 3.57 ms +- 0.01 ms -> 3.44 ms +- 0.02 ms: 1.04x faster
- float: 72.7 ms +- 1.0 ms -> 70.7 ms +- 1.0 ms: 1.03x faster
- chaos: 67.6 ms +- 0.8 ms -> 66.4 ms +- 0.7 ms: 1.02x faster
- mako: 9.71 ms +- 0.08 ms -> 9.56 ms +- 0.05 ms: 1.02x faster
- regex_dna: 204 ms +- 1 ms -> 201 ms +- 1 ms: 1.01x faster
- pickle_dict: 31.5 us +- 0.1 us -> 31.1 us +- 0.1 us: 1.01x faster
- pyflate: 399 ms +- 3 ms -> 393 ms +- 2 ms: 1.01x faster
- thrift: 742 us +- 22 us -> 732 us +- 10 us: 1.01x faster
- richards: 43.4 ms +- 1.2 ms -> 42.8 ms +- 1.2 ms: 1.01x faster
- scimark_fft: 318 ms +- 11 ms -> 314 ms +- 4 ms: 1.01x faster
- unpickle_pure_python: 205 us +- 3 us -> 203 us +- 2 us: 1.01x faster
- logging_silent: 92.5 ns +- 0.4 ns -> 91.7 ns +- 0.4 ns: 1.01x faster
- dulwich_log: 61.2 ms +- 0.4 ms -> 60.7 ms +- 0.4 ms: 1.01x faster
- gunicorn: 1.08 ms +- 0.01 ms -> 1.07 ms +- 0.00 ms: 1.01x faster
- sqlglot_transpile: 1.63 ms +- 0.02 ms -> 1.62 ms +- 0.02 ms: 1.01x faster
- sqlglot_optimize: 51.2 ms +- 0.4 ms -> 50.9 ms +- 0.3 ms: 1.01x faster
- deepcopy_memo: 34.1 us +- 0.6 us -> 33.9 us +- 0.5 us: 1.01x faster
- sqlglot_normalize: 106 ms +- 1 ms -> 105 ms +- 1 ms: 1.01x faster
- pycparser: 1.08 sec +- 0.02 sec -> 1.08 sec +- 0.02 sec: 1.01x faster
- sqlglot_parse: 1.34 ms +- 0.01 ms -> 1.33 ms +- 0.01 ms: 1.00x faster

Benchmark hidden because not significant (37): aiohttp, async_tree_none, async_tree_cpu_io_mixed, async_tree_io, async_tree_memoization, coverage, django_template, genshi_text, genshi_xml, html5lib, json, json_dumps, json_loads, logging_format, meteor_contest, mypy, nbody, pathlib, pickle, pickle_pure_python, pprint_safe_repr, pprint_pformat, pylint, raytrace, regex_compile, scimark_sor, scimark_sparse_mat_mult, sqlite_synth, sympy_expand, sympy_str, telco, tornado_http, unpickle_list, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00x faster

@brandtbucher (Member Author) commented:

CC @markshannon. Let me know what you think!

@brandtbucher (Member Author) commented:

(Sorry, I thought you were already requested for review, but I forgot that draft PRs don't do that.)

@markshannon (Member) commented:

I don't see how this is simpler.

Once you strip out the change note and documentation, the PRs are about the same size.
Half of the code added in #96319 is for creating these sorts of trampoline functions. Even if we use the approach in this PR, it just isn't as general as inserting shim functions, so we will need shim functions anyway.
In terms of changes to the interpreter, this PR is much more complex.

How do the cleanup contexts compose? How do you do two things in a cleanup context?
For example, since having to pop a chunk when popping a frame is so rare, it might make sense to insert a shim frame to handle that (it probably wouldn't, it's just an example).
How would cleanup contexts handle that use case and yet another special case?

@brandtbucher (Member Author) commented:

I don't see how this is simpler.

What I find simpler is that all of the new complexity is limited to a very specific part of _PyEval_EvalFrameDefault, rather than everywhere frames are used. It also maintains a simple, intuitive rule that has served us well thus far: one frame per Python call in the source program. It also avoids the overhead of pushing and initializing additional frames in lots of places, which seems like a real cost (reflected in the benchmark results), not to mention the burden of maintaining hand-written bytecode, location tables, etc.

Once you strip out the change note and documentation, the PRs are about the same size. Half of the code added in #96319 is for creating these sorts of trampoline functions.

Well, that's a bit unfair. The majority of this PR is the PoC implementations of generator and class inlining, which I thought you might like to see. I've removed those parts since the point's been made, and the PR is downright tiny now. ;)

Also, this PR doesn't need as much documentation, since it doesn't break out-of-process debuggers and profilers like the other PR (which I see as a strong point in favor of this approach).

Even if we use the approach in this PR, it just isn't as general as inserting shim functions, so we will need shim functions anyway.

For what? I've demonstrated that inlining generators and class construction is quite straightforward with this approach. To me, it seems general enough for other places where we might want to insert arbitrary cleanup code between frames.

I could be missing something, though. What else do we "need" shim functions for?

In terms of changes to the interpreter, this PR is much more complex.

If by "interpreter" you mean "the eval loop", then maybe. But if by "interpreter" you mean "the entire runtime", then I disagree (strongly).

How do the cleanup contexts compose? How do you do two things in a cleanup context? For example, since having to pop a chunk when popping a frame is so rare, it might make sense to insert a shim frame to handle that (it probably wouldn't, it's just an example). How would cleanup contexts handle that use case and yet another special case?

If we really needed something like this (it sounds like we might not, though), I imagine it could be quite straightforward to do using flags for composable handlers rather than enumerated values, as I've done here.

Coming out of the 3.11 prereleases, I feel quite strongly that the last things we need in 3.12 are gratuitous third-party breakage and additional (delicate) rules about how frames should be handled. Why not start simple by leaving shim frames out entirely, then add all of that only when we have a justified need? I'm not 100% opposed to shim frames, I just think that an approach like this one is a better way of solving the same set of problems right now.
