Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC crash _PyObject_AssertFailed with pdb #94215

Closed
kumaraditya303 opened this issue Jun 24, 2022 · 23 comments
Closed

GC crash _PyObject_AssertFailed with pdb #94215

kumaraditya303 opened this issue Jun 24, 2022 · 23 comments
Assignees
Labels
3.11 only security fixes 3.12 bugs and security fixes type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@kumaraditya303
Copy link
Contributor

kumaraditya303 commented Jun 24, 2022

Reproducer:

def func():
    print(
      42
    )

import pdb; pdb.set_trace()
func()

Crash:

@kumaraditya303 ➜ /workspaces/cpython (main ✗) $ ./python main.py 
> /workspaces/cpython/main.py(7)<module>()
-> func()
(Pdb) s
--Call--
> /workspaces/cpython/main.py(1)func()
-> def func():
(Pdb) n
> /workspaces/cpython/main.py(2)func()
-> print(
(Pdb) n
> /workspaces/cpython/main.py(3)func()
-> 42
(Pdb) j 1
> /workspaces/cpython/main.py(1)func()
-> def func():
(Pdb) exit
Traceback (most recent call last):
  File "/workspaces/cpython/main.py", line 7, in <module>
    func()
    ^^^^^^
  File "/workspaces/cpython/main.py", line 1, in func
    def func():
    
  File "/workspaces/cpython/main.py", line 1, in func
    def func():
    
  File "/workspaces/cpython/Lib/bdb.py", line 90, in trace_dispatch
    return self.dispatch_line(frame)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/cpython/Lib/bdb.py", line 115, in dispatch_line
    if self.quitting: raise BdbQuit
                      ^^^^^^^^^^^^^
bdb.BdbQuit
Modules/gcmodule.c:113: gc_decref: Assertion "gc_get_refs(g) > 0" failed: refcount is too small
Enable tracemalloc to get the memory block allocation traceback

object address  : 0x7f18ff3b81d0
object refcount : 2
object type     : 0x55cfef2a6460
object type name: builtin_function_or_method
object repr     : <built-in function print>

Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: finalizing (tstate=0x000055cfef3bca48)

Current thread 0x00007f18ff707280 (most recent call first):
  Garbage-collecting
  <no Python frame>
Aborted (core dumped)
GDB stack trace (main)
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7c71859 in __GI_abort () at abort.c:79
#2  0x000055555585db52 in fatal_error_exit (status=-1) at Python/pylifecycle.c:2614
#3  fatal_error (fd=2, header=header@entry=1, prefix=<optimized out>, 
    prefix@entry=0x55555594cd70 <__func__.19026> "_PyObject_AssertFailed", msg=<optimized out>, 
    msg@entry=0x55555594c575 "_PyObject_AssertFailed", status=status@entry=-1)
    at Python/pylifecycle.c:2795
#4  0x000055555585dbb4 in _Py_FatalErrorFunc (
    func=func@entry=0x55555594cd70 <__func__.19026> "_PyObject_AssertFailed", 
    msg=msg@entry=0x55555594c575 "_PyObject_AssertFailed") at Python/pylifecycle.c:2811
#5  0x00005555557431d2 in _PyObject_AssertFailed (
    obj=obj@entry=<built-in method print of module object at remote 0x7ffff78f32f0>, 
    expr=expr@entry=0x5555559ab18c "gc_get_refs(g) > 0", 
    msg=msg@entry=0x55555594c5eb "refcount is too small", 
    file=file@entry=0x5555559ab11c "Modules/gcmodule.c", line=line@entry=113, 
    function=function@entry=0x5555559ab6c8 <__func__.18373> "gc_decref") at Objects/object.c:2360
#6  0x000055555588ad50 in gc_decref (g=0x7ffff79001c0) at Modules/gcmodule.c:116
#7  visit_decref (op=<built-in method print of module object at remote 0x7ffff78f32f0>, 
    parent=0x7ffff7900ad0) at Modules/gcmodule.c:459
#8  0x00005555557249a5 in dict_traverse (op=<optimized out>, visit=0x55555588ac96 <visit_decref>, 
    arg=0x7ffff7900ad0) at Objects/dictobject.c:3547
#9  0x000055555588a741 in subtract_refs (
    containers=containers@entry=0x555555b869a8 <_PyRuntime+53896>) at Modules/gcmodule.c:478
#10 0x000055555588be95 in deduce_unreachable (unreachable=0x7fffffffc1d0, 
    base=0x555555b869a8 <_PyRuntime+53896>) at Modules/gcmodule.c:1100
#11 gc_collect_main (tstate=tstate@entry=0x555555ba0a48 <_PyRuntime+160552>, 
    generation=generation@entry=2, n_collected=n_collected@entry=0x7fffffffc258, 
    n_uncollectable=n_uncollectable@entry=0x7fffffffc260, nofail=nofail@entry=0)
    at Modules/gcmodule.c:1226

Python versions tested:

  • Python 3.11.0b3+ 41e4b42 (remotes/upstream/3.11:41e4b426ce, Jun 24 2022, 11:58:10) [GCC 9.4.0] on linux
  • Python 3.12.0a0 51fd4de (heads/main:51fd4de101, Jun 24 2022, 12:02:08) [GCC 9.4.0] on linux

cc @markshannon @pablogsal

@kumaraditya303 kumaraditya303 added 3.11 only security fixes type-crash A hard crash of the interpreter, possibly with a core dump 3.12 bugs and security fixes release-blocker labels Jun 24, 2022
@tiran
Copy link
Member

tiran commented Jun 29, 2022

I can reproduce the issue in a pydebug build:

../../Modules/gcmodule.c:113: gc_decref: Assertion "gc_get_refs(g) > 0" failed: refcount is too small
Enable tracemalloc to get the memory block allocation traceback

object address  : 0x7fffea698230
object refcount : 2
object type     : 0x7ccfa0
object type name: builtin_function_or_method
object repr     : <built-in function print>

Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: finalizing (tstate=0x00000000008e2fa8)

Current thread 0x00007ffff7cb7740 (most recent call first):
  Garbage-collecting
  <no Python frame>

Program received signal SIGABRT, Aborted.
(gdb) bt
#0  0x00007ffff7d48c4c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007ffff7cf89c6 in raise () from /lib64/libc.so.6
#2  0x00007ffff7ce27f4 in abort () from /lib64/libc.so.6
#3  0x00000000005b55dc in fatal_error_exit (status=<optimized out>) at ../../Python/pylifecycle.c:2614
#4  0x00000000005b69da in fatal_error (fd=2, header=header@entry=1, prefix=prefix@entry=0x6799b0 <__func__.1> "_PyObject_AssertFailed", msg=msg@entry=0x67985d "_PyObject_AssertFailed", status=status@entry=-1)
    at ../../Python/pylifecycle.c:2795
#5  0x00000000005b6a3b in _Py_FatalErrorFunc (func=func@entry=0x6799b0 <__func__.1> "_PyObject_AssertFailed", msg=msg@entry=0x67985d "_PyObject_AssertFailed") at ../../Python/pylifecycle.c:2811
#6  0x00000000004e855f in _PyObject_AssertFailed (obj=<built-in method print of module object at remote 0x7fffea68b350>, expr=expr@entry=0x70d394 "gc_get_refs(g) > 0", 
    msg=msg@entry=0x6798a9 "refcount is too small", file=file@entry=0x70d1fc "../../Modules/gcmodule.c", line=line@entry=113, function=function@entry=0x70d7c8 <__func__.26> "gc_decref")
    at ../../Objects/object.c:2360
#7  0x00000000005d9e20 in gc_decref (g=<optimized out>) at ../../Modules/gcmodule.c:113
#8  0x00000000005da4a5 in visit_decref (op=<built-in method print of module object at remote 0x7fffea68b350>, parent=0x7fffea698b30) at ../../Modules/gcmodule.c:459
#9  0x00000000004d347e in dict_traverse (op=<optimized out>, visit=0x5da446 <visit_decref>, arg=0x7fffea698b30) at ../../Objects/dictobject.c:3547
#10 0x00000000005d8d78 in subtract_refs (containers=containers@entry=0x8c8f08 <_PyRuntime+53896>) at ../../Modules/gcmodule.c:478
#11 0x00000000005d9f16 in deduce_unreachable (base=base@entry=0x8c8f08 <_PyRuntime+53896>, unreachable=unreachable@entry=0x7fffffffd410) at ../../Modules/gcmodule.c:1100
#12 0x00000000005da956 in gc_collect_main (tstate=tstate@entry=0x8e2fa8 <_PyRuntime+160552>, generation=generation@entry=2, n_collected=n_collected@entry=0x7fffffffd468, 
    n_uncollectable=n_uncollectable@entry=0x7fffffffd460, nofail=nofail@entry=0) at ../../Modules/gcmodule.c:1226
#13 0x00000000005dae4e in gc_collect_with_callback (tstate=tstate@entry=0x8e2fa8 <_PyRuntime+160552>, generation=generation@entry=2) at ../../Modules/gcmodule.c:1400
#14 0x00000000005db3c9 in PyGC_Collect () at ../../Modules/gcmodule.c:2086
#15 0x00000000005b6601 in Py_FinalizeEx () at ../../Python/pylifecycle.c:1823
#16 0x00000000005d8c25 in Py_RunMain () at ../../Modules/main.c:691
#17 0x00000000005d8c75 in pymain_main (args=args@entry=0x7fffffffd550) at ../../Modules/main.c:719
#18 0x00000000005d8cfa in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at ../../Modules/main.c:743
#19 0x000000000041d73f in main (argc=<optimized out>, argv=<optimized out>) at ../../Programs/python.c:15
(gdb) p op
$14 = <built-in method print of module object at remote 0x7fffea68b350>
(gdb) p *((PyGC_Head *)(((char *)(op))-sizeof(PyGC_Head)))
$15 = {_gc_next = 140737126171264, _gc_prev = 2}

@markshannon
Copy link
Member

This might be related to #94438

@tiran
Copy link
Member

tiran commented Jul 5, 2022

It's seems to be a ref counting issue. The jump and frame_setlineno seems to reduce the reference count by one. Please notice that the case without jump starts interpreter shutdown with gc: 2, refcnt: 3. With jump 1 shutdown starts with gc: 1, refcnt: 2.

ref counts without jump

> tc.py(11)<module>()
-> func()
(Pdb) s
--Call--
> tc.py(1)func()
-> def func():
(Pdb) n
> tc.py(2)func()
-> print(
(Pdb) n
> tc.py(3)func()
-> 42
(Pdb) exit
...
bdb.BdbQuit
gc: 2, refcnt: 3
gc: 1, refcnt: 3
gc: 2, refcnt: 3
gc: 1, refcnt: 3
gc: 1, refcnt: 1
gc: 1, refcnt: 1

ref counts with jump

> tc.py(11)<module>()
-> func()
(Pdb) s
--Call--
> tc.py(1)func()
-> def func():
(Pdb) n
> tc.py(2)func()
-> print(
(Pdb) n
> tc.py(3)func()
-> 42
(Pdb) j 1
> tc.py(1)func()
-> def func():
(Pdb) exit
...
bdb.BdbQuit
gc: 1, refcnt: 2
../../Modules/gcmodule.c:113: gc_decref: Assertion "gc_get_refs(g) > 0" failed: refcount is too small

debug hack

--- a/Objects/methodobject.c
+++ b/Objects/methodobject.c
@@ -249,6 +249,10 @@ meth_traverse(PyCFunctionObject *m, visitproc visit, void *arg)
     Py_VISIT(PyCFunction_GET_CLASS(m));
     Py_VISIT(m->m_self);
     Py_VISIT(m->m_module);
+    if (strcmp(m->m_ml->ml_name, "print") == 0) {
+        #define AS_GC(o) ((PyGC_Head *)(((char *)(o))-sizeof(PyGC_Head)))
+        fprintf(stderr, "gc: %ld, refcnt: %ld\n", (Py_ssize_t)(AS_GC(m)->_gc_prev >> _PyGC_PREV_SHIFT), Py_REFCNT(m));
+    }
     return 0;
 }

@tiran
Copy link
Member

tiran commented Jul 5, 2022

The decref in frame_stack_pop seems to be the culprit. If I drop the decref, then pdb session no longer crashes. refleak checks for test_frame test_code test_traceback test_pdb do not show any refleaks either.

static void
frame_stack_pop(PyFrameObject *f)
{
    PyObject *v = _PyFrame_StackPop(f->f_frame);
    Py_XDECREF(v);
}

@iritkatriel
Copy link
Member

I'm not sure how much that function is covered by tests. Can you verify whether it is?

@tiran
Copy link
Member

tiran commented Jul 5, 2022

frame_stack_pop() is only used by frame_setlineno() conditionally. I guess it is not covered by any test case at all. Otherwise we would have seen crashers.

@pablogsal
Copy link
Member

The issue may be a bit more general. Notice the jump is going back to the bytecode so we need to ensure that jumps and continues are always idempotent

@tiran
Copy link
Member

tiran commented Jul 5, 2022

Python 3.9 and 3.10 are affected, too. The reproducer crashes debug builds with assertion error

python: ../../Python/ceval.c:1493: _PyEval_EvalFrameDefault: Assertion `STACK_LEVEL() <= co->co_stacksize' failed.

@kumaraditya303
Copy link
Contributor Author

Python 3.9 and 3.10 are affected, too. The reproducer crashes debug builds with assertion error

python: ../../Python/ceval.c:1493: _PyEval_EvalFrameDefault: Assertion `STACK_LEVEL() <= co->co_stacksize' failed.

That's a different issue. See #91742 (comment)

@tiran
Copy link
Member

tiran commented Jul 7, 2022

I have an isolated test case and instructions how to trace the reference count of the problematic object with gdb.

def func():
    def inner(v): pass
    print(f"watch ((PyObject*){id(inner)})->ob_refcnt")
    inner(
        42
    )

print("s n n n j 4 exit")
import pdb; pdb.set_trace()
func()

# tc.py
# build Python with --with-pydebug
# $ gdb ./python --ex "run tc.py"
# in pdb run: "s" "n" "n" "n" "CTRL+Z"
# in gdb run the "watch" command and continue ("c" "c")
# in pdb eventually run "n j 4" "exit"

The reference counting problem could be a in frame_stack_pop, _PyFrame_LocalsToFast, _PyFrame_FastToLocalsWithError, or exception_unwind block around ceval.c:5813.

tiran added a commit to tiran/cpython that referenced this issue Jul 7, 2022
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 7, 2022
…honGH-94563)

(cherry picked from commit de58842)

Co-authored-by: Christian Heimes <christian@python.org>
miss-islington added a commit that referenced this issue Jul 7, 2022
(cherry picked from commit de58842)

Co-authored-by: Christian Heimes <christian@python.org>
@encukou
Copy link
Member

encukou commented Jul 7, 2022

Bisection using that reproducer shows that the failure was introduced with "Zero cost" exception handling, back in gh-25729.

@iritkatriel
Copy link
Member

I believe what's happening is that the ceval.c unwind code is unaware of the fact that frame_stack_pop already popped a couple of items from the stack. So stack_pointer is wrong there.

It seems to fix it if I add

                PyObject **current_stack_pointer = _PyFrame_GetStackPointer(frame);
                stack_pointer = (current_stack_pointer < stackbase) ? stackbase : current_stack_pointer;

just before

                while (stack_pointer > stackbase) {
                    PyObject *o = POP();
                    Py_XDECREF(o);
                }

@tiran
Copy link
Member

tiran commented Jul 7, 2022

The change introduces a reference leak

$ ./python -m test test_pdb -R::
0:00:00 load avg: 1.33 Run tests sequentially
0:00:00 load avg: 1.33 [1/1] test_pdb
beginning 9 repetitions
123456789
.........
test_pdb leaked [255, 255, 255, 255] references, sum=1020
test_pdb leaked [134, 129, 51, 136] memory blocks, sum=450
test_pdb failed (reference leak)

@tiran
Copy link
Member

tiran commented Jul 7, 2022

This code does neither crash nor cause a ref leak, though:

                if (frame->stacktop != -1) {
                    PyObject **current_stack_pointer = _PyFrame_GetStackPointer(frame);
                    stack_pointer = (current_stack_pointer < stackbase) ? stackbase : current_stack_pointer;
                }

@iritkatriel
Copy link
Member

ah if you do this if I think you can just assign stack_pointer to _PyFrame_GetStackPointer(frame);

@iritkatriel
Copy link
Member

Or just this and then we don't need to access frame->stacktop.

                PyObject **current_stack_pointer = _PyFrame_GetStackPointer(frame);
                if (current_stack_pointer >= stackbase) {
                    stack_pointer = current_stack_pointer;
                }

@iritkatriel
Copy link
Member

Maybe _PyFrame_GetStackPointer should return NULL when stack top is -1?

@ericsnowcurrently
Copy link
Member

I've been looking at Include/internal/pycore_frame.h, Objects/frameobject.c, and Python/ceval.c. Let me see if I've understood the problem.

  1. the authoritative pointer to the top of a frame's stack is provided by _PyFrame_GetStackPointer()
  2. the pointer can only be modified by a call to _PyFrame_SetStackPointer(), _PyFrame_StackPop(), or _PyFrame_StackPush()
  3. stack_pointer is a C local (for the sake of eval loop performance) that is meant to match the pointer to the top of the frame's stack
  4. any code that uses stack_pointer (e.g. passing to vectorcall) relies on it being in sync with frame->stacktop
  5. we synchronize the two at various points in the eval loop using _PyFrame_GetStackPointer() and _PyFrame_SetStackPointer()
  6. the stack also gets modified via _PyFrame_StackPush() and _PyFrame_StackPop()
    • in the eval loop (when returning from functions/generators)
    • in methods of the various kinds of generator (throw(), send(), __next__())
    • in the frameobj.f_lineno setter (the already mentioned frame_setlineno())
  7. the way the tracing machinery triggers in the eval loop means it can impact the frame state without the rest of the eval loop having a chance to synchronize
  8. the "no-cost exceptions" implementation did not take that into account, instead assuming the eval loop had kept stack_pointer in sync

Is all that correct?

Related questions:

  • how sure are we that exception_unwind is the only eval loop code that suffers from (8)?
  • is the tracing machinery the only (realistic) place where code outside the eval loop can cause the local vars (e.g. stack_pointer) to get out of sync with the frame?
  • is the problem, (8), related to all the tracing entry points (TRACE_FUNCTION_ENTRY, etc.) or just one of them (I'm looking at you TRACE_FUNCTION_UNWIND)?
  • would we need to address similar concerns with DTRACE_FUNCTION_*?

Based on all that, the possible solutions I see:

  1. do stack_pointer = _PyFrame_GetStackPointer(frame); in exception_unwind, as proposed in the PR
  2. do it a few lines up, right after TRACE_FUNCTION_UNWIND() (if that's the problematic tracing hook)
  3. do it in TRACE_FUNCTION_ENTRY(), TRACE_FUNCTION_EXIT(), etc. (or the related functions)
  4. do it only in specific ones where needed

At the very least, solution (1) is an effective stopgap, if not a valid long-term solution. The answers to the above questions would help determine the validity of other possible solutions, as well as if solution (1) is sufficient long-term.

(It would also be good to know what the performance impact of the various possible solutions are.)

@ericsnowcurrently
Copy link
Member

Should we be using SET_LOCALS_FROM_FRAME() instead of just setting stack_pointer?

@tiran
Copy link
Member

tiran commented Jul 7, 2022

Your summary matches my understanding of the problem. In fact I now understand the problem much better. Thank you!

  • totally unsure, other areas could be affected. Since we neither see ref leaks nor crashes, it is unlikely, though.
  • I guess tracing, profiling and frame manipulation could affect the ceval loop. We saw crashes when caused by lineno modification of frames.
  • frame_setlineno does only permit lineno manipulation from a PyThreadState_GET()->tracing_what == PyTrace_LINE event or PyTrace_RETURN for suspended frames.
  • DTRACE_FUNCTION_* is not affected. They are userspace probes (one-way event notification API) that sends file name, function name, and line number to DTRACE or systemtap Kernel API.

@ericsnowcurrently
Copy link
Member

I've opened gh-94666 for discussion of the broader concern about keeping stack_pointer in sync with the frame, which is ultimately the cause of this bug.

tiran added a commit to tiran/cpython that referenced this issue Jul 7, 2022
Zero-cost exception handling did not take ``frame_setlineno()`` into
account. It can pop off stacks. This can lead to a crash.

Exception unwinding now re-calculates the stack pointer.

Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>
tiran pushed a commit that referenced this issue Jul 8, 2022
* Re-enable crasher
* Fix error handling for line-tracing events
* blurb add
tiran pushed a commit to tiran/cpython that referenced this issue Jul 8, 2022
…thonGH-94681)

* Re-enable crasher
* Fix error handling for line-tracing events
* blurb add
(cherry picked from commit 23ee4a8)

Co-authored-by: Brandt Bucher <brandtbucher@microsoft.com>
tiran added a commit that referenced this issue Jul 8, 2022
GH-94688)

* Re-enable crasher
* Fix error handling for line-tracing events
* blurb add
(cherry picked from commit 23ee4a8)

Co-authored-by: Brandt Bucher <brandtbucher@microsoft.com>
@tiran
Copy link
Member

tiran commented Jul 8, 2022

Main branch and 3.11 branch are fixed. Older branches are not affected. This issue is no longer a release blocker. 🎉

I'm leaving the ticket open for @markshannon to verify the patch after he is back.

@markshannon
Copy link
Member

The changes look good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.11 only security fixes 3.12 bugs and security fixes type-crash A hard crash of the interpreter, possibly with a core dump
Projects
Development

No branches or pull requests

7 participants