using 100% trying to acquire lock #15

djay · 2023-08-22T10:26:01Z

We are seeing instances randomly go to 100%

With a bit of py-spy we managed to track it down to

haufe.requestmonitoring/haufe/requestmonitoring/monitor.py

Line 80 in 3e1af58

_lock.acquire()

Most likely related we are also seeing this exception which is likely to be related.

ValueError: too many values to unpack (expected 2)
  File "ZPublisher/WSGIPublisher.py", line 165, in transaction_pubevents
    notify(pubevents.PubStart(request))
  File "/app/lib/python3.9/site-packages/zope/event/__init__.py", line 32, in notify
    subscriber(event)
  File "/app/eggs/zope.component-5.0.1-py3.9.egg/zope/component/event.py", line 27, in dispatch
    component_subscribers(event, None)
  File "/app/eggs/zope.component-5.0.1-py3.9.egg/zope/component/_api.py", line 134, in subscribers
    return sitemanager.subscribers(objects, interface)
  File "/app/eggs/zope.interface-5.4.0-py3.9-linux-x86_64.egg/zope/interface/registry.py", line 448, in subscribers
    return self.adapters.subscribers(objects, provided)
  File "/app/eggs/zope.interface-5.4.0-py3.9-linux-x86_64.egg/zope/interface/adapter.py", line 899, in subscribers
    subscription(*objects)
  File "app/eggs/plone.rest-2.0.0a5-py3.9.egg/plone/rest/events.py", line 32, in subscriber_mark_as_api_request
    mark_as_api_request(
  File "app/eggs/plone.rest-2.0.0a5-py3.9.egg/plone/rest/events.py", line 18, in mark_as_api_request
    service_id = lookup_service_id(method, accept)
  File "app/eggs/plone.rest-2.0.0a5-py3.9.egg/plone/rest/negotiation.py", line 26, in lookup_service_id
    media_types = parse_accept_header(accept)
  File "app/eggs/plone.rest-2.0.0a5-py3.9.egg/plone/rest/negotiation.py", line 16, in parse_accept_header
    type_, subtype = media_type.split("/")
KeyError: 3738
  File "waitress/channel.py", line 428, in service
    task.service()
  File "waitress/task.py", line 168, in service
    self.execute()
  File "waitress/task.py", line 434, in execute
    app_iter = self.channel.server.application(environ, start_response)
  File "ZPublisher/httpexceptions.py", line 30, in __call__
    return self.application(environ, start_response)
  File "ZPublisher/WSGIPublisher.py", line 375, in publish_module
    with transaction_pubevents(request, response):
  File "contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "ZPublisher/WSGIPublisher.py", line 221, in transaction_pubevents
    notify(pubevents.PubFailure(request, exc_info, retry))
  File "/app/lib/python3.9/site-packages/zope/event/__init__.py", line 32, in notify
    subscriber(event)
  File "/app/eggs/zope.component-5.0.1-py3.9.egg/zope/component/event.py", line 27, in dispatch
    component_subscribers(event, None)
  File "/app/eggs/zope.component-5.0.1-py3.9.egg/zope/component/_api.py", line 134, in subscribers
    return sitemanager.subscribers(objects, interface)
  File "/app/eggs/zope.interface-5.4.0-py3.9-linux-x86_64.egg/zope/interface/registry.py", line 448, in subscribers
    return self.adapters.subscribers(objects, provided)
  File "/app/eggs/zope.interface-5.4.0-py3.9-linux-x86_64.egg/zope/interface/adapter.py", line 899, in subscribers
    subscription(*objects)
  File "haufe/requestmonitoring/monitor.py", line 182, in handle_request_end
    account_request(event.request, True)
  File "haufe/requestmonitoring/monitor.py", line 63, in account_request
    del _state[id]

But I can't yet work out why the finally wouldn't work in releasing the the lock. or perhaps exception is symptom rather than a cause

The text was updated successfully, but these errors were encountered:

djay · 2023-08-22T10:37:33Z

my reading of finally without an except is that it will never raise the exception because it gets thrown away. Thats obviously not the case. Either way I suspect changing the code to

    try:
        if end:
            del _state[id]
        else:
            _state[id] = Request(id, info, request, ticket.time, get_ident())
    except:
       _lock.release()
       raise
    finally:
        _lock.release()

Might solve the lock issue. But still to understand how the state gets a key error

djay · 2023-08-22T10:45:38Z

Ah ok. the cause of the keyerror is because the original exception is happening in another subscriber called first. But end request handler is still called regardless. So the fix code could be

    try:
        if end:
            if id in _state:
                del _state[id]
        else:
            _state[id] = Request(id, info, request, ticket.time, get_ident())
    except:
       _lock.release()
       raise
    finally:
        _lock.release()

djay · 2023-08-23T02:05:52Z

I've changed my mind. Everything says that python guarantees finally is called. so I can't think of any way the del _state[id] exception can be the cause of the lock not being released. Can you @zopyx ?
So the only other explanation is the _state.copy() had an exception so I've wrapped that. The problem is I can find no evidence for this in our logs. e.g. there are no errors "monitor thread died with exception".
Will see if this PR fixes things or not.

zopyx · 2023-08-23T03:09:02Z

Sorry, no idea...ask Dieter

d-maurer · 2023-08-23T11:45:17Z

...
except:
_lock.release()
raise
finally:
_lock.release()

The release in the except block is wrong: the finally will try again to release the lock and fail (because it is not in an acquired state).

d-maurer · 2023-08-23T12:09:14Z

In principle, the lock management should be correct: the lock is acquired (and released in a finally block) in account_request (called by worker threads) and in _Monitor.run (called by the monitor thread). It protects simple dict operations (shallow copy, adding an item, removing an item). The only potentially critical operation is the item deletion; it may call arbitrary destructor code, potentially misbehaving one (causing a long (potentially infinite) delay before the release).

A note in ZPublisher.interfaces.IPubFailure reveals a potential cause for the KeyError:

    Note: If a subscriber to 'IPubSuccess' raises an exception,
    then 'IPubFailure' may be notified in addtion to 'IPubSuccess'.

i.e. it is possible that there are 2 IPubEnd events for the same request. This may cause the observed KeyError.

djay · 2023-08-23T13:38:12Z

@d-maurer the keyerror is not the problem. I've fixed that in the referenced pr.

The problem is that I'm getting 100% CPU from a thread stuck in line 80 trying to acquire a lock and I can't see how your code is wrong to cause a lock to not be released that results in this kind of deadlock.

jsbueno · 2023-08-23T13:48:31Z

my reading of finally without an except is that it will never raise the exception because it gets thrown away. Thats obviously not the case. Either way I suspect changing the code to
    try:
        if end:
            del _state[id]
        else:
            _state[id] = Request(id, info, request, ticket.time, get_ident())
    except:
       _lock.release()
       raise
    finally:
        _lock.release()
Might solve the lock issue. But still to understand how the state gets a key error
Hi - sorry for pinging in without all the context - it looks like you gave up this approach, so ok.

I am just writing this to confirm that code like this would be essentially wrong, and .release() would be called twice in a row when any exception occurred. The finally clause will always run.

d-maurer · 2023-08-23T14:13:34Z

Dylan Jay wrote at 2023-8-23 06:38 -0700:

... The problem is that I'm getting 100% CPU from a thread stuck in line 80 trying to acquire a lock and I can't see how your code is wrong to cause a lock to not be released that results in this kind of deadlock.

I explained the only thing I see which could block the lock release for some time: misbehaving destructors activated during item deletion. I do not think the probability is high because the request (the main critical object) is not destroyed at this point (but keep alive until the request is closed). Thus, I too, do not have a convincing idea.

djay · 2023-08-24T10:08:42Z

@d-maurer I'm not sure I understand how a destructor might hold a lock?

My reasoning is the only thing that could cause the lock to not be released is an exception in

haufe.requestmonitoring/haufe/requestmonitoring/monitor.py

Line 81 in 3e1af58

pending = _state.copy()

Since that has no try: finally:. but no evidence this happened.

Also that means the loop would have been exited and there is no way for it to get stuck on line 80...

By destructors during item deletion do you mean that you think this could happen in when the state is deleted?

haufe.requestmonitoring/haufe/requestmonitoring/monitor.py

Line 62 in 3e1af58

del _state[id]

however finally should handle the release of the lock so would not cause the deadlock.

Anywhere else a destructor would not result in a lock not being released so wouldnt be a problem.

So I'm still stumped.

djay · 2023-08-24T10:27:50Z

@d-maurer actually I just did a test. lock.acquire doesn't use hardly any cpu so it can't be the cause. it might be just that when I did py-spy dump its what I keep seeing as the line it's stuck on because it's the slowest.

Instead it could be just running that loop fast with no sleep. but the only way that could be is if somehow config.period get reset to 0 halfway through the life of the zope process???

https://github.com/collective/haufe.requestmonitoring/blob/3e1af58eeeb014a3f94aa3fc37893f68cdbf4dd2/haufe/requestmonitoring/monitor.py#L79C1-L79C1

I just can't see how that could happen.

d-maurer · 2023-08-24T11:32:44Z

Dylan Jay wrote at 2023-8-24 03:08 -0700:

@d-maurer I'm not sure I understand how a destructor might hold a lock?

`account_request` is holding the lock while it executes `del state[id]`. The `del` may invoke arbitrary destructors and potentially one of them takes a lot of time.

My reasoning is the only thing that could cause the lock to not be released is an exception in https://github.com/collective/haufe.requestmonitoring/blob/3e1af58eeeb014a3f94aa3fc37893f68cdbf4dd2/haufe/requestmonitoring/monitor.py#L81

`state` is a `dict`; `state.copy()` creates a shallow copy. The only possible exception should be a `MemoryError`.

…

-- Dieter

djay · 2023-08-31T05:41:41Z

@d-maurer none of this explains what I'm seeing so still stumped.

No matter how many times I run py-spy on all of the running instances that switch to 100% cpu I get the same result. Stuck on the lock.acquire on line 80. No other active threads most of the time. and acquire should use no cpu if it's blocked and it has no timeout so it's not running in a tight loop.

Thread 0x7EFF19C52740 (active+gil): "MainThread"
    readable (waitress/server.py:266)
    poll (waitress/wasyncore.py:158)
    loop (waitress/wasyncore.py:245)
    run (waitress/server.py:322)
    serve (waitress/__init__.py:19)
    serve_paste (plone/recipe/zope2instance/ctl.py:917)
    serve (plone/recipe/zope2instance/ctl.py:942)
    serve (Zope2/Startup/serve.py:203)
    run (Zope2/Startup/serve.py:217)
    main (Zope2/Startup/serve.py:251)
    <module> (Zope2/Startup/serve.py:255)
    <module> (interpreter:323)
Thread 0x7EFF088E9700 (active): "Dummy-1"
    run (haufe/requestmonitoring/monitor.py:80)

100.00% 100.00%   13.00s    13.00s   run (haufe/requestmonitoring/monitor.py)
 40.00%  40.00%    5.16s     5.16s   readable (waitress/server.py)

So none of it makes sense.

But the PR at least fixes the problem of masking of other errors that happen in other subscribers. Once I get it into production we will see if helps with the 100% cpu issue or not, despite that in theory it should not.

d-maurer · 2023-08-31T06:49:55Z

Dylan Jay wrote at 2023-8-30 22:41 -0700:

... But the PR at least fixes the problem of masking of other errors that happen in other subscribers.

I have not looked at your PR. But, if it not already does, it could close the small hole where an exception in `dict.copy` (only `MemoryError` should be possible) prevents the lock release. Enclose the `copy` in a `try: ... finally:` with the lock release in the `finally` clause. I do not expect that this is the cause of your problem but nevertheless, it would be an improvement. I remember to have had analysis problems similar to yours (I observed a thread apparently continuously at a harmless place) in the past. In my case, the garbage collector was responsible: some third party component had created hundreds of thousand temporary objects causing frequent garbage collector activations. Running in "C", the garbage collector is not seen by stack analysis targeting Python code. During garbage collector runs, 100 % CPU usage is not surprising. Of course, such runs should end (but if you have a large number of objects, a complete GC turn can take some time).

djay · 2023-09-02T01:01:11Z

@d-maurer yes of course I wrapped the copy.

The 100% cpu goes on forever if not restarted. Interesting possible cause but don't think that can be it.

djay · 2023-09-04T03:03:22Z

@d-maurer after deployment the 100% cpu problem still happens.

Collecting samples from '/app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini' (python v3.9.5)amples 10391
Total Samples 10491
GIL: 42.00%, Active: 142.00%, Threads: 5

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)                                                                                                                 
100.00% 100.00%   104.9s    104.9s   run (haufe/requestmonitoring/monitor.py:78)
 42.00%  42.00%   39.64s    39.64s   readable (waitress/server.py:266)
  0.00%   0.00%   0.610s    0.610s   readable (waitress/server.py:274)
  0.00%  42.00%   0.400s    42.92s   loop (waitress/wasyncore.py:245)
  0.00%   0.00%   0.350s    0.350s   writable (waitress/channel.py:89)
  0.00%   0.00%   0.330s    0.460s   handle_write_event (waitress/wasyncore.py:517)
  0.00%   0.00%   0.320s     1.10s   write (waitress/wasyncore.py:113)
  0.00%   0.00%   0.170s    0.170s   execute (relstorage/adapters/sql/query.py:221)
  0.00%   0.00%   0.160s    0.160s   readable (waitress/channel.py:155)
  0.00%   0.00%   0.130s    0.130s   handle_write (waitress/channel.py:95)
  0.00%   0.00%   0.120s    0.120s   handle_write_event (waitress/wasyncore.py:514)
  0.00%   0.00%   0.110s    0.110s   rea


Thread 0x7F39883D4740 (active+gil): "MainThread"
    readable (waitress/server.py:266)
    poll (waitress/wasyncore.py:158)
    loop (waitress/wasyncore.py:245)
    run (waitress/server.py:322)
    serve (waitress/__init__.py:19)
    serve_paste (plone/recipe/zope2instance/ctl.py:917)
    serve (plone/recipe/zope2instance/ctl.py:942)
    serve (Zope2/Startup/serve.py:203)
    run (Zope2/Startup/serve.py:217)
    main (Zope2/Startup/serve.py:251)
    <module> (Zope2/Startup/serve.py:255)
    <module> (interpreter:321)
Thread 0x7F39770AB700 (active): "Dummy-1"
    run (haufe/requestmonitoring/monitor.py:78)
Thread 0x7F39768AA700 (idle): "waitress-0"
    wait (threading.py:312)
    handler_thread (waitress/task.py:72)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)
Thread 0x7F39760A9700 (idle): "waitress-1"
    wait (threading.py:312)
    handler_thread (waitress/task.py:72)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)
Thread 0x7F39758A8700 (idle): "raven-sentry.BackgroundWorker"
    wait (threading.py:312)
    get (sentry_sdk/_queue.py:240)
    _target (sentry_sdk/worker.py:127)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)

except now the line it's hanging on is sleep(self.period).
maybe it was before and I was just looking at the wrong version of the code

            while 1:
                sleep(self.period)
                with _lock:
                    pending = _state.copy()
                if not pending:
                    continue
                monitorTime = time()

but this is just as mysterious since I sleep shouldn't be using any cpu. Unless somehow py-spy is lumping GIL/async stuff into the sleep call and something else is causing the cpu usuage?

or perhaps the other explanation is it's a super tight loop and period is very small but still big enough that the thread always seems to end up on that line.

d-maurer · 2023-09-04T05:51:49Z

Dylan Jay wrote at 2023-9-3 20:03 -0700:

@d-maurer after deployment the 100% cpu problem still happens.

That was to be expected: there are really very few conditions for a `dict.copy` failure.

... lumping GIL/async stuff into the sleep call and something else is causing the cpu usuage?

That depends how "py-spy" determines its data. The "py-spy" output suggests that it is based on "samples" (--> "Samples 10491"), i.e. snapshots. It is natural that snapshots of `Monitor.run` will typically see it in the `sleep` call (where is is almost all the time). You have not yet told us (I think) on what platform you are observing the problem. Should you see it on Linux, then threads are mapped to processes and you should be able to find out which process is continously running. Hopefully, you can tell "py-spy" to include the process/thread relation into its output (or add appropriate diagnostic output) to find out which thread is responsible. In this way, you might be able to determine whether the monitor is responsible or not.

djay · 2023-09-04T07:28:40Z

@d-maurer I included the thread dump but in an edit so you wouldn't have seen it in the notification.


Thread 0x7F39883D4740 (active+gil): "MainThread"
    readable (waitress/server.py:266)
    poll (waitress/wasyncore.py:158)
    loop (waitress/wasyncore.py:245)
    run (waitress/server.py:322)
    serve (waitress/__init__.py:19)
    serve_paste (plone/recipe/zope2instance/ctl.py:917)
    serve (plone/recipe/zope2instance/ctl.py:942)
    serve (Zope2/Startup/serve.py:203)
    run (Zope2/Startup/serve.py:217)
    main (Zope2/Startup/serve.py:251)
    <module> (Zope2/Startup/serve.py:255)
    <module> (interpreter:321)
Thread 0x7F39770AB700 (active): "Dummy-1"
    run (haufe/requestmonitoring/monitor.py:78)
Thread 0x7F39768AA700 (idle): "waitress-0"
    wait (threading.py:312)
    handler_thread (waitress/task.py:72)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)
Thread 0x7F39760A9700 (idle): "waitress-1"
    wait (threading.py:312)
    handler_thread (waitress/task.py:72)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)
Thread 0x7F39758A8700 (idle): "raven-sentry.BackgroundWorker"
    wait (threading.py:312)
    get (sentry_sdk/_queue.py:240)
    _target (sentry_sdk/worker.py:127)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)

I think thats what you mean. if not, not sure what kind of process dump you mean. top just says python.

I did look at the sentry thread to see if it could be related. it's idle on this line - ttps://github.com/python/cpython/blob/3.9/Lib/threading.py#L312

The platform is linux running inside docker.

It is natural that snapshots of Monitor.run will typically
see it in the sleep call (where is is almost all the time).

True, but I wouldn't except to see it at 100% CPU in top and 100% in py-spy...

d-maurer · 2023-09-04T10:02:25Z

Dylan Jay wrote at 2023-9-4 00:28 -0700:

@d-maurer I included the thread dump but in an edit so you wouldn't have seen it in the notification. ... Thread 0x7F39770AB700 (active): "Dummy-1" run (haufe/requestmonitoring/monitor.py:78)

It apparently tells us that `monitor.run` is active. As you wrote, it should not be active in a `sleep` (at least not for a longer period). You you verify that the sleep time (i.e. the monitor's `period`) still is reasonable. For a client, I had created a set of views which allows inline debugging of a running Zope process -- similar to what the so called "monitor server" of `ZServer` allowed (but safer as I used it in a production environment). They allow (under special setup conditions) to interactively [re]import modules and call functions/views. Something like this might help you to analyse the state.

... I think thats what you mean. if not, not sure what kind of process dump you mean. top just says python.

Not precisely: Processes are identified by process ids (i.e. small integers). Many Linux tools work with those process ids (e.g. "kill", "ps", "top", "strace").

... True, but I wouldn't except to see it at 100% CPU in top and 100% in py-spy...

Once you have the process id of the 100 % CPU eater, you can use `strace -p` to learn which system services it calls (not sure how easy this will be with `docker`). You can also try to kill this process with various signals (not sure this will work: when I remember right, Python handles all signals in the main thread).

djay · 2023-09-05T03:07:21Z

@d-maurer I understand processes (or think I did) but single process is all the threads. hense why I used py-spy.
Zope used to dump the threads when you send it sigusr1 but that is zserver not waitress and waitress doesn't have the same feature. but py-spy should give the same result I believe? or I guess it's sampling so maybe not...

This is what two of the waitress instances look like from the os level.

 167393 ubuntu    20   0  940548 375200   8928 R 100.3   2.3   1334:40 python3.9                                                                                             
2274489 ubuntu    20   0  739124 375896  10416 R 100.0   2.3   1149:26 python3.9

strace gives us a lot of

select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999997})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728355736}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718362989}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728501947}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718486917}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728620701}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718605166}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728737692}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718721432}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728853086}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718837226}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728969206}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718953381}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=729085570}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=719069606}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=729201195}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=719206793}) = 0

Does that tell is anything?

djay · 2023-09-05T03:13:55Z

@d-maurer if I look at one of the normal instances I get something similar but much slower output.

strace: Process 3051017 attached
select(17, [13 16], [], [13 16], {tv_sec=0, tv_usec=187813}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883492, tv_nsec=923946154}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392896, tv_nsec=914070438}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883493, tv_nsec=925595671}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392897, tv_nsec=915645056}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883494, tv_nsec=926326532}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392898, tv_nsec=916442025}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 1 (in [13], left {tv_sec=0, tv_usec=540079})
accept4(13, {sa_family=AF_INET, sin_port=htons(39856), sin_addr=inet_addr("10.42.105.85")}, [16], SOCK_CLOEXEC) = 8
getsockname(8, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("10.42.113.157")}, [128->16]) = 0
setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=387151742}) = 0
getsockopt(8, SOL_SOCKET, SO_SNDBUF, [46080], [4]) = 0
ioctl(8, FIONBIO, [1])                  = 0
getpeername(8, {sa_family=AF_INET, sin_port=htons(39856), sin_addr=inet_addr("10.42.105.85")}, [16]) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=387702332}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392899, tv_nsec=377760856}) = 0
select(17, [8 13 16], [], [8 13 16], {tv_sec=1, tv_usec=0}) = 1 (in [8], left {tv_sec=0, tv_usec=999997})
recvfrom(8, "GET /++api++/@subfooter HTTP/1.1"..., 8192, 0, NULL, NULL) = 601
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=388329229}) = 0
futex(0x7f5c601b11a0, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=388931506}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392899, tv_nsec=378963078}) = 0
futex(0x9a5dec, FUTEX_WAKE_PRIVATE, 1)  = 1
futex(0x9a5df0, FUTEX_WAKE_PRIVATE, 1)  = 1
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 1 (in [16], left {tv_sec=0, tv_usec=730065})
read(16, "x", 8192)                     = 1
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=660095688}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392899, tv_nsec=650290069}) = 0
select(17, [13 16], [8], [8 13 16], {tv_sec=1, tv_usec=0}) = 1 (out [8], left {tv_sec=0, tv_usec=999996})
close(8)                                = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=660811840}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392899, tv_nsec=650884792}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883496, tv_nsec=662340141}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392900, tv_nsec=652385383}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883497, tv_nsec=663832251}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392901, tv_nsec=653892531}) = 0

so perhaps it does look like somehow that sleep delay is somehow getting reset down to some really small number like 0 and then we get a tight loop?

d-maurer · 2023-09-05T06:38:24Z

Dylan Jay wrote at 2023-9-4 20:07 -0700:

@d-maurer I understand processes (or think I did) but single process is all the threads. hense why I used py-spy.

You are right. Apparently, a thread is no longer (since Linux 2.4) identified by a pid, but now has an identifying `tid` (obtained via the system call `gettid` and shown as `SPID` for `ps -Tf` and as `LWP` for `ps -Lf`). You likely could call `gettid` via `ctypes` to find out the tid of the monitor thread and one of the `ps` thread options to determine whether the monitor thread is really the one eating all the CPU time. You can use `strace` (with the `-p` option) to trace a thread. This way, you can determine which system calls the CPU eating thread performs. Maybe, this provides sufficient hints.

... but py-spy should give the same result I believe? or I guess it's sampling so maybe not...

I have used "py-spy" only once (to analyse an apparent memory leak). But we both agree that what "py-spy" tells us (a "sleep" using 100 % CPU) should not be possible. Thus, either "py-spy" is mistaken or something really strange happens.

This is what two of the waitress instances look like from the os level. ...

This looks like `top` output. You might be able to add a "tid" output column.

strace gives us a lot of select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999997})

This looks like a system call from the IO thread (waiting on 2 input and one output channel with a 1 s timeout returning because output is possible). You can add a time column to the `strace` command to better understand what goes on. You would want to `strace` the thread eating all the CPU time, not necessarily the IO thread.

d-maurer · 2023-09-05T06:55:36Z

Dylan Jay wrote at 2023-9-4 20:14 -0700:

@d-maurer if I look at one of the normal instances I get something similar but much slower output. ``` strace: Process 3051017 attached select(17, [13 16], [], [13 16], {tv_sec=0, tv_usec=187813}) = 0 (Timeout)

Those `select`s are terminated by a timeout, not by possible output. In your previous `strace` output, the `select` reported possible output but was not followed by a corresponding `write/send` operation. If the `select` is called again, it will see the same situation and immediately return -- resulting in 100 % CPU usage. Your description above ("much slower") indicates that the `select`s follow rapidly in the 100 % CPU usage case. You can add a time column to the `strace` output and quantify this. I think you have found the 100 % CPU usage cause: `select` reports "output possible" but the output is not performed. You now must identify which thread/component is responsible (I suspect the IO thread). I think the monitor thread not responsible: while the `sleep` might be implemented via `select`, it is unlikely that it wants to perform output.

djay · 2023-09-05T07:13:49Z

@d-maurer thank you. now I have something to go on.

btw https://unixhealthcheck.com/blog?id=465 says how to get the TID and strace -fp TID seems to work. just got to wait for it to trip into 100% cpu mode to try it. but not sure how match the TID back to the py-spy output to see what python code is responsible. still not sure how if py-spy says all the other threads are idle we could having this select loop. Or perhaps its still in the monitor thread but the sleep is switching back to the asyc stuff and that would be doing the select stuff.

d-maurer · 2023-09-05T08:15:19Z

Dylan Jay wrote at 2023-9-5 00:13 -0700:

... still not sure how if py-spy says all the other threads are idle we could having this select loop.

I am already convinced that you can forget "py-spy" for this case: its sampling rate is not high enough and therefore, its output is not reliable.

Or perhaps its still in the monitor thread but the sleep is switching back to the asyc stuff and that would be doing the select stuff.

Why should `sleep` do that? It would mean that `waitress` has replaced the standard `sleep` by something very strange (the "async world" has its own `sleep`, there is no real need to replace the standard `sleep`). Even if `sleep` is connected to the "async stuff", if the `select` reports possible IO, the IO must be handled before the next identical `select` -- otherwise, we get an active infinite loop.

d-maurer · 2023-09-06T04:56:25Z

Dylan Jay wrote at 2023-9-4 20:07 -0700:

... strace gives us a lot of ... select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999997}) clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728355736}) = 0 clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718362989}) = 0

I have looked at the `sleep` implementation (in Python 3.10 --> "Modules/timemodule.c:pysleep"). It implements `sleep` via err = select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &timeout); This means: the `select` we see above does not come from `time.sleep`.

djay · 2023-09-11T04:27:51Z

@d-maurer I got an strace with just the threadid that is high cpu and relative timestamps but I don't think it tells us much

[pid 3339237]      0.000094 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706467631}) = 0
[pid 3339237]      0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=345697557}) = 0
[pid 3339237]      0.000076 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
[pid 3339237]      0.000095 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706725008}) = 0
[pid 3339237]      0.000085 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=345955002}) = 0
[pid 3339237]      0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
[pid 3339237]      0.000093 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706980762}) = 0
[pid 3339237]      0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=346210463}) = 0
[pid 3339237]      0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
[pid 3339237]      0.000093 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=707235936}) = 0
[pid 3339237]      0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=346465653}) = 0
[pid 3339237]      0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
[pid 3339237]      0.000094 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=707492042}) = 0
[pid 3339237]      0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=346721763}) = 0
[pid 3339237]      0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})

I couldn't get py-spy to tell us the threadid due to docker issues so that way didn't tell us python code might be causing this.

I did find another sampling profiler I could get to work but it gives a completely different answer.


  Austin  TUI   Wall Time Profile                                                                                             CPU  99% ▇▆▇▇▇▇█▇   MEM 360M ████████    1/7
   _________   Command /app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini
   ⎝__⎠ ⎝__⎠   Python 3.9.0     PID 3339237     PID:TID 3339237:31          
               Samples 62252    ⏲️   12"         Threshold 0%   
  OWN    TOTAL    %OWN   %TOTAL  FUNCTION                                                                                                                                     
  00"     12"      0.0%   99.8%  └─ _bootstrap (/usr/lib/python3.9/threading.py:912)                                                                                         │
  00"     12"      0.0%   99.8%     └─ _bootstrap_inner (/usr/lib/python3.9/threading.py:954)                                                                                │
  00"     12"      0.0%   99.8%        └─ run (/app/lib/python3.9/site-packages/sentry_sdk/integrations/threading.py:70)                                                     │
  00"     12"      0.0%   99.8%           └─ run (/app/eggs/plone.cachepurging-3.0.0a1-py3.9.egg/plone/cachepurging/purger.py:170)                                           │
  00"     12"      0.0%   99.8%              └─ get (/usr/lib/python3.9/queue.py:171)                                                                                        │
  12"     12"     99.8%   99.8%                 └─ wait (/usr/lib/python3.9/threading.py:312)                                                                                │
                                                                                                                                                                             │

but this seems equally unlikely since a wait should also be no cpu.
It is another worker thread I wasn't aware of in
https://github.com/plone/plone.cachepurging/blob/3.0.0a1/plone/cachepurging/purger.py#L170.

One suspicious thing might be the sentry-sdk code as I think we might not have got this issue before we installed sentry... but I'd have to go back and check.
But looking at the code I don't see what it could be doing to cause this.

https://github.com/getsentry/sentry-python/blob/a6e1cbe1e8daf0ff5532ad4dd7eea17b086bfe9d/sentry_sdk/integrations/threading.py#L70

So I still wonder if somehow something happing in c code is running while these threads have released the GIL and these sampling profilers somehow associate that activity with the waiting thread. I don't know how possible that is, or another way to get a more accurate picture of whats python code is the cause.
Previously we looked at the logs to see if there was any correlation between the requests before the cpu went up and couldn't see anything that stands out...

.

davisagli · 2023-09-11T04:37:25Z

I wonder if getting a stack trace using gdb would give any clues about what's happening outside of the Python stack. https://wiki.python.org/moin/DebuggingWithGdb

d-maurer · 2023-09-11T05:36:44Z

Dylan Jay wrote at 2023-9-10 21:28 -0700:

@d-maurer I got an strace with just the threadid that is high cpu and relative timestamps but I don't think it tells us much ``` [pid 3339237] 0.000094 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706467631}) = 0 [pid 3339237] 0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=345697557}) = 0 [pid 3339237] 0.000076 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997}) [pid 3339237] 0.000095 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706725008}) = 0 [pid 3339237] 0.000085 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=345955002}) = 0 [pid 3339237] 0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})

It tells us that the problem is in an IO thread: `select` reports ready output but not output is tried (no "write/send") before the next "select". The IO thread is likely controlled by `waitress` (`ZEO`, too, has IO threads but they usually talk only to a single output channel, the `select` above shows 4). Potentially, an exception has intervened preventing the output operation. I would try to learn how `waitress` reports loop callback exceptions and how to ensure that you would see such reports.

... One suspicious thing might be the sentry-sdk code as I think we might not have got this issue before we installed sentry... but I'd have to go back and check.

You are looking for an IO thread which talks over several output channels. Not sure whether `sentry-sdk` uses such a thread. If something knows running in an "async" environment, then the IO thread may not itself be responsible: the "something" might have put a buggy output callback into the loop. But the IO threads exception reporting should nevertheless give hints.

... So I still wonder if somehow something happing in c code is running while these threads have released the GIL and these sampling profilers somehow associate that activity with the waiting thread.

This might be possible. But, it should be quite rare that C code is interested in IO on 2 input and 4 output channels. I would expect typical C code (e.g. for database connections) to handle a single connection. Potentially interesting: the input "fileno"s referenced in the `select` are small (likely opened very early) while the output "fileno"s are quite large (likely opened significantly later). Another potentially interesting analysis approach: if you have set up your system appropriately, you can attach a running process with `gdb` and analyze its state at the C level interactively. Once the 100 % situation occurs, I would attach `gdb` to the Zope process, put a breakpoint on "select" and analyze the (C level) traceback of the thread stopped by the breakpoint. At least, this should show whether or not the problem comes from C code. There is a `.gdbinit` (somewhere) which defines `gdb` commands allowing to analyze the Python state (e.g. the callback structure) from C level (in the case that the `select` comes from Python). I see aother approach: You can guess which IO thread might be responsible (my guess is the one controlled by `waitress`) and verify the guess by introducing diagnostic output into this thread (e.g. logging the `gettid` value (called via `ctype`)). You can then compare this "TID" to the "PID" of the 100 % CPU thread. My guess above comes from the "fileno"s referenced in the "select". Using the "tid"s might provide another guess approach: the "tid" sequence reveals the order in which the threads have been created. The "main thread" will always have the "process id" as "tid" (the primary info shown by e.g. "ps"). I do not know how Linux assigns the "tid"s but I suspect it will use the first "available" number (tasks and processes share the name domain: small integers). Thus, on a system started recently, a task started earlier will likely get a smaller "tid". This will allow you to guess the order in which the tasks have been started. And this order will be quite stable (at least for the threads started while Zope is starting up). If you know the start index of the problematic thread, you can analyse it outside your production environment (or `strace` it in your production enviroment before it enters the 100 % CPU usage to lean what it does under "normal" conditions).

djay · 2023-09-11T05:41:12Z

@d-maurer @d-maurer I was using austin wrong. You can move through the threads and the one with the internal threadid that corresponded with strace was this thread

  Austin  TUI   Wall Time Profile                                                                                             CPU  98% ▇█▇▇▇▇█▇   MEM 366M ████████    7/7
   _________   Command /app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini
   ⎝__⎠ ⎝__⎠   Python 3.9.0     PID 3339237     PID:TID 3339237:12          
               Samples 2374164  ⏲️   5'05"       Threshold 0%   
  OWN    TOTAL    %OWN   %TOTAL  FUNCTION                                                                                                                                     
  00"    5'05"     0.0%  100.0%  <module> (/app/parts/instance/bin/interpreter:321)                                                                                          │
  00"    5'05"     0.0%  100.0%  <module> (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:255)                                                                          │
  00"    5'05"     0.0%  100.0%  main (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:251)                                                                              │
  00"    5'05"     0.0%  100.0%  run (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:217)                                                                               │
  00"    5'05"     0.0%  100.0%  serve (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:203)                                                                             │
  00"    5'05"     0.0%  100.0%  serve (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:942)                                         │
  00"    5'05"     0.0%  100.0%  serve_paste (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:917)                                   │
  00"    5'05"     0.0%  100.0%  serve (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/__init__.py:19)                                                                          │
  00"    5'05"     0.0%  100.0%  run (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:322)                                                                             │
  03"    5'05"     0.9%   99.9%  loop (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:245)                                                                         │
  17"     36"      5.7%   11.7%  poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:159)                                                                         │
  02"     02"      0.7%    0.7%  readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:290)                                                                        │

EDIT: note this thread wasn't stuck. it was switching between reading and polling etc. if you match it to py-spy its the one that said "Thread 0x7F39883D4740 (active+gil): "MainThread".

You can see the internal thread id here (by running top -H inside the running container)

top - 15:39:32 up 10 days, 15:57,  0 users,  load average: 1.22, 1.69, 1.88
Threads:  10 total,   2 running,   8 sleeping,   0 stopped,   0 zombie
%Cpu(s): 23.5 us,  3.8 sy,  0.0 ni, 71.5 id,  0.0 wa,  0.0 hi,  1.0 si,  0.2 st
MiB Mem :  15986.9 total,    415.6 free,  13702.8 used,   1868.5 buff/cache
MiB Swap:  16384.0 total,  15806.7 free,    577.2 used.   1949.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                               
     12 plone     20   0 1127736 346516  14464 R  99.0   2.1   1785:12 python3.9                                                                                             
     27 plone     20   0 1127736 346516  14464 S   1.7   2.1  14:17.64 python3.9                                                                                             
      1 plone     20   0    2616      0      0 S   0.0   0.0   0:00.03 sh                                                                                                    
     26 plone     20   0 1127736 346516  14464 S   0.0   2.1   0:03.91 python3.9                                                                                             
     28 plone     20   0 1127736 346516  14464 S   0.0   2.1  14:22.74 python3.9                                                                                             
     29 plone     20   0 1127736 346516  14464 S   0.0   2.1   0:58.88 python3.9                                                                                             
     30 plone     20   0 1127736 346516  14464 S   0.0   2.1   0:00.22 python3.9                                                                                             
     31 plone     20   0 1127736 346516  14464 S   0.0   2.1   0:00.22 python3.9                                                                                             
     32 plone     20   0    4116   3044   2668 S   0.0   0.0   0:00.02 bash                                                                                                  
     42 plone     20   0    6124   3132   2632 R   0.0   0.0   0:00.00 top

djay · 2023-09-11T06:08:48Z

@d-maurer @davisagli not sure why but I couldn't get gdb to load the python symbols

$ sudo docker run --rm --pid=host -it --cap-add=SYS_PTRACE -p 3000:3000 python:3.9.0 bash -c 'apt-get update && apt-get install gdb && gdb python  3339237'
...
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
Attaching to program: /usr/local/bin/python, process 3339237
[New LWP 3339497]
[New LWP 3339498]
[New LWP 3339499]
[New LWP 3342706]
[New LWP 611493]
[New LWP 611494]

warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.

warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
0x000000000059938c in ?? ()
(gdb) bt
#0  0x000000000059938c in ?? ()
#1  0x00000000022a4310 in ?? ()
#2  0x00007f80d71df050 in ?? ()
#3  0x00000000022a4358 in ?? ()
#4  0x00000000047226f8 in ?? ()
#5  0x00007f80d70b07d0 in ?? ()
#6  0x00007f80d70c6980 in ?? ()
#7  0x00007f80d70c7710 in ?? ()
#8  0x00007f80d71df07c in ?? ()
#9  0xac9d42930de1dc62 in ?? ()
#10 0x00007f80d7e74500 in ?? ()
#11 0x00007f80d71df1a8 in ?? ()
#12 0xffffffff00000000 in ?? ()
#13 0x0000000004722748 in ?? ()
#14 0x0000000000000250 in ?? ()
#15 0x0000000000000008 in ?? ()
#16 0x0000003800000023 in ?? ()
#17 0x00000000ffffffff in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb) py-bt
Undefined command: "py-bt".  Try "help".

instead I tried pdb-attach but it seemed to crash the process so now I have to wait some days before it go back to switch into this 100% mode again :(

sudo docker run --rm --pid=host -it --cap-add=SYS_PTRACE -p 30000:30000 python:3.9.0 bash -c 'pip install pdb-attach && python -m pdb_attach  3339237 30000'

d-maurer · 2023-09-11T06:16:50Z

Dylan Jay wrote at 2023-9-10 22:41 -0700:

@d-maurer @d-maurer I was using austin wrong. You can move through the threads and the one with the internal threadid that corresponded with strace was this thread ... 00" 5'05" 0.0% 100.0% run (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:322) 03" 5'05" 0.9% 99.9% loop (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:245) 17" 36" 5.7% 11.7% poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:159) 02" 02" 0.7% 0.7% readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:290)

Good! This shows: the main thread (i.e. `waitress`) is suffering the active loop. We now must find out why. The `strace` shows us: `select` reports possible `write` but no write is performed before the next `select`. `waitress.wasyncore.write` is responsible for the `write` operation: ```python def write(obj): try: obj.handle_write_event() except _reraised_exceptions: raise except: obj.handle_error() ``` I assume that the `handle_write_event` was prevented by an exception from performing the `write` and a bug in `handle_error` caused a cleanup failure (removing the object from `socket_map`). I suggest to change the code above to get diagnostic information about any exception in `handle_write_event` (which exception, object class, object).

djay · 2023-09-11T08:10:22Z

@d-maurer a little more info. This shows time spent on calls over a period of time, so it's not just static in time.

 Austin  TUI   Wall Time Profile                                                                                             CPU  99% ▇█▇█▇▇▇▇   MEM 263M ████████    5/5
   _________   Command /app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini
   ⎝__⎠ ⎝__⎠   Python 3.9.0     PID 3351466     PID:TID 3351466:11          
               Samples 1451365  ⏲️   3'16"       Threshold 0%   
  OWN    TOTAL    %OWN   %TOTAL  FUNCTION                                                                                                                                     
  00"    3'16"     0.0%  100.0%  └─ <module> (/app/parts/instance/bin/interpreter:326)                                                                                       ▒
  00"    3'16"     0.0%  100.0%     └─ <module> (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:255)                                                                    ▒
  00"    3'16"     0.0%  100.0%        └─ main (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:251)                                                                     ▒
  00"    3'16"     0.0%  100.0%           └─ run (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:217)                                                                   │
  00"    3'16"     0.0%  100.0%              └─ serve (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:203)                                                              │
  00"    3'16"     0.0%  100.0%                 └─ serve (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:942)                       │
  00"    3'16"     0.0%  100.0%                    └─ serve_paste (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:917)              │
  00"    3'16"     0.0%  100.0%                       └─ serve (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/__init__.py:19)                                                  │
  00"    3'16"     0.0%  100.0%                          └─ run (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:322)                                                  │
  05"    3'16"     2.5%   99.9%                             ├─ loop (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:245)                                           │
  36"     44"     18.3%   22.4%                             │  ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:158)                                        │
  05"     05"      2.4%    2.4%                             │  │  ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:290)                                    │
  01"     02"      0.4%    0.9%                             │  │  ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:117)                                    │
  01"     01"      0.4%    0.5%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)                    │
  00"     00"      0.0%    0.0%                             │  │  │  │  ├─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:95)                          │
  00"     00"      0.0%    0.0%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:514)                    │
  00"     00"      0.0%    0.0%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:515)                    │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:150)                                  │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)                    │
  00"     00"      0.0%    0.0%                             │  │  │     └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.1%                             │  │  ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:113)                                    │
  00"     00"      0.1%    0.1%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)                    │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)                    │
  00"     00"      0.1%    0.1%                             │  │  ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:154)

Before I go changing the waitress code in prod (since I don't know the cause so can't reproduce this), I'll look at the code and logs too see if anything could be causing what you describe.

d-maurer · 2023-09-11T08:55:58Z

Dylan Jay wrote at 2023-9-11 01:10 -0700:

... Before I go changing the waitress code in prod (since I don't know the cause so can't reproduce this), I'll look at the code and logs too see if anything could be causing what you describe.

One idea: potentially, you are seeing a DOS attack where the attacker closes the socket in a non standard way (e.g. only partially) and `waitress` gets in an inconsistent state in that case. We do not see a `write/send`, i.e. the inconsistency would get effective earlier (preventing the OS level output operation). When you change the code, try also to get at the traceback.

djay · 2023-09-11T10:11:03Z

@d-maurer looks like you are right

https://github.com/Pylons/waitress/blob/main/src/waitress/channel.py#L98

is visited a lot which means it's in a loop where its trying to write but can't because the socket internally marked as not in connected state.

    def handle_write(self):
        # Precondition: there's data in the out buffer to be sent, or
        # there's a pending will_close request

        if not self.connected:
            # we dont want to close the channel twice

            return

EDIT: seems it could be due to an empty read here https://github.com/Pylons/waitress/blob/main/src/waitress/channel.py#L176. I can't see how it gets out of the loop and actually closes the connection if it hits that line.

djay · 2023-09-11T10:50:47Z

I'm 90% sure thats the cause so I've created issue at Pylons/waitress#418 but not sure on the best solution.

d-maurer · 2023-09-11T11:46:31Z

Dylan Jay wrote at 2023-9-11 03:50 -0700:

I'm 90% sure thats the cause so I've created issue at Pylons/waitress#418 but not sure on the best solution.

The error is where `self.connected` was set tu `False`. There, it should have been ensured that the corresponding "fileno" is removed von `socket_map` and that it will not be put there again (as long as `self.connected` remains `False`). Something exceptional must have brought `waitress` into this state (otherwise, we would have lots of 100 % CPU usage reports). I assume that some bad client has used the system call `shutdown` to close only part of the socket connection and that `waitress`does not anticipate something like that.

djay linked a pull request Aug 22, 2023 that will close this issue

fix exceptions causing end requests without start requests (and then causing 100% cpu on lock.acquire... maybe) #16

Open

djay mentioned this issue Sep 11, 2023

100% cpu in mainthread due to not closing properly? (channel.connected == False) Pylons/waitress#418

Closed

using 100% trying to acquire lock #15

using 100% trying to acquire lock #15

Comments

djay commented Aug 22, 2023

djay commented Aug 22, 2023

djay commented Aug 22, 2023

djay commented Aug 23, 2023 • edited Loading

zopyx commented Aug 23, 2023

d-maurer commented Aug 23, 2023

d-maurer commented Aug 23, 2023

djay commented Aug 23, 2023

jsbueno commented Aug 23, 2023

d-maurer commented Aug 23, 2023 via email

djay commented Aug 24, 2023

djay commented Aug 24, 2023 • edited Loading

d-maurer commented Aug 24, 2023 via email

djay commented Aug 31, 2023

d-maurer commented Aug 31, 2023 via email

djay commented Sep 2, 2023

djay commented Sep 4, 2023 • edited Loading

d-maurer commented Sep 4, 2023 via email

djay commented Sep 4, 2023

d-maurer commented Sep 4, 2023 via email

djay commented Sep 5, 2023

djay commented Sep 5, 2023

d-maurer commented Sep 5, 2023 via email

d-maurer commented Sep 5, 2023 via email

djay commented Sep 5, 2023

d-maurer commented Sep 5, 2023 via email

d-maurer commented Sep 6, 2023 via email

djay commented Sep 11, 2023

davisagli commented Sep 11, 2023

d-maurer commented Sep 11, 2023 via email

djay commented Sep 11, 2023 • edited Loading

djay commented Sep 11, 2023

d-maurer commented Sep 11, 2023 via email

djay commented Sep 11, 2023

d-maurer commented Sep 11, 2023 via email

djay commented Sep 11, 2023 • edited Loading

djay commented Sep 11, 2023

d-maurer commented Sep 11, 2023 via email

djay commented Aug 23, 2023 •

edited

Loading

djay commented Aug 24, 2023 •

edited

Loading

djay commented Sep 4, 2023 •

edited

Loading

djay commented Sep 11, 2023 •

edited

Loading

djay commented Sep 11, 2023 •

edited

Loading