I had a computation running smoothly for many minutes on a Coiled cluster. I had ~10 JupyterLab panes open showing various parts of the dashboard, including the cluster map (pew-pew).
The computation got to a point where task throughput suddenly increased significantly, with many tasks completing per second. I noticed that all the dashboard plots became very laggy and started going blank for many seconds at a time. Since plots such as the Worker CPU timeseries stopped advancing in time, it seems like the scheduler was overwhelmed by something.
I looked at the scheduler logs and happened to notice many errors thrown from distributed/diagnostics/websocket.py (line 64, at commit 8cff8b7). Unfortunately I've now lost the logs, but I remember the error was something about the WebSocket being closed.
During the brief flashes when the worker memory plot reappeared, I saw that many of the workers were under significant memory pressure, largely due to managed memory. I'm guessing this is related to #5114: when the scheduler is under load, it can't tell workers to release keys, so memory keeps piling up on the workers.
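As an aside, a dashboard-independent way to watch keys piling up is to ask each worker how many keys it is holding. This is only an illustrative sketch assuming a connected Client (I did not actually collect this at the time):

```python
# Count the keys held in each worker's data store, bypassing the dashboard.
# Assumes a reachable scheduler; the address below is a placeholder.
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address
key_counts = client.run(lambda dask_worker: len(dask_worker.data))
print(key_counts)  # {worker_address: number_of_keys, ...}
```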
So the theory is:
1. an exception from the WebSocket happened at every transition (of which there were many)
2. handling this slowed down the scheduler a lot
3. workers piled up extra keys because of the hobbled scheduler
But this is just a theory. Does it seem at all plausible?

cc @jacobtomlinson

Environment: main-ish
Sounds plausible. The pew-pew plot consumes from the diagnostics websocket; if you don't open the pew-pew plot, the websocket plugin never gets registered.
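For context, here is roughly how such a plugin hooks in. This is a simplified, hypothetical sketch rather than the actual contents of distributed/diagnostics/websocket.py; the class name and send call are illustrative:

```python
# Simplified sketch of a scheduler plugin that forwards transitions to a
# websocket.  Every task state change triggers one send on the socket, so a
# dead connection means one raised/logged exception per transition.
import json

from distributed.diagnostics.plugin import SchedulerPlugin


class WebsocketForwarder(SchedulerPlugin):
    """Hypothetical plugin that pushes transition events to a browser socket."""

    def __init__(self, socket):
        self.socket = socket  # e.g. a tornado WebSocketHandler

    def transition(self, key, start, finish, *args, **kwargs):
        message = {"name": "transition", "key": key, "start": start, "finish": finish}
        self.socket.write_message(json.dumps(message))  # raises if the socket is closed
```

If that send raises on every transition, a burst of fast-completing tasks multiplies the cost exactly when the scheduler is busiest.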
A few thoughts here:
A reproducer would be really helpful in debugging this.
Checking if the socket is open before trying to send may avoid this.
I've never seen the scheduler close out a websocket connection like this, but if it has happened we should improve the reconnect logic.
I haven't tried to reproduce. I'm also not sure how the websocket connection would get closed—would just closing the pew-pew plot / the whole JupyterLab page / quitting the browser do it?
To confirm the theory, I'd just add some scheduler code that closes the websocket after 10s or something. Then open the plot, run a computation, and see what happens.
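Something along these lines, as an untested sketch: it assumes the plugin keeps its tornado handler on a socket attribute, and that scheduler.plugins is the collection of registered plugins (a dict on recent versions, a list on older ones):

```python
import asyncio


async def kill_diagnostics_sockets(scheduler, delay=10):
    """Close any diagnostics websocket after ~10s, to fake the browser going away."""
    await asyncio.sleep(delay)
    plugins = scheduler.plugins
    plugins = plugins.values() if isinstance(plugins, dict) else plugins
    for plugin in list(plugins):
        socket = getattr(plugin, "socket", None)
        if socket is not None:
            socket.close()  # later sends from the plugin should now fail
```

Then open the pew-pew plot, run a computation with lots of fast tasks, and see whether the scheduler bogs down the same way.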
I'd also say wrapping that socket.send in a try/except would just be reasonable to do in general. We should also check if the socket is closed, and if so, maybe un-register the plugin entirely.
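Something like this, as a sketch of the idea; the attribute and method names are illustrative rather than a patch against the real websocket.py, and the exact deregistration call varies across versions:

```python
import json

from distributed.diagnostics.plugin import SchedulerPlugin


class DefensiveWebsocketForwarder(SchedulerPlugin):
    """Hypothetical variant that stops forwarding once the socket is dead."""

    def __init__(self, scheduler, socket):
        self.scheduler = scheduler
        self.socket = socket

    def transition(self, key, start, finish, *args, **kwargs):
        if self.socket is None:
            return  # we already gave up on this connection
        payload = json.dumps({"name": "transition", "key": key, "start": start, "finish": finish})
        try:
            self.socket.write_message(payload)
        except Exception:
            # The browser probably went away; stop paying a per-transition cost.
            self.socket = None
            try:
                self.scheduler.remove_plugin(self)  # placeholder: signature differs by version
            except Exception:
                pass
```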