I had a computation running smoothly for many minutes on a Coiled cluster. I had ~10 JupyterLab panes open showing various parts of the dashboard, including the cluster map (pew-pew).
The computation got to a point where task throughput suddenly increased significantly, with many tasks completing per second. I noticed that all the dashboard plots became very laggy and started going blank for many seconds at a time. Since plots such as the Worker CPU timeseries stopped advancing in time, it seems like the scheduler was overwhelmed by something.
I looked at the scheduler logs and happened to notice many errors thrown from distributed/diagnostics/websocket.py (line 64, at commit 8cff8b7). Unfortunately I've now lost the logs, but I remember the error was something about the WebSocket being closed.
During the brief flashes when the worker memory plot reappeared, I saw that many of the workers were under significant memory pressure, largely due to managed memory. I'm guessing this is related to #5114: when the scheduler is under load, it can't tell workers to release keys, so memory keeps piling up on the workers.
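As an aside, a dashboard-independent way to watch keys piling up is to ask each worker how many keys it is holding. This is only an illustrative sketch assuming a connected Client (I did not actually collect this at the time):

```python
# Count the keys held in each worker's data store, bypassing the dashboard.
# Assumes a reachable scheduler; the address below is a placeholder.
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address
key_counts = client.run(lambda dask_worker: len(dask_worker.data))
print(key_counts)  # {worker_address: number_of_keys, ...}
```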
So the theory is:
1. an exception from the WebSocket happened at every transition (of which there were many)
2. handling this slowed down the scheduler a lot
3. workers piled up extra keys because of the hobbled scheduler
But this is just a theory. Does it seem at all plausible?

cc @jacobtomlinson

Environment: main-ish
Sounds plausible. The pew-pew plot consumes from the diagnostics websocket; if you don't open the pew-pew plot, the websocket plugin never gets registered.
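For context, here is roughly how such a plugin hooks in. This is a simplified, hypothetical sketch rather than the actual contents of distributed/diagnostics/websocket.py; the class name and send call are illustrative:

```python
# Simplified sketch of a scheduler plugin that forwards transitions to a
# websocket.  Every task state change triggers one send on the socket, so a
# dead connection means one raised/logged exception per transition.
import json

from distributed.diagnostics.plugin import SchedulerPlugin


class WebsocketForwarder(SchedulerPlugin):
    """Hypothetical plugin that pushes transition events to a browser socket."""

    def __init__(self, socket):
        self.socket = socket  # e.g. a tornado WebSocketHandler

    def transition(self, key, start, finish, *args, **kwargs):
        message = {"name": "transition", "key": key, "start": start, "finish": finish}
        self.socket.write_message(json.dumps(message))  # raises if the socket is closed
```

If that send raises on every transition, a burst of fast-completing tasks multiplies the cost exactly when the scheduler is busiest.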
A few thoughts here:
A reproducer would be really helpful in debugging this.
Checking if the socket is open before trying to send may avoid this.
I've never seen the scheduler close out a websocket connection like this, but if it has happened we should improve the reconnect logic.
I haven't tried to reproduce. I'm also not sure how the websocket connection would get closed—would just closing the pew-pew plot / the whole JupyterLab page / quitting the browser do it?
To confirm the theory, I'd just add some scheduler code that closes the websocket after 10s or something. Then open the plot, run a computation, and see what happens.
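Something along these lines, as an untested sketch: it assumes the plugin keeps its tornado handler on a socket attribute, and that scheduler.plugins is the collection of registered plugins (a dict on recent versions, a list on older ones):

```python
import asyncio


async def kill_diagnostics_sockets(scheduler, delay=10):
    """Close any diagnostics websocket after ~10s, to fake the browser going away."""
    await asyncio.sleep(delay)
    plugins = scheduler.plugins
    plugins = plugins.values() if isinstance(plugins, dict) else plugins
    for plugin in list(plugins):
        socket = getattr(plugin, "socket", None)
        if socket is not None:
            socket.close()  # later sends from the plugin should now fail
```

Then open the pew-pew plot, run a computation with lots of fast tasks, and see whether the scheduler bogs down the same way.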
I'd also say wrapping that socket.send in a try/except would just be reasonable to do in general. We should also check if the socket is closed, and if so, maybe un-register the plugin entirely.
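Something like this, as a sketch of the idea; the attribute and method names are illustrative rather than a patch against the real websocket.py, and the exact deregistration call varies across versions:

```python
import json

from distributed.diagnostics.plugin import SchedulerPlugin


class DefensiveWebsocketForwarder(SchedulerPlugin):
    """Hypothetical variant that stops forwarding once the socket is dead."""

    def __init__(self, scheduler, socket):
        self.scheduler = scheduler
        self.socket = socket

    def transition(self, key, start, finish, *args, **kwargs):
        if self.socket is None:
            return  # we already gave up on this connection
        payload = json.dumps({"name": "transition", "key": key, "start": start, "finish": finish})
        try:
            self.socket.write_message(payload)
        except Exception:
            # The browser probably went away; stop paying a per-transition cost.
            self.socket = None
            try:
                self.scheduler.remove_plugin(self)  # placeholder: signature differs by version
            except Exception:
                pass
```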