Pew-pew plot maybe? cripples cluster if WebSocket disconnects #5270

Open
gjoseph92 opened this issue Aug 26, 2021 · 3 comments
Labels
needs info Needs further information from the user

Comments

@gjoseph92
Collaborator

I had a computation running smoothly for many minutes on a Coiled cluster. I had ~10 JupyterLab panes open showing various parts of the dashboard, including the cluster map (pew-pew).

The computation got to a point where task throughput suddenly increased significantly—many tasks completed per second. I noticed that all the dashboard plots became very laggy, started going blank for many seconds, etc. Since plots such as Worker CPU timeseries stopped advancing in time, it seems like the scheduler was very overwhelmed by something.

I looked at scheduler logs and happened to notice many errors thrown from

self.socket.send("transition", data)

Unfortunately I've now lost the logs, but I remember the error was something about the WebSocket being closed.

During brief flashes when the worker memory plot reappeared, I saw that many of the workers were under significant memory pressure, largely due to managed memory. I'm guessing this is related to #5114: when the scheduler is under load, it can't tell workers to release keys, so they keep piling up memory.

So the theory is:

  • an exception from the WebSocket happened at every transition (of which there were many)
  • handling this slowed down the scheduler a lot
  • workers piled up extra keys because of the hobbled scheduler

But this is just a theory. Does it seem at all plausible?
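
One rough way to gauge the first two points of the theory is to time a loop of simulated transitions against a socket that always raises, logging a traceback each time, and compare it with a socket that accepts every message. This is only a sketch: `FakeSocket`, `ClosedSocketError`, `run_transitions`, and `make_logger` are hypothetical stand-ins invented for illustration, not the real scheduler or WebSocket classes, and the real error type is unknown because the logs were lost.

```python
import io
import logging
import time


class ClosedSocketError(Exception):
    """Hypothetical error type; the real exception from the lost logs is unknown."""


class FakeSocket:
    """Hypothetical stand-in for the dashboard's WebSocket handler."""

    def __init__(self, closed):
        self.closed = closed

    def send(self, name, data):
        if self.closed:
            raise ClosedSocketError("socket is closed")


def make_logger():
    """Build a logger that formats records into memory, so tracebacks are still rendered."""
    logger = logging.getLogger("pewpew-sketch")
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler(io.StringIO()))
    logger.propagate = False
    return logger


def run_transitions(socket, n, logger):
    """Call send once per simulated transition; log a traceback on each failure."""
    failures = 0
    start = time.perf_counter()
    for i in range(n):
        try:
            socket.send("transition", {"key": f"task-{i}"})
        except ClosedSocketError:
            failures += 1
            logger.exception("send failed")
    return failures, time.perf_counter() - start


if __name__ == "__main__":
    logger = make_logger()
    _, open_time = run_transitions(FakeSocket(closed=False), 10_000, logger)
    _, closed_time = run_transitions(FakeSocket(closed=True), 10_000, logger)
    print(f"open socket: {open_time:.4f}s, closed socket: {closed_time:.4f}s")
```

The absolute numbers from a toy loop like this say nothing about the real scheduler. It only shows how the per-transition cost of raising and logging could be compared with the cost of a successful send.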

cc @jacobtomlinson

Environment:

  • Dask version: main-ish
  • Python version: 3.9.6
  • Operating System: linux (cluster)
  • Install method (conda, pip, source): source
@jacobtomlinson
Member

Sounds plausible. The pew-pew plot consumes from the diagnostics websocket. If you don't open the pew-pew plot the websocket plugin never gets registered.

A few thoughts here:

  • A reproducer would be really helpful in debugging this.
  • Checking if the socket is open before trying to send may avoid this.
  • I've never seen the scheduler close out a websocket connection like this, but if it has happened we should improve the reconnect logic.
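
The second point above, checking that the socket is open before sending, could look roughly like the following. It is a minimal sketch under stated assumptions: `FakeSocket`, its `is_open` attribute, and `send_if_open` are hypothetical names for illustration. How the real handler exposes its open or closed state is not shown in this thread.

```python
class FakeSocket:
    """Hypothetical stand-in for the dashboard's WebSocket handler."""

    def __init__(self):
        self.is_open = True  # assumed attribute, not a confirmed API
        self.sent = []

    def close(self):
        self.is_open = False

    def send(self, name, data):
        if not self.is_open:
            raise RuntimeError("WebSocket is closed")
        self.sent.append((name, data))


def send_if_open(socket, name, data):
    """Send only when the socket reports itself open; return True if sent."""
    if not socket.is_open:
        return False
    socket.send(name, data)
    return True


if __name__ == "__main__":
    socket = FakeSocket()
    print(send_if_open(socket, "transition", {"key": "task-0"}))
    socket.close()
    print(send_if_open(socket, "transition", {"key": "task-1"}))
```

A check like this cannot rule out the connection closing between the check and the send, so it reduces the number of exceptions rather than replacing error handling around the send itself.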

@jrbourbeau
Member

Just checking in. @gjoseph92 were you able to reproduce this behavior? Do you have any thoughts on how we should move forward with this issue?

@jrbourbeau added the "needs info" label Oct 14, 2021
@gjoseph92
Collaborator Author

I haven't tried to reproduce. I'm also not sure how the websocket connection would get closed—would just closing the pew-pew plot / the whole JupyterLab page / quitting the browser do it?

To confirm the theory, I'd just add some scheduler code that closes the websocket after 10s or something. Then open the plot, run a computation, and see what happens.
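
The shape of that experiment can be sketched without a cluster, assuming a stand-in socket that is closed by a timer while transitions keep arriving. Everything here is hypothetical scaffolding (`FakeSocket`, `simulate`, the use of `ConnectionError`), and the delays are shortened from the 10s mentioned above so the example finishes quickly. A real test would close the actual WebSocket on the scheduler instead.

```python
import asyncio


class FakeSocket:
    """Hypothetical stand-in for the dashboard's WebSocket handler."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def send(self, name, data):
        if not self.is_open:
            # Assumed error type; the real exception is not known from the thread.
            raise ConnectionError("WebSocket is closed")


async def simulate(close_after, total_time, interval):
    """Emit a transition every `interval` seconds; close the socket after `close_after`."""
    socket = FakeSocket()
    loop = asyncio.get_running_loop()
    loop.call_later(close_after, socket.close)  # timer that kills the socket

    sent = failed = 0
    deadline = loop.time() + total_time
    while loop.time() < deadline:
        try:
            socket.send("transition", {"key": f"task-{sent + failed}"})
            sent += 1
        except ConnectionError:
            failed += 1
        await asyncio.sleep(interval)
    return sent, failed


if __name__ == "__main__":
    sent, failed = asyncio.run(simulate(close_after=0.1, total_time=0.3, interval=0.01))
    print(f"sent={sent} failed={failed}")
```

Once the timer fires, every later transition fails, which is the condition the theory says the scheduler was stuck in.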

I'd also say wrapping that socket.send in a try/except would just be reasonable to do in general. We should also check if the socket is closed, and if so, maybe un-register the plugin entirely.
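
A minimal sketch of that idea, assuming stand-in classes in place of the real scheduler and WebSocket handler: the plugin catches the send failure and removes itself, so later transitions no longer pay for it. `FakeScheduler`, `FakeSocket`, `GuardedWebsocketPlugin`, and the choice of `ConnectionError` are all hypothetical. The real plugin registration API and exception type may differ.

```python
class FakeSocket:
    """Hypothetical stand-in for the dashboard's WebSocket handler."""

    def __init__(self):
        self.is_open = True
        self.sent = []

    def close(self):
        self.is_open = False

    def send(self, name, data):
        if not self.is_open:
            raise ConnectionError("WebSocket is closed")  # assumed error type
        self.sent.append((name, data))


class FakeScheduler:
    """Hypothetical stand-in that forwards each transition to its plugins."""

    def __init__(self):
        self.plugins = []

    def add_plugin(self, plugin):
        self.plugins.append(plugin)

    def remove_plugin(self, plugin):
        self.plugins.remove(plugin)

    def transition(self, key, start, finish):
        # Iterate over a copy so a plugin can remove itself mid-loop.
        for plugin in list(self.plugins):
            plugin.transition(key, start, finish)


class GuardedWebsocketPlugin:
    """Forwards transitions to a socket and un-registers itself once a send fails."""

    def __init__(self, socket, scheduler):
        self.socket = socket
        self.scheduler = scheduler

    def transition(self, key, start, finish, *args, **kwargs):
        data = {"key": key, "start": start, "finish": finish}
        try:
            self.socket.send("transition", data)
        except ConnectionError:
            # The socket is gone: stop receiving transitions entirely.
            self.scheduler.remove_plugin(self)


if __name__ == "__main__":
    scheduler, socket = FakeScheduler(), FakeSocket()
    scheduler.add_plugin(GuardedWebsocketPlugin(socket, scheduler))
    scheduler.transition("task-0", "processing", "memory")
    socket.close()
    scheduler.transition("task-1", "processing", "memory")
    print(len(scheduler.plugins), len(socket.sent))
```

With this shape, a closed socket costs one exception in total rather than one per transition.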


3 participants