Profile Cythonization work #4430

Open
mrocklin opened this issue Jan 14, 2021 · 3 comments

mrocklin (Member) commented Jan 14, 2021

We've been optimizing the scheduler for larger workloads recently. We've found that profiling the scheduler is challenging to do well. A lot of the recent profiling has happened on NVIDIA systems, which may or may not be representative of typical hardware.
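
For reference, one rough way to get a profile out of the running scheduler is to start and stop cProfile on it via Client.run_on_scheduler. This is only a sketch: it assumes a connected client like the one created in the snippet below, and the _cprofile attribute is just an ad-hoc place to stash the profiler, not part of the Scheduler API.

import cProfile
import io
import pstats

def start_profile(dask_scheduler=None):
    # Functions passed to run_on_scheduler receive the Scheduler instance
    # via the dask_scheduler keyword
    prof = cProfile.Profile()
    prof.enable()
    dask_scheduler._cprofile = prof  # ad-hoc attribute, just a place to stash the profiler

def stop_profile(dask_scheduler=None):
    prof = dask_scheduler._cprofile
    prof.disable()
    buf = io.StringIO()
    pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(30)
    return buf.getvalue()

client.run_on_scheduler(start_profile)
# ... run the workload ...
print(client.run_on_scheduler(stop_profile))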

I recently ran a small experiment on AWS with the following code:

import coiled

# Build a software environment with dask and distributed from their development branches
coiled.create_software_environment(
    "dev",
    conda=["python=3.8", "dask", "lz4", "python-blosc"],
    pip=["git+https://github.com/dask/dask", "git+https://github.com/dask/distributed"],
)

cluster = coiled.Cluster(n_workers=100, software="dev")

from dask.distributed import Client, performance_report
client = Client(cluster)

import dask
import dask.dataframe as dd

# Disable task fusion so the scheduler sees the full, unfused graph
dask.config.set({"optimization.fuse.active": False})

# A year of per-minute data in hourly partitions, then an all-to-all shuffle on "x"
df = dask.datasets.timeseries(start="2020-01-01", end="2020-12-31", partition_freq="1h", freq="60s")
df2 = dd.shuffle.shuffle(df, "x")

with performance_report(filename="report.html"):
    df3 = df2.sum().compute()

I found that:

  1. The scheduler CPU was pegged at 100%
  2. The majority of the time was spent in communication
  3. The computation didn't actually finish; I suspect the worker TaskState issue in 2020.12.0, but I can't be sure

It would be good to try a few things here:

  1. Try bringing in @fjetter's recent PR Deadlocks and infinite loops connected to failed dependencies #4360 to see if it resolves the deadlocks
  2. Try compiling with Cython to see if that helps at all; this will probably require making a Docker image that installs distributed with the appropriate flag (cc @jakirkham, do we have instructions for this somewhere?) and using that instead of a Coiled-built image (see the sketch after this list)
  3. Actually produce performance reports and publish them on this issue
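
For item 2, the exact build flag for a Cythonized distributed is the part I'm unsure about (hence the cc above), so treat the command in the comment below as an assumption to check against the docs. Whatever image we end up with, it's worth confirming on the running scheduler that the compiled extension is actually what's loaded:

# In the Docker image, distributed would need to be built with its Cython
# support enabled; the exact invocation is an assumption here (something like
# `python setup.py build_ext --with-cython`), so verify it against the current docs.
# Once the cluster is up, confirm the compiled module is actually in use:

def scheduler_module_path(dask_scheduler=None):
    import distributed.scheduler as mod
    # Ends in .so/.pyd when the Cython extension is loaded, .py when pure Python
    return mod.__file__

print(client.run_on_scheduler(scheduler_module_path))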

After that we need to consider what to do about TLS communication performance. Is this just because we're using TLS rather than TCP? If so, is there anything that we can do to accelerate this? Maybe asyncio is faster? Maybe Tornado can be improved?
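
As a starting point for separating protocol overhead from TLS cost, a tiny round-trip benchmark over plain asyncio streams gives a baseline to compare against the same loop run through distributed's comm layer on tcp:// and tls:// addresses. This is just a sketch, not a proper benchmark:

import asyncio
import time

async def main(n=10_000):
    async def echo(reader, writer):
        # Echo whatever arrives back to the client until EOF
        while data := await reader.read(1024):
            writer.write(data)
            await writer.drain()

    server = await asyncio.start_server(echo, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)

    start = time.perf_counter()
    for _ in range(n):
        writer.write(b"x" * 100)
        await writer.drain()
        await reader.readexactly(100)
    elapsed = time.perf_counter() - start
    print(f"{n / elapsed:.0f} round trips/s over plain asyncio TCP")

    writer.close()
    server.close()

asyncio.run(main())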

NVIDIA devs @quasiben and @jakirkham report that UCX doesn't have this problem (although that may be because it's hard to profile).

jakirkham (Member) commented

> The computation actually didn't finish, I suspect because of the worker TaskState issue in 2020.12.0, but I can't be sure

Do you have a link for this issue?

mrocklin (Member, Author) commented

#4360

mrocklin (Member, Author) commented

I should note that I'm using Coiled here (sorry for the indirect advertisement). Doing this at scale requires going above the free tier limits. If anyone wants to help with this (or even just play around), let me know and I'll add you to a team with significantly increased privileges.

Of course, this should also work just fine on other systems like custom-maintained Kubernetes clusters.
