I've noticed that `count_call_alleles` and some methods using cohorts create unnecessary dependencies between chunks in the variants dimension. For example, the current task graph for `observed_heterozygosity` on a dataset with 10 chunks in the variants dimension looks like this:
[Task graph image]
In `count_call_alleles` this is a result of using `da.empty` to indicate the number of alleles to a gufunc. In `observed_heterozygosity` (and also `diversity`) it is caused by forcing the `sample_cohort` array to be a dask array, which achieves little because we immediately call `compute` on that array to get the number of cohorts. Replacing both of these with numpy arrays results in the following equivalent task graph:
[Task graph image]
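The `sample_cohort` round-trip can be seen in a minimal sketch (the cohort values and variable names here are illustrative, not sgkit's actual internals): wrapping the array in dask only to immediately compute it gains nothing.

```python
import numpy as np
import dask.array as da

# Hypothetical cohort assignment per sample (three cohorts: 0, 1, 2).
sample_cohort = np.array([0, 0, 1, 1, 2])

# Current pattern: wrap in a dask array, then immediately compute
# just to learn the number of cohorts -- an eager round-trip.
sc_da = da.asarray(sample_cohort)
n_cohorts_via_dask = int(sc_da.max().compute()) + 1

# NumPy pattern: the count is available directly, and the small
# array can later be embedded into each task that needs it.
n_cohorts = int(sample_cohort.max()) + 1

assert n_cohorts_via_dask == n_cohorts == 3
```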
My understanding is that the second task graph should be more efficient to schedule at larger scales (can any dask experts confirm?). Is there any reason not to make such a change? I suppose it makes the use of the `sample_cohort` array slightly more opaque.
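One way to see the graph difference is to pass the same small array to `map_blocks` once as a dask array and once as a numpy array: the numpy version is embedded directly in each task, while the dask version becomes a separate graph node that every output chunk depends on. This is a toy sketch, not the actual sgkit code:

```python
import numpy as np
import dask.array as da

calls = da.ones((10, 5), chunks=(2, 5))  # 5 chunks along the "variants" axis
cohorts = np.array([0, 0, 1, 1, 2])      # hypothetical per-sample cohorts

# Dask operand: appears as its own node(s) in the graph, shared by
# every output chunk.
y_dask = da.map_blocks(lambda b, c: b + c, calls, da.asarray(cohorts), dtype=float)

# NumPy operand: embedded directly into each task, so the chunks stay
# independent in the graph.
y_numpy = da.map_blocks(lambda b, c: b + c, calls, cohorts, dtype=float)

# The numpy version produces a strictly smaller task graph...
assert len(dict(y_numpy.__dask_graph__())) < len(dict(y_dask.__dask_graph__()))
# ...while computing the same result.
assert np.array_equal(y_numpy.compute(), y_dask.compute())
```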