I've noticed that `count_call_alleles` and some methods using cohorts create unnecessary dependencies between chunks in the variants dimension. For example, the current task graph for `observed_heterozygosity` on a dataset with 10 chunks in the variants dimension looks like this:
[Task graph image]
In `count_call_alleles` this is a result of using `da.empty` to indicate the number of alleles to a gufunc. In `observed_heterozygosity` (and also `diversity`) it is caused by forcing the `sample_cohort` array to be a dask array, which achieves little because we immediately call `compute` on that array to get the number of cohorts. Replacing both of these with numpy arrays results in the following equivalent task graph:
[Task graph image]
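The `sample_cohort` round-trip can be seen in a minimal sketch (the cohort values and variable names here are illustrative, not sgkit's actual internals): wrapping the array in dask only to immediately compute it gains nothing.

```python
import numpy as np
import dask.array as da

# Hypothetical cohort assignment per sample (three cohorts: 0, 1, 2).
sample_cohort = np.array([0, 0, 1, 1, 2])

# Current pattern: wrap in a dask array, then immediately compute
# just to learn the number of cohorts -- an eager round-trip.
sc_da = da.asarray(sample_cohort)
n_cohorts_via_dask = int(sc_da.max().compute()) + 1

# NumPy pattern: the count is available directly, and the small
# array can later be embedded into each task that needs it.
n_cohorts = int(sample_cohort.max()) + 1

assert n_cohorts_via_dask == n_cohorts == 3
```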
My understanding is that the second task graph should be more efficient to schedule at larger scales (can any dask experts confirm?). Is there any reason not to make such a change? I suppose it makes the use of the `sample_cohort` array slightly more opaque.
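One way to see the graph difference is to pass the same small array to `map_blocks` once as a dask array and once as a numpy array: the numpy version is embedded directly in each task, while the dask version becomes a separate graph node that every output chunk depends on. This is a toy sketch, not the actual sgkit code:

```python
import numpy as np
import dask.array as da

calls = da.ones((10, 5), chunks=(2, 5))  # 5 chunks along the "variants" axis
cohorts = np.array([0, 0, 1, 1, 2])      # hypothetical per-sample cohorts

# Dask operand: appears as its own node(s) in the graph, shared by
# every output chunk.
y_dask = da.map_blocks(lambda b, c: b + c, calls, da.asarray(cohorts), dtype=float)

# NumPy operand: embedded directly into each task, so the chunks stay
# independent in the graph.
y_numpy = da.map_blocks(lambda b, c: b + c, calls, cohorts, dtype=float)

# The numpy version produces a strictly smaller task graph...
assert len(dict(y_numpy.__dask_graph__())) < len(dict(y_dask.__dask_graph__()))
# ...while computing the same result.
assert np.array_equal(y_numpy.compute(), y_dask.compute())
```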