Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Optimise scheduler.get_comm_cost set difference
Computing set(A).difference(B) is O(max(len(A), len(B))). When estimating the communication cost of a task's dependencies it is usual that the number of dependencies (A) will be small but the number of tasks the worker has (B) is large. In this case it is better to manually construct the set difference by iterating over A and checking if each element is in B. Performing a left.merge(right, on="key", how="inner) of a distributed dataframe with eight workers with chunks_per_worker * rows_per_chunk held constant, I observe the following timings using the tcp communication protocol: | chunks_per_worker | rows_per_chunk | before | after | |-------------------|----------------|--------|-------| | 100 | 50000 | 75s | 48s | | 10 | 500000 | ~9s | ~9s | | 1 | 5000000 | ~8s | ~8s |
- Loading branch information