Consider implementing a data distribution sensitive shuffling algorithm for dataloader libraries #1146
A specific comment from the slack thread, recapitulated here because it is useful:
The thinking behind picking the chunk size: even when you don't actively do any chunking, you can view the entire census as a collection of sections/chunks where each chunk is of length 1 row. But that would be inefficient in terms of I/O during reads from the storage medium. So think of each chunk as having length > 1 (for I/O efficiency) and determine that chunk size based on the memory budget, i.e. the number of actual rows that can be loaded in memory. Thus the chunk size follows from dividing the memory budget by the number of chunks.
A calculation of the expected number of datasets represented:
Here is a histogram of the distribution of observations across dataset-ids in the census (computed by @pablo-gar):
Implemented in #1188
There is potentially an opportunity to implement a shuffling algorithm that closely approximates a random sample shuffle. The following is a recapitulation of the slack discussion.
The high-level description of the algorithm is: gather a number of random chunks of `chunk_size` rows each across the data, concatenate them, and shuffle the result. Let's call this method `scatter_gather_shuffle`. This method attempts to strike a balance between randomness and good I/O performance (hence reading chunks rather than individual data points, which would be less efficient in terms of I/O).

The following algorithm is stated in a way such that, if accepted, it could potentially be packaged in a separate tensor library (like pytorch) or a dataloader library and thus be generally useful to many types of ML training workloads.
The following algorithm assumes that the data is uniformly distributed across some buckets. However, it is possible for the algorithm to take in a real distribution (a probability mass function) or an analytic distribution (e.g. exponential, Poisson, etc.) and perform the expectation calculations based on the input data distribution. Also, the description of the algorithm is set in the context of `cellxgene_census`, where the data is bucketed by `dataset_id`; however, the algorithm is generally applicable to any dataset that naturally falls into buckets.

The central problem is to determine the number of random chunks to gather across the data and the size of each such chunk to then concatenate and shuffle, so as to yield a sequence of data points that is satisfactorily random.
"Satisfactorily Random" is something that the user must define here. One definition of "satisfactorily random" that is simple to encode and generally useful is if the user knows how data points are bucketed, then a "satisfactorily random" sequence of K data_points would represent some desired fraction of the buckets.
To put it more concretely, the census has 60 million observations (data_points) distributed across 567 datasets (the buckets). If K observations are drawn at random from the entire corpus, what is the expected number of datasets covering these K random points? I think we could work that out analytically:
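One way to get the analytic answer (a sketch that treats the K draws as independent and the observations as spread uniformly over the D = 567 datasets, so it is an approximation of the real census distribution): a particular dataset is missed by a single random draw with probability (D - 1)/D, hence by all K draws with probability ((D - 1)/D)^K, giving

$$\mathbb{E}[Y] = D\left(1 - \left(\frac{D-1}{D}\right)^{K}\right)$$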
Thus for K = 500, the expected number of datasets represented is E[Y] = 567 * (1 - (566/567)**500) ≈ 332. For K = 2000, E[Y] ≈ 550 (almost all datasets); this is how I arrived at the 2000 random chunks. Since we want good I/O efficiency, dividing the memory budget (specified in number of rows) by the number of chunks gives us the chunk size: 128_000 / 2000 = 64.
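A quick numeric check of those figures, plus the analogous calculation driven by a per-dataset probability mass function as mentioned above (a sketch; the dataset count, memory budget, and chunk count are taken from this discussion, and the helper names are illustrative):

```python
import math

D = 567  # datasets (buckets) in the census


def expected_datasets_uniform(k: int, d: int = D) -> float:
    """Expected number of distinct datasets among k uniform random draws."""
    return d * (1 - ((d - 1) / d) ** k)


def expected_datasets_pmf(k: int, pmf: list[float]) -> float:
    """Same expectation, driven by a per-dataset probability mass function."""
    return sum(1 - (1 - p) ** k for p in pmf)


def smallest_k_for_coverage(frac: float, d: int = D) -> int:
    """Smallest k whose expected coverage reaches `frac` of the d datasets."""
    return math.ceil(math.log(1 - frac) / math.log((d - 1) / d))


print(round(expected_datasets_uniform(500)))    # 332
print(round(expected_datasets_uniform(2000)))   # 550
print(smallest_k_for_coverage(550 / 567))       # 1987 (roughly the 2000 chunks above)

# Chunk size from the memory budget: 128_000 rows across 2_000 chunks.
print(128_000 // 2_000)                          # 64 rows per chunk
```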
Pseudocode for the algorithm:
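A minimal sketch of what `scatter_gather_shuffle` could look like, based on the description above; the `read_rows` callable, the parameter names, and the buffering strategy are illustrative assumptions, not the implementation that landed in #1188:

```python
import numpy as np


def scatter_gather_shuffle(n_rows, chunk_size, n_chunks, read_rows, seed=None):
    """Approximate a full random shuffle while keeping I/O chunk-sized.

    `read_rows(start, count)` is assumed to return `count` consecutive rows
    starting at row `start` (e.g. a slice read from the underlying array).
    The row space [0, n_rows) is viewed as consecutive chunks of `chunk_size`
    rows; chunks are visited in random order ("scatter"), `n_chunks` of them
    are gathered into an in-memory buffer (~ n_chunks * chunk_size rows, the
    memory budget), the buffer is shuffled, and its rows are yielded.
    """
    rng = np.random.default_rng(seed)

    # All chunk start offsets, visited in random order.
    chunk_starts = np.arange(0, n_rows, chunk_size)
    rng.shuffle(chunk_starts)

    # Gather n_chunks chunks at a time, concatenate, shuffle in memory, emit.
    for i in range(0, len(chunk_starts), n_chunks):
        buffer = np.concatenate(
            [read_rows(start, min(chunk_size, n_rows - start))
             for start in chunk_starts[i : i + n_chunks]]
        )
        rng.shuffle(buffer)
        yield from buffer
```

A dataloader could wrap such a generator (e.g. in a `torch.utils.data.IterableDataset`) so that training streams rows in this approximately shuffled order while every read from storage stays chunk-sized; with the figures above, `n_chunks=2000` and `chunk_size=64` keeps about 128,000 rows in memory at a time.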