[python] Shuffle multiple SOMA chunks #1103
Conversation
Adds a `shuffle_chunk_count` parameter. Improves randomness of shuffling, while allowing for explicit tuning of memory usage vs I/O performance.
```python
            for shuffle_chunks in np.array_split(obs_joinids_chunked, splits)
        )
    else:
        self.obs_joinids_chunks_iter = iter(obs_joinids_chunked)
```
Should factor this out into a method.
If I am reading this correctly: if `shuffle_chunk_count = 2`, then this block of code would split the list of globally shuffled chunks into 2 lists of chunks, where each such list contains, say, N/2 chunks. It would then concatenate these N/2 chunks in memory and shuffle the concatenated ndarray in memory. This might exceed the memory budget and cause an OOM crash.

I think the `splits` calculation in line 132 should be removed, and `shuffle_chunk_count` should be used in place of `splits` in line 135. So if `shuffle_chunk_count = 2`, it would take 2 chunks at a time, concatenate them, and then shuffle the concatenated ndarray, which would fit into the memory budget.
@atolopko-czi - Please disregard the above comment; I confused the math. Assuming there are 16 chunks and `shuffle_chunk_count = 2`, you would need `splits = 16/2 = 8` to split the 16 chunks into 8 partitions of 2 chunks each.

This makes sense now. I misunderstood the `numpy.array_split` interface.
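The corrected math above can be checked with a small sketch; the chunk contents below are hypothetical placeholders, only the `splits` arithmetic mirrors the discussion:

```python
import numpy as np

# Hypothetical data: 16 chunks of 4 joinids each (values are placeholders).
obs_joinids_chunked = [np.arange(i * 4, (i + 1) * 4) for i in range(16)]
shuffle_chunk_count = 2

# splits = 16 / 2 = 8 partitions of 2 chunks each, as worked out above.
splits = len(obs_joinids_chunked) // shuffle_chunk_count
partitions = np.array_split(obs_joinids_chunked, splits)

print(len(partitions))            # 8 partitions
print(len(partitions[0]))         # each partition holds 2 chunks
```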
yeah, probably worth refactoring this to improve clarity; i'm pushing some variable renames now fwiw
```python
# same elements
assert set(soma_joinids) == set(range(16))
# not ordered! (...with a `1/16!` probability of being ordered)
assert soma_joinids != list(range(16))
```
TODO: Could assert that the first and second half of the `soma_joinids` are each formed from exactly 2 quarters of the full data set (without knowing which ones specifically, since it's random).
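That TODO could be sketched roughly as follows; the sample `soma_joinids` value is a hypothetical shuffled output, not data from the actual test:

```python
# Quarters of the full data set range(16): {0..3}, {4..7}, {8..11}, {12..15}.
quarters = [set(range(i * 4, (i + 1) * 4)) for i in range(4)]

def quarters_contained(half):
    """Return the quarters fully contained in one half of the output."""
    return [q for q in quarters if q <= set(half)]

# Hypothetical shuffled output where each half is built from 2 whole quarters.
soma_joinids = [4, 6, 5, 7, 1, 0, 3, 2, 12, 15, 13, 14, 9, 8, 11, 10]
first, second = soma_joinids[:8], soma_joinids[8:]

# The suggested assertion: each half is exactly 2 quarters, whichever they are.
assert len(quarters_contained(first)) == 2
assert len(quarters_contained(second)) == 2
```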
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main    #1103      +/-   ##
==========================================
+ Coverage   91.12%   91.15%   +0.02%
==========================================
  Files          77       77
  Lines        5902     5922      +20
==========================================
+ Hits         5378     5398      +20
  Misses        524      524
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
For informational purposes, a description of a general algorithm is captured in this github issue:
…erberg/cell-census into pytorch-shuffle-multiple-chunks
```diff
@@ -570,6 +570,37 @@ def test__shuffle(soma_experiment: Experiment) -> None:
     assert X_values == soma_joinids


 # noinspection PyTestParametrized,DuplicatedCode
 @pytest.mark.parametrize("obs_range,var_range,X_value_gen", [(16, 1, pytorch_seq_x_value_gen)])
```
Recommend making `soma_chunk_size`, `shuffle_chunk_count`, and `batch_size` parameters in the test. This test will likely be extended in a general way, and parameterizing it makes it clear to the reader. Currently, the `batch_size` argument is taking the default value, which is `1`, but there is some tricky logic in the code to deal with `batch_size > 1`.

Simply parameterizing the test with `"batch_size,soma_chunk_size,shuffle_chunk_count", [(1, 2, 4)]` would make things clearer.
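The suggested parametrization might look like this sketch; the test name and body here are placeholders, not the actual test in the PR:

```python
import pytest

# Placeholder sketch of the suggested parametrization; the real test would
# build the DataPipe from these values and assert on the resulting batches.
@pytest.mark.parametrize(
    "batch_size,soma_chunk_size,shuffle_chunk_count",
    [(1, 2, 4)],
)
def test__shuffle_chunks(batch_size, soma_chunk_size, shuffle_chunk_count):
    # Stand-in assertion so the sketch runs; the PR's real assertions
    # would check batch contents instead.
    assert batch_size * soma_chunk_size * shuffle_chunk_count == 8
```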
```python
# If shuffle_chunk_count is defined, each batch should contain elements from different chunks
batches = [soma_joinids[i : i + 4] for i in range(0, len(all_rows), 4)]
assert any(max(batch) - min(batch) > 3 for batch in batches)
```
I would recommend against assertions like these when randomness is part of the function. Testing randomness is a pretty complex topic, and I think these types of assertions make the test flaky (because of randomness) and don't give the reader of the test much information about what is going on.

Instead of lines 599-601, I might recommend accessing `self.soma_chunk_iter` of the `_ObsAndXIterator` object and checking:

1. `assert len(soma_chunk_iter) == obs_range // (soma_chunk_size * shuffle_chunk_count)`
2. Each element of `soma_chunk_iter` is of size `soma_chunk_size * shuffle_chunk_count`, except possibly the last element
3. Nice to have: if `batch_size > 1` and `batch_size` does not evenly divide `soma_chunk_size`, check that a `batch_size`-length slice of `soma_joinids` or `X_values` in lines 588-589 correctly crosses the boundary of `soma_chunk_iter`

(3) is probably not necessary right now, but I think (1) and (2) test the crux of the algorithm, and we can assume randomizing the concatenated chunks just works.

I agree that this doesn't test how random the scatter-gather algorithm is, and that is definitely worth testing, but I think it involves more thought and should be a separate test altogether. IMHO, tests of randomness are more like stress tests than unit tests; they require quite a large amount of data and will probably be slower than what is acceptable for unit tests.
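Suggestions (1) and (2) above can be sketched as follows; the grouped chunks here are simulated stand-ins, since the actual `soma_chunk_iter` attribute access on `_ObsAndXIterator` may differ:

```python
# Hypothetical test parameters matching the discussion above.
obs_range, soma_chunk_size, shuffle_chunk_count = 16, 2, 4
group_size = soma_chunk_size * shuffle_chunk_count  # 8 joinids per grouped chunk

# Simulated stand-in for the grouped chunks the iterator would yield:
# 16 joinids grouped into chunks of 8.
soma_chunks = [list(range(i, i + group_size)) for i in range(0, obs_range, group_size)]

# (1) Number of grouped chunks.
assert len(soma_chunks) == obs_range // group_size

# (2) Each grouped chunk has size soma_chunk_size * shuffle_chunk_count,
#     except possibly the last.
assert all(len(c) == group_size for c in soma_chunks[:-1])
assert len(soma_chunks[-1]) <= group_size
```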
Per a synchronous conversation with @ebezzi, we decided to scrap the test as it needs more thought. We will make a ticket to write a better test for it, but for the sake of expediency, we want to get this merged and get some of our first users to use it. Anecdotal evidence suggests that the
LGTM
api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py (comment outdated; resolved)
```python
def list_split(arr_list: List[Any], sublist_len: int) -> List[List[Any]]:
    """Splits a python list into a list of sublists where each sublist is of size `sublist_len`."""
```
I believe this is the same as `itertools.batched`, but that is only available from Python 3.12+. Do you do comments for "todo once minimum python is 3.12"?
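A minimal sketch of a `list_split` matching the docstring above, alongside the `itertools.batched` equivalent noted in the review (the implementation details here are an assumption, not the PR's actual code):

```python
from typing import Any, List

def list_split(arr_list: List[Any], sublist_len: int) -> List[List[Any]]:
    """Splits a python list into sublists of size `sublist_len` (last may be shorter)."""
    return [arr_list[i : i + sublist_len] for i in range(0, len(arr_list), sublist_len)]

print(list_split(list(range(7)), 3))  # [[0, 1, 2], [3, 4, 5], [6]]

# TODO once minimum python is 3.12, this could become:
# from itertools import batched
# [list(b) for b in batched(range(7), 3)]  # same result
```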
Let's add it.
…ml/pytorch.py Co-authored-by: Isaac Virshup <ivirshup@gmail.com>