
[python] Shuffle multiple SOMA chunks #1103

Merged · 9 commits · Jun 5, 2024

Conversation

atolopko-czi (Collaborator) commented:

Adds a `shuffle_chunk_count` parameter.

Improves randomness of shuffling, while allowing for explicit tuning of memory usage vs I/O performance.
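
A minimal sketch of the grouping-and-shuffle pattern this parameter enables, assuming equally sized joinid chunks; the helper name and chunk layout below are illustrative, not the library implementation:

import numpy as np

def shuffled_chunk_groups(obs_joinids_chunked, shuffle_chunk_count, rng):
    # Group `shuffle_chunk_count` chunks at a time, concatenate each group,
    # and shuffle only within the group, bounding the in-memory shuffle size.
    splits = max(1, len(obs_joinids_chunked) // shuffle_chunk_count)
    for shuffle_chunks in np.array_split(np.asarray(obs_joinids_chunked), splits):
        grouped = np.concatenate(shuffle_chunks)
        rng.shuffle(grouped)
        yield grouped

rng = np.random.default_rng(0)
chunks = [np.arange(i * 4, (i + 1) * 4) for i in range(16)]  # 16 chunks of 4 joinids
groups = list(shuffled_chunk_groups(chunks, shuffle_chunk_count=2, rng=rng))
assert len(groups) == 8 and all(len(g) == 8 for g in groups)
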
for shuffle_chunks in np.array_split(obs_joinids_chunked, splits)
)
else:
self.obs_joinids_chunks_iter = iter(obs_joinids_chunked)
Collaborator Author:
should factor this out into a method

Contributor:
If I am reading this correctly, if shuffle_chunk_count = 2, then this block of code would split the list of globally shuffled chunks into 2 lists of chunks, where each such list contains, say, N/2 chunks. It would then concatenate these N/2 chunks in memory and then shuffle the concatenated ndarray in memory.

This might exceed the memory budget and cause an OOM crash.

I think the splits calculation in line 132 should be removed, and shuffle_chunk_count should be used in place of splits in line 135.

So if shuffle_chunk_count = 2, it would take 2 chunks at a time, concatenate them, and then shuffle the concatenated ndarray, which would fit into the memory budget.

@prathapsridharan (Contributor), Apr 17, 2024:

@atolopko-czi - Please disregard the above comment. I confused the math. Assuming there are 16 chunks and shuffle_chunk_count = 2, you would need splits = 16/2 = 8 to split the 16 chunks into 8 partitions of 2 chunks each.

This makes sense now. I misunderstood the numpy.array_split interface.
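
For concreteness, a quick illustration of that math with numpy (the values are the ones from the comment above, not taken from the test suite):

import numpy as np

n_chunks, shuffle_chunk_count = 16, 2
splits = n_chunks // shuffle_chunk_count           # 16 // 2 == 8 partitions
chunks = np.arange(n_chunks).reshape(n_chunks, 1)  # stand-in for 16 joinid chunks
partitions = np.array_split(chunks, splits)
assert len(partitions) == 8
assert all(len(p) == shuffle_chunk_count for p in partitions)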

Collaborator Author:
Yeah, probably worth refactoring this to improve clarity; I'm pushing some variable renames now, FWIW.

# same elements
assert set(soma_joinids) == set(range(16))
# not ordered! (...with a `1/16!` probability of being ordered)
assert soma_joinids != list(range(16))
Collaborator Author:
TODO: Could assert that the first and second half of the soma_joinids are each formed from exactly 2 quarters of the full data set (without knowing which ones specifically, since it's random).
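
A rough sketch of what that assertion could look like for this 16-element case, using the soma_joinids list collected by the test above (the quarter boundaries are assumed from the sequential joinid layout):

# Each half of the shuffled output should be the union of exactly 2 of the 4 quarters.
quarters = [set(range(i, i + 4)) for i in range(0, 16, 4)]
first_half, second_half = set(soma_joinids[:8]), set(soma_joinids[8:])
for half in (first_half, second_half):
    matched = [q for q in quarters if q <= half]
    assert len(matched) == 2 and set().union(*matched) == half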


codecov bot commented Apr 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.15%. Comparing base (c18c1a9) to head (e3dd3b1).
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1103      +/-   ##
==========================================
+ Coverage   91.12%   91.15%   +0.02%     
==========================================
  Files          77       77              
  Lines        5902     5922      +20     
==========================================
+ Hits         5378     5398      +20     
  Misses        524      524              
Flag: unittests · Coverage: 91.15% <100.00%> (+0.02%) ⬆️



@prathapsridharan (Contributor):

For informational purposes, a description of a general algorithm is captured in GitHub issue #1146.

@ebezzi changed the title from "Shuffle multiple SOMA chunks" to "[python] Shuffle multiple SOMA chunks" on Jun 3, 2024
@ebezzi marked this pull request as ready for review on June 3, 2024, 20:53
@@ -570,6 +570,37 @@ def test__shuffle(soma_experiment: Experiment) -> None:
assert X_values == soma_joinids


# noinspection PyTestParametrized,DuplicatedCode
@pytest.mark.parametrize("obs_range,var_range,X_value_gen", [(16, 1, pytorch_seq_x_value_gen)])
Contributor:

Recommend making soma_chunk_size, shuffle_chunk_count and batch_size parameters of the test. This test will likely be extended in a general way, and parameterizing it makes that clear to the reader.

Currently, the batch_size argument takes its default value of 1, but there is some tricky logic in the code to deal with batch_size > 1.

Simply parameterizing the test with `"batch_size,soma_chunk_size,shuffle_chunk_count", [(1, 2, 4)]` would make things clearer.

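A minimal sketch of that parametrization, reusing the soma_experiment fixture and pytorch_seq_x_value_gen generator already used in this test module; the test name and body here are illustrative only:

import pytest

@pytest.mark.parametrize("obs_range,var_range,X_value_gen", [(16, 1, pytorch_seq_x_value_gen)])
@pytest.mark.parametrize("batch_size,soma_chunk_size,shuffle_chunk_count", [(1, 2, 4)])
def test__shuffle_multiple_chunks(
    soma_experiment: Experiment, batch_size: int, soma_chunk_size: int, shuffle_chunk_count: int
) -> None:
    ...  # construct the DataPipe with these parameters and assert on the resulting batches
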
# If shuffle_chunk_count is defined, each batch should contain elements from different chunks
batches = [soma_joinids[i : i + 4] for i in range(0, len(all_rows), 4)]
assert any(max(batch) - min(batch) > 3 for batch in batches)

Contributor:

I would recommend against assertions like these when randomness is part of the function. Testing randomness is a pretty complex topic, and I think these types of assertions make the test flaky (because of the randomness) and don't give the reader of the test much information about what is going on.

Instead of lines 599-601, I might recommend accessing self.soma_chunk_iter of the _ObsAndXIterator object and checking:

  1. assert len(soma_chunk_iter) == obs_range // (soma_chunk_size * shuffle_chunk_count)
  2. Each element of soma_chunk_iter is of size (soma_chunk_size * shuffle_chunk_count), except possibly the last element
  3. Nice to have: if batch_size > 1 and batch_size does not evenly divide soma_chunk_size, then check that a batch_size-length run of soma_joinids or X_values in lines 588-589 correctly crosses the boundary of soma_chunk_iter

(3) is probably not necessary right now, but I think (1) and (2) test the crux of the algorithm and assume that randomizing the concatenated chunks just works. (A rough sketch of (1) and (2) is given after this comment.)

I agree that this doesn't test how random the scatter-gather algorithm is, and that is definitely worth testing, but I think it involves more thought and should be a separate test altogether.

IMHO, tests of randomness are more like stress tests than unit tests; they require quite a large amount of data and will probably be slower than what is acceptable for unit tests.
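
A rough sketch of checks (1) and (2); exp_data_pipe and the _shuffle_chunk_groups attribute below are assumed names for the iterator under test, not necessarily the real ones:

obs_range, soma_chunk_size, shuffle_chunk_count = 16, 2, 4  # illustrative values
group_size = soma_chunk_size * shuffle_chunk_count

chunk_groups = list(exp_data_pipe._shuffle_chunk_groups)  # assumed attribute name
# (1) expected number of chunk groups
assert len(chunk_groups) == obs_range // group_size
# (2) every group is full-sized, except possibly the last
assert all(len(g) == group_size for g in chunk_groups[:-1])
assert len(chunk_groups[-1]) <= group_size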

@prathapsridharan (Contributor):

Per a synchronous conversation with @ebezzi, we decided to scrap the test, as it needs more thought. We will make a ticket to write a better test for it, but for the sake of expediency we want to get this merged and have some of our first users try it. Anecdotal evidence suggests that the scatter-gather-shuffle algorithm is performant and gives good randomness.

@prathapsridharan (Contributor) left a comment:

LGTM

Comment on lines 194 to 195
def list_split(arr_list: List[Any], sublist_len: int) -> List[List[Any]]:
"""Splits a python list into a list of sublists where each sublist is of size `sublist_len`."""
Collaborator:

I believe this is the same as itertools.batched, but that is only available from Python 3.12+.

Do you do comments for "TODO once minimum Python is 3.12"?
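
For reference, a sketch of the Python 3.12+ equivalent; itertools.batched yields tuples rather than lists, so a thin wrapper would still be needed to match list_split exactly:

from itertools import batched  # Python 3.12+ only

assert [list(b) for b in batched(range(7), 3)] == [[0, 1, 2], [3, 4, 5], [6]]
# Same splitting behaviour as list_split(list(range(7)), 3), modulo tuples vs lists.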

Member:

Let's add it.

ebezzi and others added 2 commits June 5, 2024 14:02
…ml/pytorch.py

Co-authored-by: Isaac Virshup <ivirshup@gmail.com>
@ebezzi merged commit 5fc72e2 into main on Jun 5, 2024
15 checks passed
@ebezzi deleted the pytorch-shuffle-multiple-chunks branch on June 5, 2024 at 23:20