[python] Shuffle multiple SOMA chunks #1103
Conversation
Adds a `shuffle_chunk_count` parameter. Improves randomness of shuffling, while allowing for explicit tuning of memory usage vs I/O performance.
```python
            for shuffle_chunks in np.array_split(obs_joinids_chunked, splits)
        )
    else:
        self.obs_joinids_chunks_iter = iter(obs_joinids_chunked)
```
Should factor this out into a method.
If I am reading this correctly: if `shuffle_chunk_count = 2`, then this block of code would split the list of globally shuffled chunks into 2 lists of chunks, where each such list contains, say, N/2 chunks. It would then concatenate these N/2 chunks in memory and shuffle the concatenated ndarray in memory. This might exceed the memory budget and cause an OOM crash.

I think the `splits` calculation in line 132 should be removed, and `shuffle_chunk_count` should be used in place of `splits` in line 135. So if `shuffle_chunk_count = 2`, it would take 2 chunks at a time, concatenate them, and then shuffle the concatenated ndarray, which would fit into the memory budget.
@atolopko-czi - Please disregard the above comment; I confused the math. Assuming there are 16 chunks and `shuffle_chunk_count = 2`, you would need `splits = 16/2 = 8` to split the 16 chunks into 8 partitions of 2 chunks each.

This makes sense now. I misunderstood the `numpy.array_split` interface.
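The corrected math above can be checked with a small sketch; the chunk contents below are hypothetical placeholders, only the `splits` arithmetic mirrors the discussion:

```python
import numpy as np

# Hypothetical data: 16 chunks of 4 joinids each (values are placeholders).
obs_joinids_chunked = [np.arange(i * 4, (i + 1) * 4) for i in range(16)]
shuffle_chunk_count = 2

# splits = 16 / 2 = 8 partitions of 2 chunks each, as worked out above.
splits = len(obs_joinids_chunked) // shuffle_chunk_count
partitions = np.array_split(obs_joinids_chunked, splits)

print(len(partitions))            # 8 partitions
print(len(partitions[0]))         # each partition holds 2 chunks
```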
yeah, probably worth refactoring this to improve clarity; i'm pushing some variable renames now fwiw
```python
# same elements
assert set(soma_joinids) == set(range(16))
# not ordered! (...with a `1/16!` probability of being ordered)
assert soma_joinids != list(range(16))
```
TODO: Could assert that the first and second half of the `soma_joinids` are each formed from exactly 2 quarters of the full data set (without knowing which ones specifically, since it's random).
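That TODO could be sketched roughly as follows; the sample `soma_joinids` value is a hypothetical shuffled output, not data from the actual test:

```python
# Quarters of the full data set range(16): {0..3}, {4..7}, {8..11}, {12..15}.
quarters = [set(range(i * 4, (i + 1) * 4)) for i in range(4)]

def quarters_contained(half):
    """Return the quarters fully contained in one half of the output."""
    return [q for q in quarters if q <= set(half)]

# Hypothetical shuffled output where each half is built from 2 whole quarters.
soma_joinids = [4, 6, 5, 7, 1, 0, 3, 2, 12, 15, 13, 14, 9, 8, 11, 10]
first, second = soma_joinids[:8], soma_joinids[8:]

# The suggested assertion: each half is exactly 2 quarters, whichever they are.
assert len(quarters_contained(first)) == 2
assert len(quarters_contained(second)) == 2
```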
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main    #1103      +/-   ##
==========================================
+ Coverage   91.12%   91.15%   +0.02%
==========================================
  Files          77       77
  Lines        5902     5922      +20
==========================================
+ Hits         5378     5398      +20
  Misses        524      524
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
For informational purposes, a description of a general algorithm is captured in this github issue:
…erberg/cell-census into pytorch-shuffle-multiple-chunks
```diff
@@ -570,6 +570,37 @@ def test__shuffle(soma_experiment: Experiment) -> None:
     assert X_values == soma_joinids


 # noinspection PyTestParametrized,DuplicatedCode
 @pytest.mark.parametrize("obs_range,var_range,X_value_gen", [(16, 1, pytorch_seq_x_value_gen)])
```
Recommend making `soma_chunk_size`, `shuffle_chunk_count`, and `batch_size` parameters in the test. This test will likely be extended in a general way, and parameterizing it makes it clear to the reader. Currently, the `batch_size` argument is taking the default value, which is `1`, but there is some tricky logic in the code to deal with `batch_size > 1`.

Simply parameterizing the test with `"batch_size,soma_chunk_size,shuffle_chunk_count", [(1, 2, 4)]` would make things clearer.
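The suggested parametrization might look like this sketch; the test name and body here are placeholders, not the actual test in the PR:

```python
import pytest

# Placeholder sketch of the suggested parametrization; the real test would
# build the DataPipe from these values and assert on the resulting batches.
@pytest.mark.parametrize(
    "batch_size,soma_chunk_size,shuffle_chunk_count",
    [(1, 2, 4)],
)
def test__shuffle_chunks(batch_size, soma_chunk_size, shuffle_chunk_count):
    # Stand-in assertion so the sketch runs; the PR's real assertions
    # would check batch contents instead.
    assert batch_size * soma_chunk_size * shuffle_chunk_count == 8
```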
```python
# If shuffle_chunk_count is defined, each batch should contain elements from different chunks
batches = [soma_joinids[i : i + 4] for i in range(0, len(all_rows), 4)]
assert any(max(batch) - min(batch) > 3 for batch in batches)
```
I would recommend against assertions like these when randomness is part of the function. Testing randomness is a pretty complex topic, and I think these types of assertions make the test flaky (because of randomness) and don't give the reader of the test much information about what is going on.

Instead of lines 599-601, I might recommend accessing `self.soma_chunk_iter` of the `_ObsAndXIterator` object and checking:

1. `assert len(soma_chunk_iter) == obs_range // (soma_chunk_size * shuffle_chunk_count)`
2. Each element of `soma_chunk_iter` is of size `soma_chunk_size * shuffle_chunk_count`, except possibly the last element
3. Nice to have: if `batch_size > 1` and `batch_size` does not evenly divide `soma_chunk_size`, check that a `batch_size`-length slice of `soma_joinids` or `X_values` in lines 588-589 correctly crosses the boundary of `soma_chunk_iter`

(3) is probably not necessary right now, but I think (1) and (2) test the crux of the algorithm, and we can assume randomizing the concatenated chunks just works.

I agree that this doesn't test how random the scatter-gather algorithm is, and that is definitely worth testing, but I think it involves more thought and should be a separate test altogether. IMHO, tests of randomness are more like stress tests than unit tests; they require quite a large amount of data and will probably be slower than what is acceptable for unit tests.
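Suggestions (1) and (2) above can be sketched as follows; the grouped chunks here are simulated stand-ins, since the actual `soma_chunk_iter` attribute access on `_ObsAndXIterator` may differ:

```python
# Hypothetical test parameters matching the discussion above.
obs_range, soma_chunk_size, shuffle_chunk_count = 16, 2, 4
group_size = soma_chunk_size * shuffle_chunk_count  # 8 joinids per grouped chunk

# Simulated stand-in for the grouped chunks the iterator would yield:
# 16 joinids grouped into chunks of 8.
soma_chunks = [list(range(i, i + group_size)) for i in range(0, obs_range, group_size)]

# (1) Number of grouped chunks.
assert len(soma_chunks) == obs_range // group_size

# (2) Each grouped chunk has size soma_chunk_size * shuffle_chunk_count,
#     except possibly the last.
assert all(len(c) == group_size for c in soma_chunks[:-1])
assert len(soma_chunks[-1]) <= group_size
```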
Per a synchronous conversation with @ebezzi, we decided to scrap the test as it needs more thought. We will make a ticket to write a better test for it, but for the sake of expediency, we want to get this merged and get some of our first users to use it. Anecdotal evidence suggests that the
LGTM
api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py (comment outdated; resolved)
```python
def list_split(arr_list: List[Any], sublist_len: int) -> List[List[Any]]:
    """Splits a python list into a list of sublists where each sublist is of size `sublist_len`."""
```
I believe this is the same as `itertools.batched`, but that is only available from Python 3.12+. Do you do comments for "todo once minimum python is 3.12"?
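A minimal sketch of a `list_split` matching the docstring above, alongside the `itertools.batched` equivalent noted in the review (the implementation details here are an assumption, not the PR's actual code):

```python
from typing import Any, List

def list_split(arr_list: List[Any], sublist_len: int) -> List[List[Any]]:
    """Splits a python list into sublists of size `sublist_len` (last may be shorter)."""
    return [arr_list[i : i + sublist_len] for i in range(0, len(arr_list), sublist_len)]

print(list_split(list(range(7)), 3))  # [[0, 1, 2], [3, 4, 5], [6]]

# TODO once minimum python is 3.12, this could become:
# from itertools import batched
# [list(b) for b in batched(range(7), 3)]  # same result
```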
Let's add it.
…ml/pytorch.py Co-authored-by: Isaac Virshup <ivirshup@gmail.com>