Differential expression performance improvements #234
Conversation
```diff
@@ -78,7 +78,7 @@
 limits:
   column_request_max: 32
   diffexp_cellcount_max: 50000
```
the setting for prod is in single-cell-infra, PR incoming
```python
def diffexp_ttest(adaptor, maskA, maskB, top_n=8, diffexp_lfc_cutoff=0.01, arr="X"):
    """
```
comment moved from diffex_generic
```python
is_sparse = matrix.schema.sparse

if is_sparse:
```
custom parallelism logic, unneeded
```python
def diffexp_ttest_from_mean_var(meanA, varA, nA, meanB, varB, nB, top_n, diffexp_lfc_cutoff):
```
copied from diffex_generic. No changes.
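For context on what a mean/variance-based t-test computes: Welch's t statistic and its degrees of freedom can be derived entirely from per-group summary statistics, so the full matrix never needs a second pass. A hedged sketch of that core calculation (the real `diffexp_ttest_from_mean_var` additionally ranks the top-N genes and applies the log-fold-change cutoff, which is not reproduced here; `welch_ttest_from_mean_var` is a stand-in name):

```python
import numpy as np

def welch_ttest_from_mean_var(meanA, varA, nA, meanB, varB, nB):
    """Welch's t statistic and degrees of freedom from per-group summaries.

    Sketch of the core calculation only; ranking and lfc-cutoff handling
    from the real function are intentionally omitted.
    """
    meanA = np.asarray(meanA, dtype=np.float64)
    varA = np.asarray(varA, dtype=np.float64)
    meanB = np.asarray(meanB, dtype=np.float64)
    varB = np.asarray(varB, dtype=np.float64)

    sA = varA / nA  # per-gene squared standard error, group A
    sB = varB / nB  # per-gene squared standard error, group B
    with np.errstate(divide="ignore", invalid="ignore"):
        t = (meanA - meanB) / np.sqrt(sA + sB)
        # Welch-Satterthwaite approximation for the degrees of freedom
        dof = (sA + sB) ** 2 / (sA ** 2 / (nA - 1) + sB ** 2 / (nB - 1))
    return t, dof
```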
```python
    return result


def mean_var_n(X, X_approximate_distribution=XApproximateDistribution.NORMAL):
```
copied from diffex_generic. No changes.
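The PR's changelog mentions "incremental mean and variance calculation to reduce memory footprint." One standard way to do that over chunked reads is the pairwise-combine update of Chan et al., which merges per-chunk (mean, variance, count) summaries without ever materializing the full matrix. A sketch of the general technique (not necessarily the exact code in `mean_var_n`):

```python
import numpy as np

def combine_mean_var(meanA, varA, nA, meanB, varB, nB):
    """Merge two (mean, population variance, count) summaries into one.

    Pairwise-combine update (Chan et al.): lets a reader accumulate
    mean/variance chunk by chunk. Sketch of the general technique only.
    """
    n = nA + nB
    delta = meanB - meanA
    mean = meanA + delta * (nB / n)
    # M2 = sum of squared deviations from the mean, per summary
    M2 = varA * nA + varB * nB + delta ** 2 * (nA * nB / n)
    return mean, M2 / n, n
```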
server/compute/diffexp_cxg.py (outdated)
```python
    return mean, v

def mean_var_cnt_dense(matrix, _, rows):
    # TODO: why is the list cast still necessary? TileDB is throwing an error
```
AFAIK, the cast to a Python list should not be necessary (and it is moderately expensive). If TDB is really throwing an error, I recommend that we chase that down before merging, as it is potentially a bug of some sort.
Agreed, will add to the task list
Update: this is a bug in recent TileDB-Py versions. In the queue for a fix in TileDB 0.13.2, at which point multi_index should again support either list or ndarray.
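Until that fix lands, the workaround under discussion is simply casting the coordinate ndarray to a Python list before handing it to `multi_index`. Illustrative only: `rows` here stands in for whatever coordinate array is being queried.

```python
import numpy as np

rows = np.array([0, 3, 7], dtype=np.int64)

# Affected TileDB-Py versions reject an ndarray of coordinates in
# multi_index, so cast to a Python list first. The cast is moderately
# expensive for large selections, hence the desire to drop it post-fix.
coords = rows.tolist()
```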
server/dataset/cxg_dataset.py (outdated)
```diff
@@ -23,7 +23,12 @@
 class CxgDataset(Dataset):
     # TODO: The tiledb context parameters should be a configuration option
     tiledb_ctx = tiledb.Ctx(
-        {"sm.tile_cache_size": 8 * 1024 * 1024 * 1024, "sm.num_reader_threads": 32, "vfs.s3.region": "us-east-1"}
+        {
+            "sm.tile_cache_size": 16 * 1024 ** 3,
```
will these changes be accompanied with a change of instance type? The large tile_cache_size and py.init_buffer_bytes are suitable for a high-mem instance, but not for the current 64GB instances. Thought: should these be pulled out into a deployment configuration (eg, set via env vars), or made adaptive (eg, tile cache is X% of RAM on the instance)?
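A sketch of the "deployment configuration or adaptive" idea raised above: consult an environment-variable override first, else size the cache as a fraction of physical RAM. The env var name and the 25% default fraction are assumptions for illustration, and the `os.sysconf` keys are POSIX/Linux-only.

```python
import os

def tile_cache_bytes(fraction=0.25, env_var="CXG_TILE_CACHE_BYTES"):
    """Size the TileDB tile cache from a deployment override or from RAM.

    The env var name and the default fraction are illustrative
    assumptions, not existing configuration in this repo.
    """
    override = os.environ.get(env_var)
    if override is not None:
        return int(override)
    # Physical RAM = page size * number of physical pages (POSIX/Linux)
    phys_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return int(phys_bytes * fraction)
```

The returned value would then feed `"sm.tile_cache_size"` in the `tiledb.Ctx` config dict.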
Yes, there's an infra PR here: https://github.com/chanzuckerberg/single-cell-infra/pull/526
Re: making them dynamic based on the instance - agreed!
server/dataset/cxg_dataset.py (outdated)
```diff
@@ -265,18 +270,12 @@ def get_X_array(self, obs_mask=None, var_mask=None):
     if obs_items == slice(None) and var_items == slice(None):
         data = X[:, :]
```
shouldn't this be an unordered query as well?
Good spot. Though is this even used? TODO: check whether it is.
It is used. Added a commit to simplify the logic, as multi_index[] can provide both query types -- we don't need the branch.
Most of my comments are small nits or could be done in a future PR (eg, better configuration).
Items that should be resolved pre-merge:
- cxg_dataset.py - looks like the X query is missing the unordered param
- diffexp_cxg.py - the addition of the list() cast in mean_var_cnt_dense is either a bug or unnecessary (the comment implies the former, which should be run to ground)
- as Ben noted, the TileDB bug needs to be fixed
Codecov Report
```
@@            Coverage Diff             @@
##             main     #234      +/-   ##
==========================================
- Coverage   76.47%   76.12%   -0.35%
==========================================
  Files          92       90       -2
  Lines        6758     6660      -98
==========================================
- Hits         5168     5070      -98
  Misses       1590     1590
```
I have resolved my own prior comments, but would like additional reviewers for the new modifications.
```python
{
    "sm.tile_cache_size": 8 * 1024 ** 3,
    "py.init_buffer_bytes": 16 * 1024 ** 3,
    "vfs.s3.region": "us-east-1",
```
I was wondering why the region is even required. Our buckets live in us-west-2, so it looks like this is ignored by TileDB.
The idea was just to have an explicit default in the code so we didn't have to rely on a TileDB default. You are right that the current configurations override it to match our current deployment. As our deployment evolves (eg, multi-region?), the config could change.
```diff
@@ -244,7 +240,7 @@ def __remap_indices(self, coord_range, coord_mask, coord_data):
     if coord_mask is None:
         return coord_range, coord_data

-    indices = np.where(coord_mask)[0]
+    indices = coord_mask.nonzero()[0]
```
Out of curiosity, is this more efficient than using np.where?
No, they are literally identical, but the single-argument form of np.where is discouraged in favor of nonzero. See the numpy documentation here: https://numpy.org/doc/stable/reference/generated/numpy.where.html

It says:

> Note: When only condition is provided, this function is a shorthand for np.asarray(condition).nonzero(). Using nonzero directly should be preferred, as it behaves correctly for subclasses. The rest of this documentation covers only the case where all three arguments are provided.
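A quick demonstration of the equivalence (the mask values are arbitrary):

```python
import numpy as np

mask = np.array([False, True, False, True, True])

# Identical results; numpy's docs recommend nonzero() for the
# single-argument case because it behaves correctly for subclasses.
where_idx = np.where(mask)[0]
nonzero_idx = mask.nonzero()[0]
```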
Approving, with a few minor comments!
hosted/requirements.txt (outdated)
```diff
@@ -46,7 +46,7 @@ umap-learn==0.4.6
 sentry-sdk[flask]==0.14.3
 six==1.14.0
 sqlalchemy==1.3.18
-tiledb==0.13.1
+tiledb==0.13.2
```
Stick with 0.13.2, or should we try 0.14.x now? @MDunitz has already found issues with 0.14.x, so maybe hold off if 0.13.2 is well-tested?
Either way, we need to keep Explorer at-or-ahead of Portal's TileDB version, and we should keep this version in sync with server/requirements.txt (in this repo).
```python
if var_items == slice(None):
    densedata += X_col_shift[:]
else:
    densedata += X_col_shift.multi_index[var_items][""]
```
iirc, there was but 1 dataset that used X_col_shift, and we're regenerating CXGs, so presumably we won't have to deal with this moving forward. Even so, are we validating somewhere that the CXG versions we're admitting no longer support X_col_shift? If not, do we want to remain defensive here and fail loudly if X_col_shift is used?
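A sketch of the "fail loudly" option: every name below (`validate_cxg`, the version allow-list, the `has_col_shift` flag) is a hypothetical illustration, not the real cellxgene API or schema versioning.

```python
# Hypothetical allow-list of admitted CXG schema versions (illustration only)
CXG_SUPPORTED_VERSIONS = {"0.2.0", "0.3.0"}

def validate_cxg(version, has_col_shift):
    """Fail loudly at dataset-admission time rather than at query time.

    All names here are assumptions made for illustration.
    """
    if version not in CXG_SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported CXG version: {version}")
    if has_col_shift:
        raise ValueError("X_col_shift is no longer supported; regenerate the CXG")
```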
```diff
@@ -20,31 +18,35 @@ def main():
     parser.add_argument("-vb", "--varB", help="obs variable:value to use for group B")
     parser.add_argument("-t", "--trials", default=1, type=int, help="number of trials")
     parser.add_argument(
-        "-a", "--alg", choices=("default", "generic", "cxg"), default="default", help="algorithm to use"
+        "-a", "--alg", choices=("cxg",), default="cxg", help="algorithm to use"
```
with but one choice, can we remove the option altogether?
* first cut at new binary diffex routes
* lint
* clean up benchmark
* update diffex benchmark
* lower memory footprint by using incremental query for diffex
* update to tiledb 0.13.3
* fix misleading comment
* add additional unit tests for new route
* improve stats
* incremental mean and variance calculation to reduce memory footprint
* LINT!
Reviewers
Functional: @atolopko-czi
Readability: @ebezzi