Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differential expression performance improvements #234

Merged
merged 26 commits into from
May 12, 2022
Merged

Conversation

blrnw3
Copy link
Contributor

@blrnw3 blrnw3 commented Mar 11, 2022

Reviewers

Functional: @atolopko-czi
Readability: @ebezzi


Changes

  • Remove legacy h5ad diffex algo
  • Update to latest tiledb
  • Optimize cxg diffex algo
    • remove parallelism logic (tdb handles this better) and config params for this code
    • tweak tdb queries based on Bruce's research
  • Optimize other tdb queries (the get_X_array calls)
  • Remove col shift logic - legacy
  • Update tdb context config and related tests
  • Update run_diffex script

@@ -78,7 +78,7 @@

limits:
column_request_max: 32
diffexp_cellcount_max: 50000
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the setting for prod is in single-cell-infra, PR incoming



def diffexp_ttest(adaptor, maskA, maskB, top_n=8, diffexp_lfc_cutoff=0.01, arr="X"):
"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment moved from diffex_generic

is_sparse = matrix.schema.sparse

if is_sparse:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

custom parallelism logic, unneeded


def diffexp_ttest_from_mean_var(meanA, varA, nA, meanB, varB, nB, top_n, diffexp_lfc_cutoff):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copied from diffex_generic. No changes

return result


def mean_var_n(X, X_approximate_distribution=XApproximateDistribution.NORMAL):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copied from diffex_generic. No changes

server/dataset/cxg_dataset.py Show resolved Hide resolved
@blrnw3 blrnw3 marked this pull request as ready for review March 14, 2022 17:09

return mean, v
def mean_var_cnt_dense(matrix, _, rows):
# TODO: why is the list cast still necessary? Tdb is throwing an error
Copy link
Contributor

@bkmartinjr bkmartinjr Mar 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the cast to Python list should not be necessary (and it is moderately expensive). If TDB is really throwing an error, I recommend that we chase that down before merging, as it is potentially a bug of some sort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, will add to the task list

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: this is a bug in recent TileDB-Py versions. In the queue for a fix in TileDB 0.13.2, at which point multi_index should again support either list or ndarray.

@@ -23,7 +23,12 @@
class CxgDataset(Dataset):
# TODO: The tiledb context parameters should be a configuration option
tiledb_ctx = tiledb.Ctx(
{"sm.tile_cache_size": 8 * 1024 * 1024 * 1024, "sm.num_reader_threads": 32, "vfs.s3.region": "us-east-1"}
{
"sm.tile_cache_size": 16 * 1024 ** 3,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will these changes be accompanied with a change of instance type? The large tile_cache_size and py.init_buffer_bytes are suitable for a high-mem instance, but not for the current 64GB instances. Thought: should these be pulled out into a deployment configuration (eg, set via env vars), or made adaptive (eg, tile cache is X% of RAM on the instance)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes there's an infra PR here: https://github.com/chanzuckerberg/single-cell-infra/pull/526
Re: making them dynamic based on instance - agreed!

@@ -265,18 +270,12 @@ def get_X_array(self, obs_mask=None, var_mask=None):
if obs_items == slice(None) and var_items == slice(None):
data = X[:, :]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be an unordered query as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot. Though is this even used. TODO: check if it is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used. Added a commit to simplify the logic, as multi_index[] can provide both query types -- we don't need the branch.

Copy link
Contributor

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of my comments are small nits or could be done in future PR (eg, better configuration).

Items that should be resolved pre-merge:

  1. cxg_dataset.py - looks like the X query is missing the unordered param
  2. diffexp_cxg.py - the addition of the list() cast in mean_var_cnt_dense is either a bug or unnecessary (comment implies the former, which should be run to ground).
  3. And as Ben noted, the TDB bug needs to be fixed

@codecov
Copy link

codecov bot commented Mar 28, 2022

Codecov Report

Merging #234 (2c8db9f) into main (f5993b9) will decrease coverage by 0.34%.
The diff coverage is 82.01%.

❗ Current head 2c8db9f differs from pull request most recent head 00e9930. Consider uploading reports for the commit 00e9930 to get more accurate results

@@            Coverage Diff             @@
##             main     #234      +/-   ##
==========================================
- Coverage   76.47%   76.12%   -0.35%     
==========================================
  Files          92       90       -2     
  Lines        6758     6660      -98     
==========================================
- Hits         5168     5070      -98     
  Misses       1590     1590              
Flag Coverage Δ
frontend 76.12% <82.01%> (-0.35%) ⬇️
javascript 76.12% <82.01%> (-0.35%) ⬇️
smokeTest ?
unitTest 76.12% <82.01%> (-0.35%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/common/config/server_config.py 86.63% <ø> (-0.81%) ⬇️
server/dataset/dataset.py 73.30% <66.66%> (-0.29%) ⬇️
server/common/rest.py 80.00% <80.00%> (-0.10%) ⬇️
server/compute/diffexp_cxg.py 80.70% <82.29%> (-3.62%) ⬇️
server/app/api/v3.py 95.86% <100.00%> (+0.17%) ⬆️
server/dataset/cxg_dataset.py 85.71% <100.00%> (-1.66%) ⬇️
compute/diffexp_cxg.py 80.70% <0.00%> (-3.62%) ⬇️
dataset/cxg_dataset.py 85.71% <0.00%> (-1.66%) ⬇️
common/config/server_config.py 86.63% <0.00%> (-0.81%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f5993b9...00e9930. Read the comment docs.

@bkmartinjr bkmartinjr marked this pull request as draft March 28, 2022 21:46
@bkmartinjr bkmartinjr marked this pull request as ready for review March 28, 2022 22:33
Copy link
Contributor

@bkmartinjr bkmartinjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have resolved my own prior comments, but would like additional reviewers for the new modifications.

@bkmartinjr bkmartinjr requested review from ebezzi and removed request for metakuni April 11, 2022 16:01
{
"sm.tile_cache_size": 8 * 1024 ** 3,
"py.init_buffer_bytes": 16 * 1024 ** 3,
"vfs.s3.region": "us-east-1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why the region is even required. Our buckets live in us-west-2 so it looks like this is ignored by TileDB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was just to have an explicit default in the code so we didn't have to rely on a TileDB default. You are right that the current configurations override it to match our current deployment. As or deployment evolves (eg, multi-region?), the config could change.

@@ -244,7 +240,7 @@ def __remap_indices(self, coord_range, coord_mask, coord_data):
if coord_mask is None:
return coord_range, coord_data

indices = np.where(coord_mask)[0]
indices = coord_mask.nonzero()[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, is this more efficient than using np.where?

Copy link
Contributor

@bkmartinjr bkmartinjr Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. They are literally identical, but np.where is deprecated in this case. See the numpy documentation here: https://numpy.org/doc/stable/reference/generated/numpy.where.html

It says:

Note

When only condition is provided, this function is a shorthand for np.asarray(condition).nonzero(). Using nonzero directly should be preferred, as it behaves correctly for subclasses. The rest of this documentation covers only the case where all three arguments are provided.

Copy link
Contributor

@atolopko-czi atolopko-czi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, with a few minor comments!

@@ -46,7 +46,7 @@ umap-learn==0.4.6
sentry-sdk[flask]==0.14.3
six==1.14.0
sqlalchemy==1.3.18
tiledb==0.13.1
tiledb==0.13.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stick with 0.13.2 or should we try 0.14.x now? @MDunitz has already found issues with 0.14.x, so maybe hold off if 0.13.2 is well-tested?

Either way, we need to keep Explorer at-or-ahead of Portal's TileDB version. And we should keep this version in sync with server/requirements.txt (in this repo).

if var_items == slice(None):
densedata += X_col_shift[:]
else:
densedata += X_col_shift.multi_index[var_items][""]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc, there was but 1 dataset that used X_col_shift and we're regenerating CXGs, so presumably not having to do deal with this moving forward. Even so, are we're validating somewhere that the CXG versions we're admitting no longer support X_col_shift? If not, do we want to remain defensive here and fail loudly if X_col_shift is used?

@@ -20,31 +18,35 @@ def main():
parser.add_argument("-vb", "--varB", help="obs variable:value to use for group B")
parser.add_argument("-t", "--trials", default=1, type=int, help="number of trials")
parser.add_argument(
"-a", "--alg", choices=("default", "generic", "cxg"), default="default", help="algorithm to use"
"-a", "--alg", choices=("cxg",), default="cxg", help="algorithm to use"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with but one choice, can we remove the option altogether?

Bruce Martin and others added 2 commits May 2, 2022 09:10
* first cut at new binary diffex routes

* lint

* clean up benchmark

* update diffex benchmark

* lower memory footprint by using incremental query for diffex

* update to tiledb 0.13.3

* fix misleading comment

* add additional unit tests for new route

* improve stats

* incremental mean and variance calculation to reduce memory footprint

* LINT!
@ebezzi ebezzi changed the title Brodgers/diffex Differential expression performance improvements May 2, 2022
@ebezzi ebezzi merged commit 1e9965a into main May 12, 2022
@ebezzi ebezzi deleted the brodgers/diffex branch May 12, 2022 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants