Differential expression performance improvements #234
Conversation
```diff
@@ -78,7 +78,7 @@
 limits:
   column_request_max: 32
   diffexp_cellcount_max: 50000
```
the setting for prod is in single-cell-infra, PR incoming
```python
def diffexp_ttest(adaptor, maskA, maskB, top_n=8, diffexp_lfc_cutoff=0.01, arr="X"):
    """
```
comment moved from diffex_generic
```python
is_sparse = matrix.schema.sparse

if is_sparse:
```
custom parallelism logic, unneeded
```python
def diffexp_ttest_from_mean_var(meanA, varA, nA, meanB, varB, nB, top_n, diffexp_lfc_cutoff):
```
copied from diffex_generic. No changes.
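For context on what a mean/variance-based t-test computes: Welch's t statistic and its degrees of freedom can be derived entirely from per-group summary statistics, so the full matrix never needs a second pass. A hedged sketch of that core calculation (the real `diffexp_ttest_from_mean_var` additionally ranks the top-N genes and applies the log-fold-change cutoff, which is not reproduced here; `welch_ttest_from_mean_var` is a stand-in name):

```python
import numpy as np

def welch_ttest_from_mean_var(meanA, varA, nA, meanB, varB, nB):
    """Welch's t statistic and degrees of freedom from per-group summaries.

    Sketch of the core calculation only; ranking and lfc-cutoff handling
    from the real function are intentionally omitted.
    """
    meanA = np.asarray(meanA, dtype=np.float64)
    varA = np.asarray(varA, dtype=np.float64)
    meanB = np.asarray(meanB, dtype=np.float64)
    varB = np.asarray(varB, dtype=np.float64)

    sA = varA / nA  # per-gene squared standard error, group A
    sB = varB / nB  # per-gene squared standard error, group B
    with np.errstate(divide="ignore", invalid="ignore"):
        t = (meanA - meanB) / np.sqrt(sA + sB)
        # Welch-Satterthwaite approximation for the degrees of freedom
        dof = (sA + sB) ** 2 / (sA ** 2 / (nA - 1) + sB ** 2 / (nB - 1))
    return t, dof
```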
```python
    return result


def mean_var_n(X, X_approximate_distribution=XApproximateDistribution.NORMAL):
```
copied from diffex_generic. No changes.
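The PR's changelog mentions "incremental mean and variance calculation to reduce memory footprint." One standard way to do that over chunked reads is the pairwise-combine update of Chan et al., which merges per-chunk (mean, variance, count) summaries without ever materializing the full matrix. A sketch of the general technique (not necessarily the exact code in `mean_var_n`):

```python
import numpy as np

def combine_mean_var(meanA, varA, nA, meanB, varB, nB):
    """Merge two (mean, population variance, count) summaries into one.

    Pairwise-combine update (Chan et al.): lets a reader accumulate
    mean/variance chunk by chunk. Sketch of the general technique only.
    """
    n = nA + nB
    delta = meanB - meanA
    mean = meanA + delta * (nB / n)
    # M2 = sum of squared deviations from the mean, per summary
    M2 = varA * nA + varB * nB + delta ** 2 * (nA * nB / n)
    return mean, M2 / n, n
```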
server/compute/diffexp_cxg.py (outdated)
```python
    return mean, v

def mean_var_cnt_dense(matrix, _, rows):
    # TODO: why is the list cast still necessary? TileDB is throwing an error
```
AFAIK, the cast to a Python list should not be necessary (and it is moderately expensive). If TDB is really throwing an error, I recommend that we chase that down before merging, as it is potentially a bug of some sort.
Agreed, will add to the task list
Update: this is a bug in recent TileDB-Py versions. In the queue for a fix in TileDB 0.13.2, at which point multi_index should again support either list or ndarray.
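Until that fix lands, the workaround under discussion is simply casting the coordinate ndarray to a Python list before handing it to `multi_index`. Illustrative only: `rows` here stands in for whatever coordinate array is being queried.

```python
import numpy as np

rows = np.array([0, 3, 7], dtype=np.int64)

# Affected TileDB-Py versions reject an ndarray of coordinates in
# multi_index, so cast to a Python list first. The cast is moderately
# expensive for large selections, hence the desire to drop it post-fix.
coords = rows.tolist()
```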
server/dataset/cxg_dataset.py (outdated)
```diff
@@ -23,7 +23,12 @@
 class CxgDataset(Dataset):
     # TODO: The tiledb context parameters should be a configuration option
     tiledb_ctx = tiledb.Ctx(
-        {"sm.tile_cache_size": 8 * 1024 * 1024 * 1024, "sm.num_reader_threads": 32, "vfs.s3.region": "us-east-1"}
+        {
+            "sm.tile_cache_size": 16 * 1024 ** 3,
```
will these changes be accompanied with a change of instance type? The large tile_cache_size and py.init_buffer_bytes are suitable for a high-mem instance, but not for the current 64GB instances. Thought: should these be pulled out into a deployment configuration (eg, set via env vars), or made adaptive (eg, tile cache is X% of RAM on the instance)?
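A sketch of the "deployment configuration or adaptive" idea raised above: consult an environment-variable override first, else size the cache as a fraction of physical RAM. The env var name and the 25% default fraction are assumptions for illustration, and the `os.sysconf` keys are POSIX/Linux-only.

```python
import os

def tile_cache_bytes(fraction=0.25, env_var="CXG_TILE_CACHE_BYTES"):
    """Size the TileDB tile cache from a deployment override or from RAM.

    The env var name and the default fraction are illustrative
    assumptions, not existing configuration in this repo.
    """
    override = os.environ.get(env_var)
    if override is not None:
        return int(override)
    # Physical RAM = page size * number of physical pages (POSIX/Linux)
    phys_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return int(phys_bytes * fraction)
```

The returned value would then feed `"sm.tile_cache_size"` in the `tiledb.Ctx` config dict.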
Yes, there's an infra PR here: https://github.com/chanzuckerberg/single-cell-infra/pull/526
Re: making them dynamic based on the instance - agreed!
server/dataset/cxg_dataset.py (outdated)
```diff
@@ -265,18 +270,12 @@ def get_X_array(self, obs_mask=None, var_mask=None):
     if obs_items == slice(None) and var_items == slice(None):
         data = X[:, :]
```
shouldn't this be an unordered query as well?
Good spot. Though is this even used? TODO: check whether it is.
It is used. Added a commit to simplify the logic, as multi_index[] can provide both query types -- we don't need the branch.
Most of my comments are small nits or could be done in a future PR (eg, better configuration).
Items that should be resolved pre-merge:
- cxg_dataset.py - looks like the X query is missing the unordered param
- diffexp_cxg.py - the addition of the list() cast in mean_var_cnt_dense is either a bug or unnecessary (the comment implies the former, which should be run to ground)
- as Ben noted, the TileDB bug needs to be fixed
Codecov Report
```
@@            Coverage Diff             @@
##             main     #234      +/-   ##
==========================================
- Coverage   76.47%   76.12%   -0.35%
==========================================
  Files          92       90       -2
  Lines        6758     6660      -98
==========================================
- Hits         5168     5070      -98
  Misses       1590     1590
```
I have resolved my own prior comments, but would like additional reviewers for the new modifications.
```python
{
    "sm.tile_cache_size": 8 * 1024 ** 3,
    "py.init_buffer_bytes": 16 * 1024 ** 3,
    "vfs.s3.region": "us-east-1",
```
I was wondering why the region is even required. Our buckets live in us-west-2, so it looks like this is ignored by TileDB.
The idea was just to have an explicit default in the code so we didn't have to rely on a TileDB default. You are right that the current configurations override it to match our current deployment. As our deployment evolves (eg, multi-region?), the config could change.
```diff
@@ -244,7 +240,7 @@ def __remap_indices(self, coord_range, coord_mask, coord_data):
     if coord_mask is None:
         return coord_range, coord_data

-    indices = np.where(coord_mask)[0]
+    indices = coord_mask.nonzero()[0]
```
Out of curiosity, is this more efficient than using np.where?
No, they are literally identical, but the single-argument form of np.where is discouraged in favor of nonzero. See the numpy documentation here: https://numpy.org/doc/stable/reference/generated/numpy.where.html

It says:

> Note: When only condition is provided, this function is a shorthand for np.asarray(condition).nonzero(). Using nonzero directly should be preferred, as it behaves correctly for subclasses. The rest of this documentation covers only the case where all three arguments are provided.
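A quick demonstration of the equivalence (the mask values are arbitrary):

```python
import numpy as np

mask = np.array([False, True, False, True, True])

# Identical results; numpy's docs recommend nonzero() for the
# single-argument case because it behaves correctly for subclasses.
where_idx = np.where(mask)[0]
nonzero_idx = mask.nonzero()[0]
```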
Approving, with a few minor comments!
hosted/requirements.txt (outdated)
```diff
@@ -46,7 +46,7 @@ umap-learn==0.4.6
 sentry-sdk[flask]==0.14.3
 six==1.14.0
 sqlalchemy==1.3.18
-tiledb==0.13.1
+tiledb==0.13.2
```
Stick with 0.13.2, or should we try 0.14.x now? @MDunitz has already found issues with 0.14.x, so maybe hold off if 0.13.2 is well-tested?
Either way, we need to keep Explorer at-or-ahead of Portal's TileDB version, and we should keep this version in sync with server/requirements.txt (in this repo).
```python
if var_items == slice(None):
    densedata += X_col_shift[:]
else:
    densedata += X_col_shift.multi_index[var_items][""]
```
iirc, there was but 1 dataset that used X_col_shift, and we're regenerating CXGs, so presumably we won't have to deal with this moving forward. Even so, are we validating somewhere that the CXG versions we're admitting no longer support X_col_shift? If not, do we want to remain defensive here and fail loudly if X_col_shift is used?
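A sketch of the "fail loudly" option: every name below (`validate_cxg`, the version allow-list, the `has_col_shift` flag) is a hypothetical illustration, not the real cellxgene API or schema versioning.

```python
# Hypothetical allow-list of admitted CXG schema versions (illustration only)
CXG_SUPPORTED_VERSIONS = {"0.2.0", "0.3.0"}

def validate_cxg(version, has_col_shift):
    """Fail loudly at dataset-admission time rather than at query time.

    All names here are assumptions made for illustration.
    """
    if version not in CXG_SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported CXG version: {version}")
    if has_col_shift:
        raise ValueError("X_col_shift is no longer supported; regenerate the CXG")
```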
```diff
@@ -20,31 +18,35 @@ def main():
     parser.add_argument("-vb", "--varB", help="obs variable:value to use for group B")
     parser.add_argument("-t", "--trials", default=1, type=int, help="number of trials")
     parser.add_argument(
-        "-a", "--alg", choices=("default", "generic", "cxg"), default="default", help="algorithm to use"
+        "-a", "--alg", choices=("cxg",), default="cxg", help="algorithm to use"
```
with but one choice, can we remove the option altogether?
* first cut at new binary diffex routes
* lint
* clean up benchmark
* update diffex benchmark
* lower memory footprint by using incremental query for diffex
* update to tiledb 0.13.3
* fix misleading comment
* add additional unit tests for new route
* improve stats
* incremental mean and variance calculation to reduce memory footprint
* LINT!
Reviewers
Functional: @atolopko-czi
Readability: @ebezzi