Diffex improvements: final steps to be done #235

Closed
17 of 40 tasks
blrnw3 opened this issue Mar 14, 2022 · 12 comments
blrnw3 (Contributor) commented Mar 14, 2022

Revised plan to pull out separate work-streams.

NOTE: dependencies are explicitly declared (e.g., item 2 depends on item 1, and items 2, 3, 4, and 5 can be done in parallel).

  1. Resolve possibly-blocking TileDB bugs:

  2. CXG schema update - goal: performance-tune the CXG X schema

    • Dependency unblocked: item 1
    • Owner: @ebezzi
    • Add the TileDB dense-read workaround to the Data Portal: add {"sm.query.dense.reader": "legacy"} to the TileDB context used for reading/writing dense arrays (it has no effect on sparse read/write). OBSOLETE with the TileDB update; a sketch follows this list.
    • Confirm with Bruce that CXG schema is finalized (there are some divergences from original recommendations that need to be verified).
    • Finish the changes in Data Portal PR #2119 and merge it.
    • Create a script to upgrade all legacy CXGs, wrapping the cxg_remaster.py script (part of DP PR #2119 above). In principle, the steps are, for each CXG:
      1. back up the existing X array
      2. create X_new using the above script
      3. swap the X arrays (X_new becomes X; X is deleted)
    • Convert all legacy CXGs. For each environment, suggest using an ad-hoc c6i.32xlarge with read/write S3 permissions to the bucket. After each step, verify Explorer still works well on a small sample of the converted CXGs.
      • Run the conversion script on all legacy CXGs in DEV.
      • Run the conversion script on all legacy CXGs in Staging.
      • Run the conversion script on all legacy CXGs in PROD.
      • See detailed plan here.
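
For reference (not part of the original issue): a minimal sketch of the dense-reader workaround above using tiledb-py; the CXG URI is hypothetical.

```python
import tiledb

# Workaround for the TileDB 0.13.x dense-read bug: force the legacy dense reader.
# Obsolete once the fixed TileDB release is adopted; no effect on sparse arrays.
ctx = tiledb.Ctx(tiledb.Config({"sm.query.dense.reader": "legacy"}))

# Hypothetical CXG X array URI, for illustration only.
x_uri = "s3://example-bucket/example-dataset.cxg/X"

with tiledb.open(x_uri, mode="r", ctx=ctx) as X:
    chunk = X[0:100, :]  # dense reads now go through the legacy reader
```
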
  3. Diffex algo tuning in Explorer - goal: faster diffex compute in the Explorer backend.

  4. Update deployment instance type and CDN config

    • Dependency unblocked: none
    • Owner: @atolopko-czi
    • Finalize the instance type. Note: Bruce has benchmark results that suggest we could deploy something smaller than a c6i.32xlarge (perhaps a c6i.16xlarge).
    • Define the HA policy for Dev & Staging and add it to Infra PR #526. In particular, why have a redundant (HA) config for Dev & Staging given the cost impact?
    • Update the Explorer backend config to match instance characteristics. Specifically, the tile-cache and Python query-buffer values depend on instance RAM (a sizing sketch follows this list).
    • Increase the connection read timeouts from 30s to 60s. This requires changes to both CloudFront and the LB.
    • Finish & merge Infra PR #526
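
Not from the issue itself: a rough sizing sketch, assuming the relevant knobs end up in the backend's TileDB context. The parameter names are TileDB's; the fractions are placeholders, not tuned values from the benchmark work.

```python
import tiledb

# Hypothetical heuristic: size the tile cache and per-query Python read buffers
# as fixed fractions of instance RAM (a c6i.32xlarge has 256 GiB).
INSTANCE_RAM_BYTES = 256 * 1024**3

ctx = tiledb.Ctx(tiledb.Config({
    "sm.tile_cache_size": str(INSTANCE_RAM_BYTES // 4),     # shared tile cache
    "py.init_buffer_bytes": str(INSTANCE_RAM_BYTES // 16),  # per-query read buffer
}))
```
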
  5. Integrate the new diffex postings-list encoding into the Explorer f/e and b/e.

  6. System and performance test

    • Dependency unblocked: 1, 2, 3, 4, 5
    • Owner: @bkmartinjr
    • Synthetic CLI benchmark (i.e., not using the front-end) to verify that b/e perf is as expected; a rough sketch follows.
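
Not part of the original issue: an illustrative benchmark driver in Python. The host, dataset, endpoint path, and request body are assumptions, not the Explorer API spec.

```python
import json
import time

import requests

# Hypothetical Explorer deployment and dataset.
BASE_URL = "https://example-explorer-host/d/example-dataset.cxg"

# Assumed shape of a diffex request: two cell selections compared for top-N genes.
payload = {
    "mode": "topN",
    "count": 15,
    "set1": {"filter": {"obs": {"index": [[0, 750_000]]}}},
    "set2": {"filter": {"obs": {"index": [[750_000, 1_500_000]]}}},
}

t0 = time.perf_counter()
resp = requests.post(
    f"{BASE_URL}/api/v0.2/diffexp/obs",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    timeout=120,
)
elapsed = time.perf_counter() - t0
print(f"status={resp.status_code} elapsed={elapsed:.1f}s")  # target: <20s on ~1.5M cells
```
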
  7. Deploy diffexp limit update:

  8. Analytics support

    • Dependency unblocked: items 7, 8
    • Owner: @tihuan
    • PR: add the Plausible tags so that our UXR team can also observe the usage of differential expression.
  9. Remove TileDB 0.13.1 workarounds

    • Dependency unblocked: TileDB 0.13.2 shipped and fixes verified.
    • Owner: @bkmartinjr

Ben's original notes for context

Apologies that I couldn't finish off this work. Here are the remaining steps as I see them

Steps to do, in order

  1. Apply Bruce’s new encoding scheme to the diffex routes in Explorer
    • Bruce has context on this
  2. Investigate the probable TileDB 0.13 bug - see this PR description. This may be the same bug that the WMG team is facing. It needs fixing before proceeding with any of the following.
    • Update: this is not the same bug WMG found. It only affects dense-array reading. The workaround is to set the config option {"sm.query.dense.reader": "legacy"}.
    • TileDB expects to have a released fix in their next release (~2 weeks).
    • PR "bump tiledb to 0.13.1, add work-around for dense read bug" #237 bumps the tiledb version and inserts the workaround (and a test case to validate it).
  3. Investigate the second TileDB bug, which requires an expensive list cast of the ndarray in the dense read - see the comment in the Explorer PR below.
  4. Merge infra PRs
    1. https://github.com/chanzuckerberg/single-cell-infra/pull/526
    2. https://github.com/chanzuckerberg/single-cell-infra/pull/525
  5. Merge data portal PR
  6. Merge Explorer PR
  7. Use the cxg_remaster.py script (part of the DP PR above) to update all staging CXGs to the optimized TileDB schema.
    1. Use an ad-hoc c6i.32xlarge with read/write S3 permissions to the bucket.
    2. I've tested the script in staging but only converted a few of the CXGs. Once started, it should be able to run to completion on its own.
    3. The final step is to swap all the remastered X_new arrays over to X (back up X as X_old first). This should likely be scripted (see the sketch after this list).
  8. Once the Explorer changes are in staging, test them. Diffex on 1.5M cells should take <20s on a c6i.32xlarge (pick the biggest dataset that will load).
  9. Repeat steps 7-8 for prod
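
Not from the original notes: a sketch of the swap step for a single remastered CXG, using TileDB's VFS. The bucket and dataset URIs are hypothetical, and it assumes cxg_remaster.py has already written X_new alongside the original X.

```python
import tiledb

# Assumes AWS credentials for the bucket are available in the environment.
vfs = tiledb.VFS()
cxg = "s3://example-bucket/example-dataset.cxg"  # hypothetical CXG location

vfs.move_dir(f"{cxg}/X", f"{cxg}/X_old")   # back up the original X array
vfs.move_dir(f"{cxg}/X_new", f"{cxg}/X")   # promote the remastered array

# After verifying the dataset still loads in Explorer, the backup can be dropped:
# vfs.remove_dir(f"{cxg}/X_old")
```
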

Cooldown story: add feature-level analytics support so we can gather metrics on diffex usage

blrnw3 (Contributor, Author) commented Mar 14, 2022

@maniarathi here's the remaining diffex work. I tagged all our tiledb experts and will defer to the team to see who gets the privilege of doing the work :D.
Thanks everyone

atolopko-czi (Contributor):

Thanks @blrnw3! 👋

maniarathi (Contributor):
@blrnw3 Thank you!

bkmartinjr (Contributor):
@atolopko-czi @maniarathi - could you review the revised plan, and add owners/tasks/etc?

Please note the parallelism possible. Much of the remaining work can occur in parallel.

maniarathi (Contributor):

@atolopko-czi @metakuni @tihuan, I have made a suggestion on how to split up the work. Let me know if this looks OK to you.

bkmartinjr (Contributor) commented Mar 25, 2022

Related: item one is complete (bugs reported, temporary work-arounds in place), so the work for items 2, 3, 4, and 5 is unblocked.

tihuan (Contributor) commented Apr 7, 2022

For item 8 (analytics support), do we already know what we want to track, or do we still need to define that?

Thank you!

bkmartinjr (Contributor):

CC: @maniarathi - I believe you added this. Is there a spec?

@tihuan - if not, I suggest we keep it simple (an illustrative event payload is sketched after this list):

  • dataset ID
  • num cells in selection set
  • time to complete (from f/e perspective)
  • success/failure (i.e., 200 response or not)
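
Not part of the thread, purely illustrative: if these were sent as server-side events via Plausible's Events API, the call could look roughly like this. The domain, event name, and prop names are placeholders.

```python
import requests

# Hypothetical server-side Plausible event carrying the metrics listed above.
requests.post(
    "https://plausible.io/api/event",
    headers={
        "User-Agent": "explorer-backend",  # ideally forwarded from the client
        "X-Forwarded-For": "203.0.113.1",  # client IP, if forwarding it
        "Content-Type": "application/json",
    },
    json={
        "domain": "cellxgene.example.org",
        "name": "diffexp",
        "url": "https://cellxgene.example.org/d/example-dataset.cxg",
        "props": {
            "dataset_id": "example-dataset",
            "num_cells": "1500000",
            "duration_ms": "18000",
            "success": "true",
        },
    },
    timeout=10,
)
```
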

tihuan (Contributor) commented Apr 7, 2022

Sounds great, thank you! I'm thinking it might be more reliable if those can be tracked in the BE, unless some of the info (besides the time to complete) is only available in the FE?

bkmartinjr assigned ebezzi and unassigned metakuni on Apr 11, 2022
maniarathi (Contributor):
@bkmartinjr @tihuan This was adding the Plausible tags so that our UXR team can also observe the usage of differential expression.

tihuan (Contributor) commented Apr 11, 2022

Thanks @maniarathi! Plausible tags to record the metrics Bruce mentioned above, or something else?

Plausible also allows sending events from the server for reliable tracking 🙆‍♂️

bkmartinjr (Contributor):
I recommend client-side events as this allows the capture of failures and end-to-end timing as well. We will be blind to many of those without doing this on the client-side.

atolopko-czi removed their assignment on May 9, 2022
ebezzi closed this as completed on May 17, 2022