Diffex improvements: final steps to be done #235

Closed
17 of 40 tasks
blrnw3 opened this issue Mar 14, 2022 · 12 comments
blrnw3 (Contributor) commented Mar 14, 2022

Revised plan to pull out separate work-streams.

NOTE: dependencies are explicitly declared (e.g., item 2 depends on item 1, and items 2, 3, 4, and 5 can be done in parallel).

  1. Resolve possibly-blocking TileDB bugs:

  2. CXG schema update - goal: performance-tune the CXG X schema

    • Dependency unblocked: item 1
    • Owner: @ebezzi
    • Add the TileDB dense-read workaround to the Data Portal: add {"sm.query.dense.reader": "legacy"} to the TileDB context used for reading/writing dense arrays (it has no effect on sparse read/write). OBSOLETE with the TileDB update; a sketch follows this list.
    • Confirm with Bruce that CXG schema is finalized (there are some divergences from original recommendations that need to be verified).
    • Finish the changes in Data Portal PR #2119 and merge it.
    • Create a script to upgrade all legacy CXGs, wrapping the cxg_remaster.py script (part of DP PR #2119 above). In principle, the steps are, for each CXG:
      1. back up the existing X array
      2. create X_new using the above script
      3. swap the X arrays (X_new becomes X; X is deleted)
    • Convert all legacy CXGs. For each environment, suggest using an ad-hoc c6i.32xlarge with read/write S3 permissions to the bucket. After each step, verify Explorer still works well on a small sample of the converted CXGs.
      • Run the conversion script on all legacy CXGs in DEV.
      • Run the conversion script on all legacy CXGs in Staging.
      • Run the conversion script on all legacy CXGs in PROD.
      • See detailed plan here.
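
For reference (not part of the original issue): a minimal sketch of the dense-reader workaround above using tiledb-py; the CXG URI is hypothetical.

```python
import tiledb

# Workaround for the TileDB 0.13.x dense-read bug: force the legacy dense reader.
# Obsolete once the fixed TileDB release is adopted; no effect on sparse arrays.
ctx = tiledb.Ctx(tiledb.Config({"sm.query.dense.reader": "legacy"}))

# Hypothetical CXG X array URI, for illustration only.
x_uri = "s3://example-bucket/example-dataset.cxg/X"

with tiledb.open(x_uri, mode="r", ctx=ctx) as X:
    chunk = X[0:100, :]  # dense reads now go through the legacy reader
```
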
  3. Diffex algo tuning in Explorer - goal: faster diffex compute in the Explorer backend.

  4. Update deployment instance type and CDN config

    • Dependency unblocked: none
    • Owner: @atolopko-czi
    • Finalize the instance type. Note: Bruce has benchmark results that suggest we could deploy something smaller than a c6i.32xlarge (perhaps a c6i.16xlarge).
    • Define the HA policy for Dev & Staging and add it to Infra PR #526. In particular, why have a redundant (HA) config for Dev & Staging given the cost impact?
    • Update the Explorer backend config to match instance characteristics. Specifically, the tile-cache and Python query-buffer values depend on instance RAM (a sizing sketch follows this list).
    • Increase the connection read timeouts from 30s to 60s. This requires changes to both CloudFront and the LB.
    • Finish & merge Infra PR #526
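
Not from the issue itself: a rough sizing sketch, assuming the relevant knobs end up in the backend's TileDB context. The parameter names are TileDB's; the fractions are placeholders, not tuned values from the benchmark work.

```python
import tiledb

# Hypothetical heuristic: size the tile cache and per-query Python read buffers
# as fixed fractions of instance RAM (a c6i.32xlarge has 256 GiB).
INSTANCE_RAM_BYTES = 256 * 1024**3

ctx = tiledb.Ctx(tiledb.Config({
    "sm.tile_cache_size": str(INSTANCE_RAM_BYTES // 4),     # shared tile cache
    "py.init_buffer_bytes": str(INSTANCE_RAM_BYTES // 16),  # per-query read buffer
}))
```
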
  5. Integrate the new diffex postings-list encoding into the Explorer f/e and b/e.

  6. System and performance test

    • Dependency unblocked: 1, 2, 3, 4, 5
    • Owner: @bkmartinjr
    • Synthetic CLI benchmark (i.e., not using the front-end) to verify that b/e perf is as expected; a rough sketch follows.
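
Not part of the original issue: an illustrative benchmark driver in Python. The host, dataset, endpoint path, and request body are assumptions, not the Explorer API spec.

```python
import json
import time

import requests

# Hypothetical Explorer deployment and dataset.
BASE_URL = "https://example-explorer-host/d/example-dataset.cxg"

# Assumed shape of a diffex request: two cell selections compared for top-N genes.
payload = {
    "mode": "topN",
    "count": 15,
    "set1": {"filter": {"obs": {"index": [[0, 750_000]]}}},
    "set2": {"filter": {"obs": {"index": [[750_000, 1_500_000]]}}},
}

t0 = time.perf_counter()
resp = requests.post(
    f"{BASE_URL}/api/v0.2/diffexp/obs",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    timeout=120,
)
elapsed = time.perf_counter() - t0
print(f"status={resp.status_code} elapsed={elapsed:.1f}s")  # target: <20s on ~1.5M cells
```
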
  7. Deploy diffexp limit update:

  8. Analytics support

    • Dependency unblocked: items 7, 8
    • Owner: @tihuan
    • PR: add the Plausible tags so that our UXR team can also observe the usage of differential expression.
  9. Remove TileDB 0.13.1 workarounds

    • Dependency unblocked: TileDB 0.13.2 shipped and fixes verified.
    • Owner: @bkmartinjr

Ben's original notes for context

Apologies that I couldn't finish off this work. Here are the remaining steps as I see them

Steps to do, in order

  1. Apply Bruce’s new encoding scheme to the diffex routes in Explorer
    • Bruce has context on this
  2. Investigate the probable TileDB 0.13 bug - see this PR description. This may be the same bug that the WMG team is facing. It needs fixing before proceeding with any of the following.
    • Update: this is not the same bug WMG found. It only affects dense-array reading. The workaround is to set the config option {"sm.query.dense.reader": "legacy"}.
    • TileDB expects to have a released fix in their next release (~2 weeks).
    • PR "bump tiledb to 0.13.1, add work-around for dense read bug" #237 bumps the tiledb version and inserts the workaround (and a test case to validate it).
  3. Investigate the second TileDB bug, which requires an expensive list cast of the ndarray in the dense read - see the comment in the Explorer PR below.
  4. Merge infra PRs
    1. https://github.com/chanzuckerberg/single-cell-infra/pull/526
    2. https://github.com/chanzuckerberg/single-cell-infra/pull/525
  5. Merge data portal PR
  6. Merge Explorer PR
  7. Use the cxg_remaster.py script (part of the DP PR above) to update all staging CXGs to the optimized TileDB schema.
    1. Use an ad-hoc c6i.32xlarge with read/write S3 permissions to the bucket.
    2. I've tested the script in staging but only converted a few of the CXGs. Once started, it should be able to run to completion on its own.
    3. The final step is to swap all the remastered X_new arrays over to X (back up X as X_old first). This should likely be scripted (see the sketch after this list).
  8. Once the Explorer changes are in staging, test them. Diffex on 1.5M cells should take <20s on a c6i.32xlarge (pick the biggest dataset that will load).
  9. Repeat steps 7-8 for prod
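
Not from the original notes: a sketch of the swap step for a single remastered CXG, using TileDB's VFS. The bucket and dataset URIs are hypothetical, and it assumes cxg_remaster.py has already written X_new alongside the original X.

```python
import tiledb

# Assumes AWS credentials for the bucket are available in the environment.
vfs = tiledb.VFS()
cxg = "s3://example-bucket/example-dataset.cxg"  # hypothetical CXG location

vfs.move_dir(f"{cxg}/X", f"{cxg}/X_old")   # back up the original X array
vfs.move_dir(f"{cxg}/X_new", f"{cxg}/X")   # promote the remastered array

# After verifying the dataset still loads in Explorer, the backup can be dropped:
# vfs.remove_dir(f"{cxg}/X_old")
```
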

Cooldown story: add feature-level analytics support so we can gather metrics on diffex usage

blrnw3 (Contributor, Author) commented Mar 14, 2022

@maniarathi here's the remaining diffex work. I tagged all our tiledb experts and will defer to the team to see who gets the privilege of doing the work :D.
Thanks everyone

atolopko-czi (Contributor):

Thanks @blrnw3! 👋

maniarathi (Contributor):
@blrnw3 Thank you!

bkmartinjr (Contributor):
@atolopko-czi @maniarathi - could you review the revised plan, and add owners/tasks/etc?

Please note the parallelism possible. Much of the remaining work can occur in parallel.

maniarathi (Contributor):

@atolopko-czi @metakuni @tihuan, I have made a suggestion on how to split up the work. Let me know if this looks OK to you.

bkmartinjr (Contributor) commented Mar 25, 2022

Related: item one is complete (bugs reported, temporary work-arounds in place), so the work for items 2, 3, 4, and 5 is unblocked.

tihuan (Contributor) commented Apr 7, 2022

For item 8 (analytics support), do we already know what we want to track, or do we still need to define that?

Thank you!

bkmartinjr (Contributor):

CC: @maniarathi - I believe you added this. Is there a spec?

@tihuan - if not, I suggest we keep it simple (an illustrative event payload is sketched after this list):

  • dataset ID
  • num cells in selection set
  • time to complete (from f/e perspective)
  • success/failure (i.e., 200 response or not)
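
Not part of the thread, purely illustrative: if these were sent as server-side events via Plausible's Events API, the call could look roughly like this. The domain, event name, and prop names are placeholders.

```python
import requests

# Hypothetical server-side Plausible event carrying the metrics listed above.
requests.post(
    "https://plausible.io/api/event",
    headers={
        "User-Agent": "explorer-backend",  # ideally forwarded from the client
        "X-Forwarded-For": "203.0.113.1",  # client IP, if forwarding it
        "Content-Type": "application/json",
    },
    json={
        "domain": "cellxgene.example.org",
        "name": "diffexp",
        "url": "https://cellxgene.example.org/d/example-dataset.cxg",
        "props": {
            "dataset_id": "example-dataset",
            "num_cells": "1500000",
            "duration_ms": "18000",
            "success": "true",
        },
    },
    timeout=10,
)
```
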

tihuan (Contributor) commented Apr 7, 2022

Sounds great, thank you! I'm thinking it might be more reliable if those can be tracked in the BE, unless some of the info (besides the time to complete) is only available in the FE?

bkmartinjr assigned ebezzi and unassigned metakuni on Apr 11, 2022
maniarathi (Contributor):
@bkmartinjr @tihuan This was adding the Plausible tags so that our UXR team can also observe the usage of differential expression.

tihuan (Contributor) commented Apr 11, 2022

Thanks @maniarathi! Plausible tags to record the metrics Bruce mentioned above, or something else?

Plausible also allows sending events from the server for reliable tracking 🙆‍♂️

bkmartinjr (Contributor):
I recommend client-side events as this allows the capture of failures and end-to-end timing as well. We will be blind to many of those without doing this on the client-side.

atolopko-czi removed their assignment on May 9, 2022
ebezzi closed this as completed on May 17, 2022