-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Embeddings search experimental API #1164
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1164 +/- ##
==========================================
+ Coverage 91.26% 91.41% +0.15%
==========================================
Files 80 82 +2
Lines 6329 6463 +134
==========================================
+ Hits 5776 5908 +132
- Misses 553 555 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
6bf1181
to
639e64c
Compare
api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py
Outdated
Show resolved
Hide resolved
""" | ||
|
||
|
||
def find_nearest_obs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the API side, it would be nice if this could produce output that can be directly with sklearn style classes. For example, if this returned a KNNTransformer subclass, that could be used directly with the KNeighborsClassifier and KNeighborsRegressor classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivirshup I like this idea very much, but I'm not quite sure it's workable (albeit I'm not as familiar with those APIs)...
Those scikit-learn classes seem oriented around the scenario where you're providing either all the points (in the "universe") or the complete distance matrix for them. Here we're working with a more limited view of the query points and their neighbor distances; we don't have or want the complete distance matrix, and actually we don't even have the coordinates of the neighbors immediately handy.
Do you think the shoe fits? I see there's some stuff about the "K neighbors graph" that might be relevant, but I'm not personally familiar enough to use them in an unconventional/advanced way like this.
|
||
|
||
def find_nearest_obs( | ||
embedding_metadata: Dict[str, Any], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this use a different way to specify an embedding than get_embedding
does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_embedding
is a little low level in that it wants the full URI to the embeddings TileDB array, which isn't actually needed to find the index. embedding_metadata
is the information returned from get_embedding_metadata()
which seems like the appropriate level (especially in view of #1181 wherein we will actually put the relative URIs to the index arrays in there), although of course it'd be nice if it were more typesafe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mlin can we change it to?
embedding_name: str,
organism: str,
census_version: str
I think that provides an easier entry point to users and it aligns to get_embedding_metadata_by_name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't uniquely refer to an embedding, right? It refers to the newest embedding with this name?
I think we need to be able to non-ambiguously specify an embedding here since it's really important that the embedding being queried matches what we projected our data into.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment the combination of the three pieces does match to a unique embedding matrix, unless there is an edge-case I'm not catching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for get_embedding_metadata_by_name
says that it can be ambiguous and returns the latest one.
I think the unique value for embeddings is in the "id"
key of the embedding metadata, while "embedding_name"
just says what method was used.
cellxgene-census/api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding.py
Lines 171 to 172 in 7612fb3
"""Return metadata for a specific embedding. If more embeddings match the query parameters, | |
the most recent one will be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right "id"
is unique, but the IDs are "cxg-czi-[\d+]"
or "cxg-contributed-[\d+]"
-- so no human-friendly.
The reason for If more embeddings match the query parameters, the most recent one will be returned.
is that we ended up having two versions of scVI for the 2023-12-15 Census version, the second effectively being a replacement of the first one (the Yosef lab recommended to that first needed some fixing).
So now refreshing my memory that one is actually the only edge-case we have, but it is intended for our users to use the latest version. I don't anticipate having to have multiple versions of the same embedding with the same name, unless one is the replacement of the other.
api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me except one comment in API signature
|
||
|
||
def find_nearest_obs( | ||
embedding_metadata: Dict[str, Any], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mlin can we change it to?
embedding_name: str,
organism: str,
census_version: str
I think that provides an easier entry point to users and it aligns to get_embedding_metadata_by_name
def _resolve_embedding_index( | ||
embedding_metadata: Dict[str, Any], | ||
mirror: Optional[str] = None, | ||
) -> Optional[Tuple[str, str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ebezzi new index resolution method here
@ebezzi @pablo-gar @ivirshup Updated this to resolve indexes through mirrors/contributions json and remove the need for caller to use |
api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py
Show resolved
Hide resolved
…_embedding_search.py Co-authored-by: Isaac Virshup <ivirshup@gmail.com>
Adds two new functions to
cellxgene_census.experimental
:find_nearest_obs
uses TileDB-Vector-Search indexes of Census embeddings to find nearest neighbors of given embedding vectors (in an AnnData obsm layer). Census cell similarity search: experimental Python API for searching given AnnData #1114predict_obs_metadata
uses the nearest neighbors to predict metadata attributes like cell_type and tissue_general for the query cells. Naive initial implementation is just a starting point to start experimenting with. Census cell similarity search: experimental Python API for metadata prediction #1115The TileDB-Vector-Search query speed seems to be very S3-latency-sensitive, even moreso than typical Census queries. It's many times faster to run from within AWS us-west-2 than externally.