Learning to find similar semantics from embeddings #186
Comments
Let me lay out my thought process and what I've tried. We're in the wild west here, and I think it's good to lay out all assumptions and decisions. I'm specifically exploring aquaculture in Bali in this issue. Given:
Perform the following procedure:
There will be a distribution over Tn, so we'll run this procedure multiple times for each Strategy. We should be able to see which Strategies surface the most relevant results the fastest with this setup, and we can provide metrics on how fast. For starters, I've implemented a few Strategies in Qdrant, and I'll make this code available on a benchmarking branch very soon. My current Strategies are:
In a (currently buggy?) version, these results perform as indicated in the following chart (the Y-axis is recall). The chart suggests that the "Representative" strategy is currently the best. I'm also adding at least 10 labels per iteration to test some things and to get a reasonable speed through each simulation; this should be removed later. Things are a little loose right now, but I'm sharing in the interest of moving a little faster. Any comments or questions are absolutely welcome!
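Since the benchmarking branch isn't up yet, here is a minimal sketch of how such an iterative retrieval loop against Qdrant could look. The collection name, the use of `with_vectors`, and the re-centering step are my assumptions, not the actual Strategies from the branch:

```python
# Hedged sketch of the benchmarking loop: assumes a local Qdrant instance
# with chip embeddings already uploaded. Names are placeholders.
import numpy as np
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "clay-chips"  # hypothetical collection of chip embeddings


def run_strategy(seed_vector, positive_ids, n_iterations=10, k=10):
    """Query the k nearest chips per iteration and report cumulative recall."""
    positives = set(positive_ids)
    found = set()
    query = np.asarray(seed_vector, dtype=float)
    for step in range(n_iterations):
        hits = client.search(
            collection_name=COLLECTION,
            query_vector=query.tolist(),
            limit=k,
            with_vectors=True,
        )
        found |= {h.id for h in hits if h.id in positives}
        # A "Strategy" decides how to update the query between iterations;
        # as one illustrative example, re-center it on the positives found.
        confirmed = [h.vector for h in hits if h.id in positives]
        if confirmed:
            query = np.mean(confirmed, axis=0)
        print(f"iteration {step}: recall = {len(found) / len(positives):.2f}")
```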
Part of the issue is which chips I'm including as positives. I think the original dataset includes chips with ANY overlap of the aquaculture polygons, but we should probably require at least some percentage of chip coverage. Let me change that and send it back out.
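For concreteness, a coverage check like that could be a one-liner with shapely; the geometries and the 20% threshold below are illustrative, not values from the dataset:

```python
# Hedged sketch: keep a chip as a positive only if the aquaculture
# polygons cover at least some fraction of its area.
from shapely.geometry import box
from shapely.ops import unary_union


def is_positive(chip_bounds, aquaculture_polygons, min_coverage=0.2):
    """chip_bounds: (minx, miny, maxx, maxy) in the same CRS as the polygons."""
    chip = box(*chip_bounds)
    overlap = chip.intersection(unary_union(aquaculture_polygons)).area
    return overlap / chip.area >= min_coverage
```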
Did you try Euclidean similarity as well? Not sure which is better. Making similarity searches using multiple dimensions and vector arithmetic is an unsolved problem in my opinion. It is still a lot of trial and error, and search queries that combine vectors are often done on the average / sum / subtraction of vectors. But that could potentially become more sophisticated. Maybe @leothomas can chime in here too.
Indeed, trial and error beats theory here. I understand Euclidean similarity as the "distance between the tips of the arrows"; it would hence also suffer from the confusion of irrelevant dimensions.
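One fact worth keeping in mind: for L2-normalized embeddings, Euclidean distance and cosine similarity rank results identically, since ||a - b||² = 2(1 - cos(a, b)). A quick numpy check (the vectors are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=768), rng.normal(size=768)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)

cos = float(a @ b)
euclid_sq = float(np.sum((a - b) ** 2))
print(np.isclose(euclid_sq, 2 * (1 - cos)))  # True for unit vectors
```

So on normalized vectors the choice between the two only matters if the normalization itself is in question.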
I think averaging embeddings doesn't work for EO embeddings. It works for word tokens because they are mostly monosemantic, so an average highlights the traits of a single semantic (e.g. king - man + woman: "king" plausibly depends on the semantics of country, man, royal, ... so when you subtract "man" and add "woman", it seems logical to arrive close to "queen"). In remote sensing, however, we create a token per image. If one chip has a green field and a blue pool, the embedding needs to contain all of that. Now imagine it's a lone house in a field: the average embedding of the area will greatly diffuse the blue pool. In fact, in our v0.1 we will greatly diffuse that pool in the RGB average, since it will only appear on the blue band. Polysemantic pruning, then, is the method to remove the green-field semantics from the embedding, so that cosine or Euclidean similarity indeed measures distance along the dimensions we care about.
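A hedged sketch of what that pruning could look like: mask out the embedding dimensions that vary with the irrelevant semantic before measuring similarity. How those dimensions are identified is exactly the open question; here they come from a hypothetical set of field-only chips, which is my assumption:

```python
import numpy as np


def pruned_cosine(query, candidates, field_embeddings, keep_fraction=0.5):
    """Down-weight dimensions with high variance across field-only chips."""
    variance = np.var(field_embeddings, axis=0)
    # Keep the dimensions least explained by the unwanted "field" semantic.
    keep = variance <= np.quantile(variance, keep_fraction)
    q = query[keep]
    c = candidates[:, keep]
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return c @ q  # cosine similarity in the pruned subspace
```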
https://arxiv.org/abs/2003.04151 "Embedding Propagation: Smoother Manifold for Few-Shot Classification". They claim embedding propagation can help with "distribution shift", where the training data isn't distributed like the test set. Not sure whether this could be helpful or how easily it could be applied to geospatial data, but if it works, it's perhaps an alternative to having to continue model training on a specific region like Bali.
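As I understand the paper, the core operation is to smooth each embedding toward its neighbors on a similarity graph. A rough numpy sketch of that idea (the kernel scale and alpha are my assumptions, not values from the paper):

```python
import numpy as np


def propagate(Z, alpha=0.5, scale=1.0):
    """Z: (n, d) embeddings. Returns smoothed embeddings of the same shape."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dists / scale)       # RBF similarity graph
    np.fill_diagonal(A, 0.0)            # no self-loops
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))     # symmetrically normalized adjacency
    P = np.linalg.inv(np.eye(len(Z)) - alpha * L)
    return P @ Z
```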
Tagging here a similar issue with quarries, where we are not getting the expected semantic similarity.
Posting here some general notes as we explore this issue:
Basically, after A LOT of fancy exploring, the most effective approach is simply cosine similarity (see the sketch after these notes). More details:
Notes:
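For reference, a minimal sketch of that plain-cosine approach, assuming an `embeddings` array of shape (n, d) with one row per chip (the names are placeholders):

```python
import numpy as np


def top_k_similar(embeddings, query_idx, k=20):
    """Return the indices of the k chips most cosine-similar to the query chip."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = E @ E[query_idx]
    order = np.argsort(-scores)
    return [i for i in order if i != query_idx][:k]
```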
Closing as out of date; feel free to re-open if appropriate.
The main practical use case of Clay as of now, and the center of the upcoming app, is the ability to find similar features. Think: 1) Click on a pool, 2) find more potential examples, 3) confirm/reject candidates, 4) iterate until you are happy.
The current chip size (512 pixels, or ~5120 m in Sentinel) is much larger than most semantics, as is even the patch size (32 pixels, or ~320 m), so the corresponding embeddings will incorporate the many semantics present on the chip/patch. These mixed semantics make similarity search (e.g. cosine) and other tools of limited use, since they look at all dimensions. I believe we need a way to both:
This might take the shape of a "decoder" that either plugs into the encoder or, better, takes embeddings as input. Ideally, this decoder is agnostic of the label or location, and needs no training at inference time (so that the app can use it easily).
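One hypothetical shape for such a decoder, sketched from the confirm/reject loop described above (this is my assumption, not the proposal itself): re-weight embedding dimensions by how strongly they separate confirmed from rejected chips. It's a closed-form computation over the feedback, so there is no model training at inference time:

```python
import numpy as np


def feedback_weights(confirmed, rejected, eps=1e-8):
    """confirmed, rejected: (n, d) arrays of chip embeddings from user feedback."""
    gap = np.abs(confirmed.mean(axis=0) - rejected.mean(axis=0))
    return gap / (gap.sum() + eps)


def weighted_cosine(query, candidates, w):
    """Cosine similarity after stretching each dimension by its weight."""
    q = query * np.sqrt(w)
    c = candidates * np.sqrt(w)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return c @ q
```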
cc @yellowcap, @geohacker and @srmsoumya for ideas.