
A lot of noise in semantic search #174

Closed
ironllamagirl opened this issue Mar 27, 2020 · 7 comments

@ironllamagirl

Hi.
Thank you for this great package!
I am trying to use the semantic search example to detect sentences belonging to specific topics. I translated the different topics into query sentences to use with the semantic search.

The problem is that I am getting a lot of noise in the results. Many of the 'matching' sentences have nothing to do with the query, yet they still end up with a smaller distance than sentences that are actually related. Is there a way to avoid this?
I am trying to solve this because the use case I am working on requires as few noise sentences as possible, ideally none. I tried playing with the distance threshold, but it is hard to guess which value works best.

My initial idea was to apply a simple keyword search on the results of the semantic search to eliminate the noise. However, since no list of keywords can be completely exhaustive, I am afraid of losing many 'good' sentences that are semantically similar to the query but don't contain any of the keywords I choose.

Another potential approach is to train a model to tell whether a sentence belongs to a certain topic or not; I haven't tried this yet. Could you please share your opinion/suggestions on this? I would really appreciate it.
Thanks!

@nreimers

Hi @ironllamagirl
Can you provide some more information? Which model did you try? What type of data do you have?

What you describe matches my own experience with sentence embeddings (I have tried all the common methods): you get a lot of noise.

You can characterize methods by their false positive and false negative rates:
False positive: a dissimilar pair gets a high score and is falsely included in the top 10.
False negative: a similar pair gets a low score and is not returned in the top 10 results.

TF-IDF / BM25 has a low false positive rate but a high false negative rate, i.e., the results it finds are usually relevant, but it sadly misses a lot of relevant pairs.

Sentence embeddings have the opposite characteristic: a low false negative rate, i.e., they find nearly all relevant pairs, but a high false positive rate, i.e., the top results often contain noisy, non-relevant matches.

I am currently evaluating different approaches for question-based semantic similarity search. I hope I can share some data and results from these experiments soon (~1 month).

If computationally feasible, I think the best approach is a two-step approach:
Step 1) Retrieval: retrieve the top-100 matches with BM25 and with sentence embedding search.
Step 2) Filtering: if possible, use BERT (or similar) to score every (query, candidate_i) pair and select the top-10 results.

For step 2 you would need some training data.

For retrieval with sentence embeddings, it could also make sense to train a model with triplet loss if you have some training data.
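A minimal sketch of what such triplet-loss training could look like with this package; the model choice, hyperparameters, and the example triplet are purely illustrative:

```python
# Triplet-loss fine-tuning sketch; model name and the triplet are made up.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a sentence-embedding model from a plain BERT with mean pooling.
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Each training example is an (anchor, positive, negative) triplet.
train_examples = [
    InputExample(texts=[
        "The company cut its CO2 emissions by 20%.",        # anchor
        "New recycling targets were announced this year.",  # positive (same topic)
        "Quarterly revenue rose to $3.2 billion.",           # negative (off-topic)
    ]),
    # ... more triplets
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# Pushes each anchor closer to its positive than to its negative by a margin.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```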

Best
Nils Reimers

@ironllamagirl

Hi @nreimers
Thank you for your response.
The task I have at hand is to retrieve text about how companies deal with environmental issues/risks from different text data sources, and ideally to match the sentences to the different environment-related issues.
The data sample I am working with so far consists of news articles about a few companies, some of which discuss environment-related issues.
I started by using the bert-base-nli-mean-tokens model directly, just like in the semantic search example, and got a lot of noise.
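For context, this is roughly the setup I used (the corpus and query here are placeholders):

```python
# Roughly the semantic search example: rank corpus sentences by cosine distance.
import scipy.spatial
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

corpus = ["sentence one ...", "sentence two ...", "sentence three ..."]
queries = ["How does the company handle environmental risks?"]

corpus_embeddings = model.encode(corpus)
query_embeddings = model.encode(queries)

# Cosine distance between each query and every corpus sentence; smaller = closer.
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist(
        [query_embedding], corpus_embeddings, "cosine")[0]
    for sentence, distance in sorted(zip(corpus, distances), key=lambda x: x[1])[:5]:
        print(f"{distance:.4f}\t{sentence}")
```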

As an attempt to improve the results, I tried fine-tuning: I fine-tuned the 'bert-base-uncased' model on my custom data using the triplet loss method, following the anchor/positive/negative schema.
I spent some time labeling sentences (related to the environment or not), then generated all non-repeating pairs from the sentences labeled 'related'; these pairs provide the 'anchor' and 'positive' sentences. I then used the 'unrelated' sentences as the 'negative' ones.
The results showed less noise, but noise still makes up a relatively large percentage of the results.
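The pair generation looked roughly like this (a sketch; the variable names and sentences are mine):

```python
# Sketch of generating (anchor, positive, negative) triplets from labeled sentences.
import itertools
import random

related = ["env sentence A", "env sentence B", "env sentence C"]   # labeled 'related'
unrelated = ["off-topic sentence X", "off-topic sentence Y"]       # labeled 'unrelated'

triplets = []
# All non-repeating pairs of related sentences give (anchor, positive) ...
for anchor, positive in itertools.combinations(related, 2):
    # ... each paired with an unrelated sentence drawn as the negative.
    triplets.append((anchor, positive, random.choice(unrelated)))
```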

I am actually not very familiar with the BM25 method for semantic search. Is its only objective to compute TF-IDF-like scores for documents? There don't seem to be many examples of this method online.

In step 1, I assume 'sentence embedding search' is what is implemented in the semantic search example, am I right? So you suggest appending the results from both TF-IDF and sentence embedding search together and then doing the filtering?
Could you please explain in more detail how the filtering is done using BERT? By 'score', do you mean computing the distance between query and candidate? Isn't that what the sentence embedding search is doing as well?

Thanks.

@nreimers

Hi @ironllamagirl
Here are some papers you might find interesting:

https://arxiv.org/abs/1907.04780
https://arxiv.org/abs/2002.08909
https://openreview.net/forum?id=rkg-mA4FDr
https://arxiv.org/abs/1905.01969
https://arxiv.org/abs/1811.08008

Also a project that might be interesting for you:
https://github.com/koursaros-ai/nboost

BM25: BM25 is similar to TF-IDF, but often works much better, as it takes different document lengths into consideration. Elasticsearch (which I can highly recommend) uses BM25 to index and find documents.

An approach that works really well is the one implemented in NBoost: neural re-ranking.

The idea is you have two phases: A retrieval phase and a re-ranking phase.

In the retrieval phase, you get, for example, 100 hits. You could split this into 50 hits from Elasticsearch (BM25) and 50 hits from semantic search with sentence embeddings, or you could just get all 100 hits with BM25 from Elasticsearch.
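A sketch of such a split retrieval, using the rank_bm25 package as a lightweight stand-in for Elasticsearch (all names, sentences, and cut-offs here are illustrative):

```python
# Hybrid retrieval sketch: top-50 BM25 hits + top-50 embedding hits, deduplicated.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = ["sentence one ...", "sentence two ...", "sentence three ..."]
query = "How does the company handle environmental risks?"

# BM25 side (rank_bm25 stands in for Elasticsearch here).
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())
bm25_top = np.argsort(bm25_scores)[::-1][:50]

# Embedding side: cosine similarity between query and corpus embeddings.
model = SentenceTransformer('bert-base-nli-mean-tokens')
corpus_emb = np.asarray(model.encode(corpus))
query_emb = np.asarray(model.encode([query]))[0]
cos = corpus_emb @ query_emb / (
    np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb))
emb_top = np.argsort(cos)[::-1][:50]

# The union of both candidate sets goes to the re-ranking step.
candidates = sorted(set(bm25_top.tolist()) | set(emb_top.tolist()))
```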

In the second step, you apply a more complex model: the re-ranker.

This re-ranker gets as input (query, hit1), (query, hit2), ..., (query, hit100). For each pair, it outputs a value between 0 and 1 indicating how relevant the pair is. NBoost uses BERT for this, previously trained on suitable data. It ships several pre-trained models, which should generalize quite well to other domains.

The final results are then the top-10 pairs that got the highest score from the re-ranker.
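One way to sketch this in Python; the CrossEncoder class and the MS MARCO model name are assumptions on my part, and any BERT-style pair scorer trained on relevance data would do:

```python
# Re-ranking sketch with a cross-encoder (model name is one illustrative choice).
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "How does the company handle environmental risks?"
hits = ["candidate sentence 1", "candidate sentence 2"]  # the ~100 retrieval candidates

# One relevance score per (query, hit) pair.
scores = reranker.predict([(query, hit) for hit in hits])

# Final results: the top-10 pairs with the highest re-ranker score.
top10 = [hits[i] for i in np.argsort(scores)[::-1][:10]]
```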

You can find more details on re-ranking here:
https://arxiv.org/pdf/1901.04085.pdf

Best
Nils Reimers

@ironllamagirl

Hi Nils,

Thank you very much for these resources. Very helpful.
I ended up applying BM25 as a second 're-ranker'. It reduced the noise marginally, but at the cost of losing some good sentences. I may build a manually labeled dataset for re-ranking in the future.
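Roughly what I did, as a sketch (using the rank_bm25 package; the retrieval step is the embedding search from earlier, and the sentences are placeholders):

```python
# Sketch: re-rank embedding-search hits with BM25 scores against the query.
import numpy as np
from rank_bm25 import BM25Okapi

query = "How does the company handle environmental risks?"
emb_hits = ["hit one ...", "hit two ...", "hit three ..."]  # top-N from embedding search

# BM25 index over just the retrieved hits, then re-score them against the query.
bm25 = BM25Okapi([hit.lower().split() for hit in emb_hits])
scores = bm25.get_scores(query.lower().split())
reranked = [emb_hits[i] for i in np.argsort(scores)[::-1]]
```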

Thank you again! I'm closing the issue for now.

@braaannigan

Hi @nreimers @ironllamagirl

Just want to say that I've built a semantic search engine using this wonderful package without re-ranking. I'm not able to release the code at the moment, but wanted to share some pointers:

  • I'm using 'distilbert-base-nli-stsb-mean-tokens'. I haven't tried the others, but this works for me.
  • I use nmslib to build the index with the hnsw method and a cosine similarity metric (a sketch follows this list).
  • Think about how well your query sentence embeddings match up with your sentence embeddings. This post is excellent at describing how the semantic units of your queries and sentences need to align (maybe you need sub-sentence embeddings): https://hanxiao.io/2019/07/29/Generic-Neural-Elastic-Search-From-bert-as-service-and-Go-Way-Beyond/
    If you want to visualise this, use UMAP to map your sentence and query embeddings to 2D and see whether the search terms line up close to their target sentences.
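A minimal sketch of the nmslib setup described above (index parameters and sentences are illustrative, not tuned values):

```python
# HNSW index over cosine similarity with nmslib.
import nmslib
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
sentences = ["sentence one ...", "sentence two ...", "sentence three ..."]
embeddings = np.asarray(model.encode(sentences))

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(embeddings)
index.createIndex({'M': 16, 'efConstruction': 200}, print_progress=False)

# k nearest sentences to the query; distances are cosine distances.
query_emb = np.asarray(model.encode(["environmental risks"]))[0]
ids, distances = index.knnQuery(query_emb, k=3)
for i, d in zip(ids, distances):
    print(f"{d:.4f}\t{sentences[i]}")
```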

@km5ar

km5ar commented Feb 6, 2023

@nreimers

Hi Nils, any update on your evaluation/experiments with the different approaches for question-based semantic similarity search?

@km5ar

km5ar commented Feb 6, 2023

@ironllamagirl

So you first use sentence-transformers for similarity search and then use BM25 for re-ranking?
Would you mind sharing the steps or code?

Thanks!
