Retrieve the top-𝑘 documents with respect to a given query by maximal inner product over dense and sparse vectors

zuliani99/Sparse-Dense_Retrieval

Sparse-Dense_Retrieval

Retrieve the top-𝑘 documents with respect to a given query by maximal inner product over dense and sparse vectors. This problem is solved by breaking the maximal inner product search (MIPS) into two smaller MIPS problems:

  • Retrieve the top-𝑘' documents from a sparse retrieval system defined over the sparse portion of the vectors
  • Retrieve the top-𝑘' documents from a dense retrieval system defined over the dense portion of the vectors

The two candidate sets are then merged, and the top-𝑘 documents are retrieved from the combined (much smaller) set. As 𝑘' approaches infinity, the final top-𝑘 becomes exact, with the drawback that retrieval becomes much slower.
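
The two-stage idea above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code from this repository: the function names (`top_k`, `hybrid_top_k`, `exact_top_k`) are hypothetical, the sparse vectors are stored as ordinary arrays for simplicity, and the merged candidates are assumed to be rescored with the full (dense plus sparse) inner product.

```python
import numpy as np


def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k largest scores, best first."""
    k = min(k, scores.shape[0])
    # Stable sort on the negated scores: ties go to the lower index.
    return np.argsort(-scores, kind="stable")[:k]


def hybrid_top_k(q_dense, q_sparse, docs_dense, docs_sparse, k, k_prime):
    """Approximate top-k by solving two smaller MIPS problems (hypothetical helper)."""
    # Stage 1: top-k' candidates from each retrieval system on its own.
    dense_candidates = top_k(docs_dense @ q_dense, k_prime)
    sparse_candidates = top_k(docs_sparse @ q_sparse, k_prime)
    # Stage 2: merge the candidate sets (sorted, duplicates removed).
    candidates = np.union1d(dense_candidates, sparse_candidates)
    # Assumption: candidates are rescored with the full inner product.
    full_scores = (docs_dense[candidates] @ q_dense
                   + docs_sparse[candidates] @ q_sparse)
    return candidates[top_k(full_scores, k)]


def exact_top_k(q_dense, q_sparse, docs_dense, docs_sparse, k):
    """Ground truth: score every document with the full inner product."""
    return top_k(docs_dense @ q_dense + docs_sparse @ q_sparse, k)
```

With `k_prime` equal to the number of documents, every document becomes a candidate and `hybrid_top_k` returns the same ranking as `exact_top_k`; smaller values trade accuracy for a smaller merged set.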

The datasets we use are nfcorpus and scifact.

Application Workflow

  • Download the chosen dataset using BEIR
  • Pre-process the text of the queries and documents
  • Compute the sparse embeddings using either the Elasticsearch implementation of BM25 or the repository's own implementation
  • Compute the dense embeddings using Sentence-BERT
  • Compute the ground-truth scores and the document ranking at k for each query
  • Obtain the merged embedding from the dense and sparse representations at k'
  • Compute the results for the ground truth at k and for the merged version at k
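
For the last step, one plausible way to compare the merged ranking against the ground truth is recall at k: the fraction of the exact top-k documents that the merged top-k also contains. The sketch below assumes this metric and uses hypothetical names (`recall_at_k`, `mean_recall_at_k`); the metric actually reported in the results may differ.

```python
def recall_at_k(ground_truth, retrieved, k):
    """Fraction of the exact top-k that also appears in the merged top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    return len(truth & set(retrieved[:k])) / len(truth)


def mean_recall_at_k(ground_truth_by_query, retrieved_by_query, k):
    """Average recall at k over all queries that have a ground-truth ranking."""
    if not ground_truth_by_query:
        return 0.0
    total = 0.0
    for query_id, ground_truth in ground_truth_by_query.items():
        # A query with no retrieved documents contributes a recall of 0.
        retrieved = retrieved_by_query.get(query_id, [])
        total += recall_at_k(ground_truth, retrieved, k)
    return total / len(ground_truth_by_query)
```

Both functions take ranked lists of document ids, best first, so they can be applied to the exact ranking and to the merged ranking obtained for any value of k'.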

Results

  • scifact dataset results
  • nfcorpus dataset results
