Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full text search (FTS) indices #1195

Open
4 of 7 tasks
Tracked by #2079
eddyxu opened this issue Aug 31, 2023 · 4 comments
Open
4 of 7 tasks
Tracked by #2079

Full text search (FTS) indices #1195

eddyxu opened this issue Aug 31, 2023 · 4 comments
Assignees
Labels
arrow Apache Arrow related issues rust Rust related tasks

Comments

@eddyxu
Copy link
Contributor

eddyxu commented Aug 31, 2023

Sept 9:

Sept 2:

Aug 26th
Reduce index file size and improve the indexing performance

Given that we have https://github.com/lancedb/tantivy-object-store ready now, we can start to integrate tantive FTS into the rust core, and offer FTS to js/python/rust bindings.

Because we need to work on a variety of storage systems, we will likely need to vendor and adapt tantivy to meet our needs. Many of the components, such as the tokenizer and scoring can be re-used as is.

@eddyxu eddyxu added arrow Apache Arrow related issues rust Rust related tasks labels Aug 31, 2023
@wjones127 wjones127 changed the title [Rust] Integrate with Tantive Rust crate Full text search (FTS) indices Mar 12, 2024
@wjones127 wjones127 added this to the (WIP) Lance Roadmap milestone Mar 12, 2024
@wjones127 wjones127 mentioned this issue Mar 15, 2024
20 tasks
@wjones127
Copy link
Contributor

Maybe worth a look when we implement this: https://github.com/huggingface/tokenizers

@wjones127
Copy link
Contributor

Got some user feedback on potential API ideas we might want: https://discord.com/channels/1030247538198061086/1197630499926057021/1238721206006317066

@BubbleCal
Copy link
Contributor

BubbleCal commented Jul 15, 2024

What is this for

With the capability of full text search, we can retrieve the document data more efficient, and with BM25 we can rank the results to reach better retrieval quality.

Design

The index consists of 3 parts: TokenSet, InvertedList and DocSet. We will store them on 3 individual files.

  • TokenSet is basically a map from token(word) to token_id and the frequency of it.
  • InvertedList records the row_id and frequency each token occurs for fast retrieving
  • DocSet records the number of tokens for each row, which is used for BM25 scoring

We divide the index structure into the three files because it allows us to minimize IO:

  • For appending, we need to load only the previous TokenSet and generate InvertedList and DocSet for the new data
  • For filtering, we need to load only the TokenSet and InvertedList
  • For filtering and ranking with BM25, we need to load all the files

Filtering

the execution plan is:
image
The filtering generates a RowIdMask that presents the rows should be included, the FTS node would predicate each row by RowIdMask and takes k documents with highest scores within the included rows.

For now, we support only to do either FTS or vector search, in the future, we may add a rerank node to score the rows outputed by FTS & vector search to gain higher retrieval quality

Updates

As described above, the index consists of three parts, we'd copy the TokenSet from previous built index, then update it if needs. The TokenSet is almost a dictionary of words, so it won't be costly to copy it.

The InvertedList would be constructed by the new TokenSet as indexing, so does the DocSet.

TODO items

Features

Low Priority

  • Brute-force search if index not created
  • Index on multiple fields
  • Query on multiple fields

APIs

Docs

  • Benchmark
  • Docs
  • Examples

Additional items

  • More languages support

@BubbleCal BubbleCal self-assigned this Jul 15, 2024
@BubbleCal
Copy link
Contributor

BubbleCal commented Jul 15, 2024

To get it work as soon as possible, I haven't integrated it into the filter expression, instead, just added a new interface to execute the full text search, may remove this interface once we get the parser ready. Here is a Python example:

import random
import lance
import pyarrow as pa
import string
import tempfile

# generate dataset
n = 1000
ids = range(n)
docs = ["".join(random.choices(string.ascii_letters, k=5)) for _ in range(n)]

id_array = pa.array(ids, type=pa.int64())
# the inverted index supports large string array only
doc_array = pa.array(docs, type=pa.large_string())

table = pa.table({"id": id_array, "doc": doc_array})
temp_dir = tempfile.mkdtemp()
dataset = lance.write_dataset(table, temp_dir)
dataset.create_scalar_index("doc", "INVERTED")

results = dataset.scanner(
    ["id", "doc"],
    limit=10,
    full_text_query=docs[0],
).to_table()
print(results)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Apache Arrow related issues rust Rust related tasks
Projects
None yet
Development

No branches or pull requests

6 participants