Full text search (FTS) indices #1195
Maybe worth a look when we implement this: https://github.com/huggingface/tokenizers
Got some user feedback on potential API ideas we might want: https://discord.com/channels/1030247538198061086/1197630499926057021/1238721206006317066
What is this for
Full text search lets us retrieve document data more efficiently, and BM25 lets us rank the results for better retrieval quality.
Design
The index consists of 3 parts:
We divide the index structure into three files because it allows us to minimize IO:
Filtering
The execution plan is:
For now, a query can use either FTS or vector search, but not both. In the future, we may add a rerank node that scores the rows output by FTS and vector search to gain higher retrieval quality.
Updates
As described above, the index consists of three parts; we'd copy the
The TODO items
Features
Low Priority
APIs
Docs
Additional items
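The BM25 ranking mentioned in the design above can be illustrated with a small in-memory sketch. This is not the Lance implementation: `tokenize`, `build_index`, and `bm25_search` are hypothetical names, the whitespace tokenizer is a stand-in for a real one, and `k1=1.2`, `b=0.75` are commonly used defaults assumed here rather than values taken from this issue.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """Build a toy inverted index.

    Returns (postings, doc_lengths), where postings maps
    token -> {doc_id: term frequency in that document}.
    """
    postings = defaultdict(dict)
    doc_lengths = []
    for doc_id, text in enumerate(docs):
        tokens = tokenize(text)
        doc_lengths.append(len(tokens))
        for token, tf in Counter(tokens).items():
            postings[token][doc_id] = tf
    return postings, doc_lengths


def bm25_search(query, postings, doc_lengths, k1=1.2, b=0.75, limit=10):
    """Score documents against the query with BM25 and return the top hits."""
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths) / n_docs
    scores = defaultdict(float)
    for token in tokenize(query):
        matches = postings.get(token, {})
        if not matches:
            continue
        df = len(matches)  # number of documents containing the token
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in matches.items():
            # Length normalization: longer documents are penalized.
            norm = k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    # Highest score first, truncated to the requested limit.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]


docs = [
    "lance full text search",
    "vector search in lance",
    "bm25 ranks full text results",
]
postings, doc_lengths = build_index(docs)
hits = bm25_search("full text", postings, doc_lengths)
print(hits)
```

Only documents that contain at least one query token receive a score, which is why an inverted index (token to matching rows) is enough to answer the query without scanning every document.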
To get it working as soon as possible, I haven't integrated it into the filter expression. Instead, I added a new interface to execute the full text search; we may remove this interface once the parser is ready. Here is a Python example:
import random
import lance
import pyarrow as pa
import string
import tempfile
# generate dataset
n = 1000
ids = range(n)
docs = ["".join(random.choices(string.ascii_letters, k=5)) for _ in range(n)]
id_array = pa.array(ids, type=pa.int64())
# the inverted index supports large string array only
doc_array = pa.array(docs, type=pa.large_string())
table = pa.table({"id": id_array, "doc": doc_array})
temp_dir = tempfile.mkdtemp()
dataset = lance.write_dataset(table, temp_dir)
dataset.create_scalar_index("doc", "INVERTED")
results = dataset.scanner(
    ["id", "doc"],
    limit=10,
    full_text_query=docs[0],
).to_table()
print(results)
Sept 9:
Sept 2:
Aug 26th
Reduce index file size and improve the indexing performance
Given that we have https://github.com/lancedb/tantivy-object-store ready now, we can start to integrate tantivy FTS into the Rust core and offer FTS to the JS/Python/Rust bindings.
Because we need to work on a variety of storage systems, we will likely need to vendor and adapt tantivy to meet our needs. Many of the components, such as the tokenizer and scoring, can be re-used as is.