Full text search (FTS) indices #1195

eddyxu · 2023-08-31T18:31:09Z

Sept 9:

https://github.com/lancedb/sophon/issues/2381 @BubbleCal

Sept 2:

https://github.com/lancedb/sophon/issues/2353 @BubbleCal
Add FTS to reindexer and add tests to SaaS integration tests: https://github.com/lancedb/sophon/pull/2301 @BubbleCal
https://github.com/lancedb/sophon/issues/2381 @BubbleCal

Aug 26:

Reduce index file size and improve indexing performance

Reduce index file size to reduce cold latency: "chore: remove global frequency of tokens and sort tokens lexicographically" #2786
Improve FTS index loading performance: "perf: concurrent loading FTS index files" #2787
Improve indexing performance (less than 15 min for the 22 GB MS MARCO dataset): "perf: parallelize FTS indexing" #2807
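
For readers unfamiliar with the idea behind "concurrent loading FTS index files", here is a minimal sketch of the general technique using only the Python standard library. It is not the code from #2787. The helper names (`load_file`, `load_index_files`) and the file names in the demo are hypothetical, and the real index layout is not shown in this thread.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_file(path):
    """Read one index file fully into memory (hypothetical helper)."""
    return Path(path).read_bytes()


def load_index_files(paths, max_workers=8):
    """Load all files concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, even if reads finish out of order
        return list(pool.map(load_file, paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # File names below are made up for the demo only.
        paths = []
        for name in ("tokens", "postings", "docs"):
            p = Path(tmp) / f"{name}.bin"
            p.write_bytes(name.encode())
            paths.append(p)
        blobs = load_index_files(paths)
        print([len(b) for b in blobs])
```

Threads suit this case because reads are I/O-bound, so the loads can overlap instead of running one after another. This matters most on object storage, where each request has high latency.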

Now that https://github.com/lancedb/tantivy-object-store is ready, we can start integrating tantivy FTS into the Rust core and offer FTS through the JS/Python/Rust bindings.

Because we need to support a variety of storage systems, we will likely need to vendor and adapt tantivy to meet our needs. Many of its components, such as the tokenizer and scoring, can be reused as is.
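
To make the pieces named above concrete (tokenizer, inverted index, scoring), here is a toy sketch in pure Python. It is not tantivy's or Lance's implementation. The class `ToyInvertedIndex`, the simple alphanumeric tokenizer, and the BM25 parameters `k1=1.2`, `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (stand-in tokenizer)."""
    token, tokens = [], []
    for ch in text.lower():
        if ch.isalnum():
            token.append(ch)
        elif token:
            tokens.append("".join(token))
            token = []
    if token:
        tokens.append("".join(token))
    return tokens


class ToyInvertedIndex:
    """Hypothetical in-memory inverted index with BM25 ranking."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)  # token -> {doc_id: term frequency}
        self.doc_len = {}  # doc_id -> number of tokens

    def add(self, doc_id, text):
        counts = Counter(tokenize(text))
        self.doc_len[doc_id] = sum(counts.values())
        for token, tf in counts.items():
            self.postings[token][doc_id] = tf

    def search(self, query, limit=10):
        n = len(self.doc_len)
        if n == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n
        scores = defaultdict(float)
        for token in set(tokenize(query)):
            docs = self.postings.get(token, {})
            if not docs:
                continue
            # BM25 inverse document frequency (always positive in this form)
            idf = math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        # highest score first; ties broken by doc_id for determinism
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


if __name__ == "__main__":
    index = ToyInvertedIndex()
    index.add(0, "lance full text search")
    index.add(1, "vector search in lance")
    index.add(2, "tantivy tokenizer and scoring")
    print(index.search("full text search"))
```

The tokenizer and the scoring function never touch storage. Only the postings and document lengths would have to live in files on whatever storage system is in use, which fits the point above about reusing those components as is.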

wjones127 · 2024-05-03T21:12:49Z

Maybe worth a look when we implement this: https://github.com/huggingface/tokenizers

wjones127 · 2024-05-13T16:23:29Z

Got some user feedback on potential API ideas we might want: https://discord.com/channels/1030247538198061086/1197630499926057021/1238721206006317066

BubbleCal · 2024-07-15T16:34:34Z

BubbleCal · 2024-07-15T17:11:33Z

To get it working as soon as possible, I haven't integrated it into the filter expression. Instead, I added a new interface for executing full text search, which we may remove once the parser is ready. Here is a Python example:

import random
import lance
import pyarrow as pa
import string
import tempfile

# generate dataset
n = 1000
ids = range(n)
docs = ["".join(random.choices(string.ascii_letters, k=5)) for _ in range(n)]

id_array = pa.array(ids, type=pa.int64())
# the inverted index only supports large string arrays
doc_array = pa.array(docs, type=pa.large_string())

table = pa.table({"id": id_array, "doc": doc_array})
temp_dir = tempfile.mkdtemp()
dataset = lance.write_dataset(table, temp_dir)
dataset.create_scalar_index("doc", "INVERTED")

results = dataset.scanner(
    ["id", "doc"],
    limit=10,
    full_text_query=docs[0],
).to_table()
print(results)

eddyxu assigned westonpace, chebbyChefNEQ and wjones127 Aug 31, 2023

eddyxu added the arrow (Apache Arrow related issues) and rust (Rust related tasks) labels Aug 31, 2023

wjones127 changed the title ~~[Rust] Integrate with Tantive Rust crate~~ Full text search (FTS) indices Mar 12, 2024

wjones127 added this to the (WIP) Lance Roadmap milestone Mar 12, 2024

wjones127 mentioned this issue Mar 15, 2024

Roadmap 2024 #2079

Open

20 tasks

BubbleCal mentioned this issue Jul 15, 2024

feat: integrate inverted index into lance index APIs #2577

Merged

BubbleCal self-assigned this Jul 15, 2024

This was referenced Aug 26, 2024

Tantivy FTS support at the Rust level lancedb/lancedb#672

Open

feat: support fuzzy matching in full text search lancedb/lancedb#1563

Closed

QianZhu removed this from the (WIP) Lance Roadmap milestone Aug 27, 2024
