feat: integrate inverted index into lance index APIs #2577

BubbleCal · 2024-07-09T12:35:13Z

support to create inverted index
support to query with inverted index
python binding
push down limit
roadmap: Full text search (FTS) indices #1195

codecov-commenter · 2024-07-09T12:52:58Z

Codecov Report

Attention: Patch coverage is 41.73765% with 342 lines in your changes missing coverage. Please review.

Project coverage is 78.99%. Comparing base (b45393e) to head (9dbd6b8).

Files	Patch %	Lines
rust/lance/src/io/exec/fts.rs	0.00%	104 Missing ⚠️
rust/lance-index/src/scalar/inverted.rs	64.17%	86 Missing and 10 partials ⚠️
rust/lance/src/dataset/scanner.rs	28.68%	84 Missing and 8 partials ⚠️
rust/lance/src/io/exec/utils.rs	57.14%	0 Missing and 15 partials ⚠️
rust/lance-index/src/scalar.rs	33.33%	14 Missing ⚠️
rust/lance-index/src/scalar/bitmap.rs	0.00%	8 Missing ⚠️
rust/lance/src/index/scalar.rs	30.00%	4 Missing and 3 partials ⚠️
rust/lance-index/src/prefilter.rs	33.33%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2577      +/-   ##
==========================================
- Coverage   79.35%   78.99%   -0.36%     
==========================================
  Files         213      215       +2     
  Lines       62521    62904     +383     
  Branches    62521    62904     +383     
==========================================
+ Hits        49614    49693      +79     
- Misses       9996    10302     +306     
+ Partials     2911     2909       -2

Flag	Coverage Δ
unittests	`78.99% <41.73%> (-0.36%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

eddyxu · 2024-07-13T10:05:48Z

rust/lance/src/dataset/scanner.rs

@@ -355,6 +361,15 @@ impl Scanner {
        Ok(self)
    }

+    pub fn full_text_search(&mut self, column: &str, query: &str) -> Result<&mut Self> {


lets have a docstring for the format of the query?

IIRC, tantivy does not need to provide column. It covers all columns in the index.

tantivy requires to specify the index when search, and have to specify the schema when creating index, so the index know which fields to search
btw, should we change this to columns: IntoIter<AsStr>?

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

- with this we don't need to sort all the matched docs - also optimizes the BM25 calculation Signed-off-by: BubbleCal <bubble-cal@outlook.com>

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

westonpace

I think there we are mixing two different concepts in an odd way.

The two concepts are "scan filter" and "nearest search".

A "scan filter" fully scans the dataset and returns all rows that satisfy some kind of filter.

A "nearest search" uses an index to determine the K closest row ids and then takes those row ids from the data.

We currently have full text search implemented as some kind of combination between the two. It is a "scan filter" but the filter is "the row is within the top K matches of the query".

Everything sort of works but if you specify either a vector search (nearest) or a standard filter (filter) then I think they will combine in unexpected ways.

If a vector search is specified then the full text search will happen first and the vector search will only search the row ids that match the full text search. We should probably either return an error or do some kind of (configurable) reranking (this would be quite complicated).

If a filter is specified then the filter will be a post-filter. In other words, if the user searches for "full_text_query='hello', filter='year>2020',limit=10" then it will first find the 10 best matches for the full text search and then filter those by year>2020. We might miss matches that include "hello" but score lower than other matches that come before 2020 (false negatives are a common problem with prefiltering). So, for today, we might want to just return an error if a filter and full_text_query are present.

Long term, I think we need to treat "full text search" as a completely independent query plan node.

westonpace · 2024-07-18T13:10:59Z

python/python/lance/dataset.py

+            if not pa.types.is_large_string(field.type):
+                raise TypeError(
+                    f"INVERTED index column {column} must be a large string"
+                )


Do we only support FTS on large string and not string? That seems a significant limitation.

It's possible to support string, support large_string first cause it can cover more cases

I agree we should make supporting string high priority. (We should probably also support string view as well)

added string

westonpace · 2024-07-18T13:16:55Z

python/python/lance/dataset.py

@@ -418,6 +425,8 @@ def to_table(
            Return row ID.
        use_stats: bool, default True
            Use stats pushdown during filters.
+        full_text_query: (str, str), optional


This documentation is a little confusing. It seems to imply the input can be a tuple (str, str) but in the code it looks like you accept str or dict. However, later in the rust code, it looks like you only really accept dict.

If that's the case we should change this to dict and be clear what the keys are and what each one means. I think, from inspection, you want a query key (the query string) and a columns key (the columns to search).

However, it's also not clear what kind of search is performed. Is this a "contains" query? Is it searching for all strings that contain the given query string? Can the query string have spaces? If it does are these treated as multiple words or a combined string? For example, will the query "a b" match the string "b a"?

Fixed that doc, it would accept str or dict

it would search for documents that contains any words of query string, and rank them with BM25. Say "a b" can match docs contain "a" or "b" or "b a"

westonpace · 2024-07-18T13:18:17Z

python/python/lance/dataset.py

+                builder = builder.full_text_search(full_text_query)
+            else:
+                builder = builder.full_text_search(**full_text_query)


I think, from looking at the rust code, the full_text_search method only accepts a dict. In that case I don't think you need to pass in **full_text_query and can just pass in full_text_query.

Also, if the user provides a str won't the call on line 343 fail?

The builder's method is:

def full_text_search( self, query: str, columns: Optional[List[str]] = None, ) -> ScannerBuilder: """ Filter rows by full text searching. *Experimental API*, may remove it after we support to do this within `filter` SQL-like expression Must create inverted index on the given column before searching, """ self._full_text_query = {"query": query, "columns": columns} return self

so it won't fail.

westonpace · 2024-07-18T13:19:52Z

python/python/lance/dataset.py

@@ -1105,7 +1117,12 @@ def cleanup_old_versions(
    def create_scalar_index(
        self,
        column: str,
-        index_type: Union[Literal["BTREE"], Literal["BITMAP"], Literal["LABEL_LIST"]],
+        index_type: Union[


Can you document the INVERTED index? E.g. somewhere around the paragraph starting with There are three types of scalar indices available today...

westonpace · 2024-07-18T13:21:04Z

rust/lance-index/src/scalar.rs

+    /// The maximum number of results to return
+    pub limit: Option<i64>,


I don't think we have any python bindings for limit (which is ok if intentional).

Also, how does the limit here compare to the limit that the scanner already has? I think it might be confusing to have two limits.

This is used to limit the number of results the index returns for performance, it's not exposed and will always be set to the limit that scanner has.
For what you mentioned that we should have a FTS plan node, i think this would be removed once we have that.

rust/lance-index/src/scalar/inverted.rs

rust/lance/src/dataset/scanner.rs

python/python/lance/dataset.py

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal · 2024-07-19T10:06:08Z

@westonpace @wjones127
Just moved the fts to a new FtsExec node, and block being with vector search for now

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal · 2024-07-19T11:28:06Z

rust/lance/src/io/exec/utils.rs

+use snafu::{location, Location};
+
+#[derive(Debug, Clone)]
+pub enum PreFilterSource {


These are just moved from knn.rs

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

westonpace

A few minor suggestions but overall approve.

westonpace · 2024-07-22T15:00:40Z

python/python/lance/dataset.py

@@ -296,6 +297,15 @@ def scanner(
            number of rows (or be empty) if the rows closest to the query do not
            match the filter.  It's generally good when the filter is not very
            selective.
+        full_text_query: str or dict, optional
+            query string to search for, the results will be ranked by BM25.
+            e.g. "hello world", would match documents contains "hello" or "world".


Suggested change

e.g. "hello world", would match documents contains "hello" or "world".

e.g. "hello world", would match documents containing "hello" or "world".

westonpace · 2024-07-22T15:06:16Z

rust/lance/src/dataset/scanner.rs

+                }
+            }
+        }
+


Should you check if self.limit is set here? What happens if user does scanner().limit(...).full_text_search(...)?

don't need to check it, moved setting limit to the stage of creating plan, also removed setting it from limit()

westonpace · 2024-07-22T15:07:46Z

rust/lance/src/dataset/scanner.rs

+        if let Some(query) = &self.full_text_query {
+            self.fts(&filter_plan, query).await?;
+        }


Suggested change

if let Some(query) = &self.full_text_query {

self.fts(&filter_plan, query).await?;

}

I don't think this is used? It looks like you call self.fts on line 953.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

github-actions bot added the enhancement New feature or request label Jul 9, 2024

eddyxu reviewed Jul 13, 2024

View reviewed changes

feat: integrate inverted index into lance index APIs

67a175c

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal force-pushed the fts-exec branch from 0258ca0 to 67a175c Compare July 15, 2024 03:47

BubbleCal added 4 commits July 15, 2024 11:56

fix

0583292

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

add doc and example

ef52e1c

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

836c740

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

python bindings

698c005

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

github-actions bot added the python label Jul 15, 2024

BubbleCal added 7 commits July 15, 2024 16:17

add test

fe0c2d7

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix ut

f1938e2

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix ut

790181b

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix ut

b190caf

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

remove unused code

bd10bbc

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fit type hint

499cf5c

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

impl remap

19a8ca4

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal requested review from eddyxu, westonpace and wjones127 July 15, 2024 14:05

add doc

2513052

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal mentioned this pull request Jul 15, 2024

Full text search (FTS) indices #1195

Open

7 tasks

perf: push down limit to inverted index

b3fdc40

- with this we don't need to sort all the matched docs - also optimizes the BM25 calculation Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal force-pushed the fts-exec branch from fac0e2b to b3fdc40 Compare July 17, 2024 13:02

BubbleCal added 4 commits July 17, 2024 21:09

fmt

f7c7577

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix expression

5c8c4b6

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

optimize

a5809e2

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

be4fd53

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

westonpace reviewed Jul 18, 2024

View reviewed changes

fix

7126c9b

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal added 4 commits July 18, 2024 23:06

simplify

9b2dbbe

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

support string

5632911

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix string

6f7205b

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix example

7d80ea9

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal requested a review from westonpace July 19, 2024 08:50

BubbleCal added 4 commits July 19, 2024 17:28

add FtsExec plan node

365638f

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

Merge branch 'main' of https://github.com/lancedb/lance into fts-exec

ed17ee2

fix

761a146

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

support prefiltering

5246efc

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal added 2 commits July 19, 2024 18:09

fmt

6c0b4fb

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

5dd639f

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal commented Jul 19, 2024

View reviewed changes

BubbleCal added 2 commits July 22, 2024 16:59

Merge branch 'main' of https://github.com/lancedb/lance into fts-exec

d9b4ce7

sort posting list by row id

58cfff6

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

westonpace approved these changes Jul 22, 2024

View reviewed changes

fix

9dbd6b8

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal merged commit e571229 into lancedb:main Jul 23, 2024
21 of 22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: integrate inverted index into lance index APIs #2577

feat: integrate inverted index into lance index APIs #2577

BubbleCal commented Jul 9, 2024 •

edited

Loading

codecov-commenter commented Jul 9, 2024 •

edited

Loading

eddyxu Jul 13, 2024

BubbleCal Jul 15, 2024

westonpace left a comment

westonpace Jul 18, 2024

BubbleCal Jul 18, 2024

wjones127 Jul 18, 2024 •

edited

Loading

BubbleCal Jul 19, 2024

westonpace Jul 18, 2024

BubbleCal Jul 18, 2024

westonpace Jul 18, 2024

BubbleCal Jul 18, 2024

westonpace Jul 18, 2024

westonpace Jul 18, 2024

BubbleCal Jul 18, 2024

BubbleCal commented Jul 19, 2024

BubbleCal Jul 19, 2024

westonpace left a comment

westonpace Jul 22, 2024

westonpace Jul 22, 2024

BubbleCal Jul 22, 2024

westonpace Jul 22, 2024

		/// The maximum number of results to return
		pub limit: Option<i64>,

	e.g. "hello world", would match documents contains "hello" or "world".
	e.g. "hello world", would match documents containing "hello" or "world".

	if let Some(query) = &self.full_text_query {
	self.fts(&filter_plan, query).await?;
	}

feat: integrate inverted index into lance index APIs #2577

feat: integrate inverted index into lance index APIs #2577

Conversation

BubbleCal commented Jul 9, 2024 • edited Loading

codecov-commenter commented Jul 9, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonpace left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wjones127 Jul 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BubbleCal commented Jul 19, 2024

Choose a reason for hiding this comment

westonpace left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BubbleCal commented Jul 9, 2024 •

edited

Loading

codecov-commenter commented Jul 9, 2024 •

edited

Loading

wjones127 Jul 18, 2024 •

edited

Loading