Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: integrate inverted index into lance index APIs #2577

Merged
merged 32 commits into from
Jul 23, 2024

Conversation

BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Jul 9, 2024

@github-actions github-actions bot added the enhancement New feature or request label Jul 9, 2024
@codecov-commenter
Copy link

codecov-commenter commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 41.73765% with 342 lines in your changes missing coverage. Please review.

Project coverage is 78.99%. Comparing base (b45393e) to head (9dbd6b8).

Files Patch % Lines
rust/lance/src/io/exec/fts.rs 0.00% 104 Missing ⚠️
rust/lance-index/src/scalar/inverted.rs 64.17% 86 Missing and 10 partials ⚠️
rust/lance/src/dataset/scanner.rs 28.68% 84 Missing and 8 partials ⚠️
rust/lance/src/io/exec/utils.rs 57.14% 0 Missing and 15 partials ⚠️
rust/lance-index/src/scalar.rs 33.33% 14 Missing ⚠️
rust/lance-index/src/scalar/bitmap.rs 0.00% 8 Missing ⚠️
rust/lance/src/index/scalar.rs 30.00% 4 Missing and 3 partials ⚠️
rust/lance-index/src/prefilter.rs 33.33% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2577      +/-   ##
==========================================
- Coverage   79.35%   78.99%   -0.36%     
==========================================
  Files         213      215       +2     
  Lines       62521    62904     +383     
  Branches    62521    62904     +383     
==========================================
+ Hits        49614    49693      +79     
- Misses       9996    10302     +306     
+ Partials     2911     2909       -2     
Flag Coverage Δ
unittests 78.99% <41.73%> (-0.36%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -355,6 +361,15 @@ impl Scanner {
Ok(self)
}

pub fn full_text_search(&mut self, column: &str, query: &str) -> Result<&mut Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets have a docstring for the format of the query?

IIRC, tantivy does not need to provide column. It covers all columns in the index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tantivy requires to specify the index when search, and have to specify the schema when creating index, so the index know which fields to search
btw, should we change this to columns: IntoIter<AsStr>?

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal mentioned this pull request Jul 15, 2024
7 tasks
- with this we don't need to sort all the matched docs
- also optimizes the BM25 calculation

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there we are mixing two different concepts in an odd way.

The two concepts are "scan filter" and "nearest search".

A "scan filter" fully scans the dataset and returns all rows that satisfy some kind of filter.

A "nearest search" uses an index to determine the K closest row ids and then takes those row ids from the data.

We currently have full text search implemented as some kind of combination between the two. It is a "scan filter" but the filter is "the row is within the top K matches of the query".

Everything sort of works but if you specify either a vector search (nearest) or a standard filter (filter) then I think they will combine in unexpected ways.

If a vector search is specified then the full text search will happen first and the vector search will only search the row ids that match the full text search. We should probably either return an error or do some kind of (configurable) reranking (this would be quite complicated).

If a filter is specified then the filter will be a post-filter. In other words, if the user searches for "full_text_query='hello', filter='year>2020',limit=10" then it will first find the 10 best matches for the full text search and then filter those by year>2020. We might miss matches that include "hello" but score lower than other matches that come before 2020 (false negatives are a common problem with prefiltering). So, for today, we might want to just return an error if a filter and full_text_query are present.

Long term, I think we need to treat "full text search" as a completely independent query plan node.

Query-Plan-with-FTS

Comment on lines 1268 to 1271
if not pa.types.is_large_string(field.type):
raise TypeError(
f"INVERTED index column {column} must be a large string"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we only support FTS on large string and not string? That seems a significant limitation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible to support string, support large_string first cause it can cover more cases

Copy link
Contributor

@wjones127 wjones127 Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should make supporting string high priority. (We should probably also support string view as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added string

@@ -418,6 +425,8 @@ def to_table(
Return row ID.
use_stats: bool, default True
Use stats pushdown during filters.
full_text_query: (str, str), optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation is a little confusing. It seems to imply the input can be a tuple (str, str) but in the code it looks like you accept str or dict. However, later in the rust code, it looks like you only really accept dict.

If that's the case we should change this to dict and be clear what the keys are and what each one means. I think, from inspection, you want a query key (the query string) and a columns key (the columns to search).

However, it's also not clear what kind of search is performed. Is this a "contains" query? Is it searching for all strings that contain the given query string? Can the query string have spaces? If it does are these treated as multiple words or a combined string? For example, will the query "a b" match the string "b a"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed that doc, it would accept str or dict

it would search for documents that contains any words of query string, and rank them with BM25. Say "a b" can match docs contain "a" or "b" or "b a"

Comment on lines +343 to +345
builder = builder.full_text_search(full_text_query)
else:
builder = builder.full_text_search(**full_text_query)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, from looking at the rust code, the full_text_search method only accepts a dict. In that case I don't think you need to pass in **full_text_query and can just pass in full_text_query.

Also, if the user provides a str won't the call on line 343 fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The builder's method is:

    def full_text_search(
        self,
        query: str,
        columns: Optional[List[str]] = None,
    ) -> ScannerBuilder:
        """
        Filter rows by full text searching. *Experimental API*,
        may remove it after we support to do this within `filter` SQL-like expression

        Must create inverted index on the given column before searching,
        """
        self._full_text_query = {"query": query, "columns": columns}
        return self

so it won't fail.

@@ -1105,7 +1117,12 @@ def cleanup_old_versions(
def create_scalar_index(
self,
column: str,
index_type: Union[Literal["BTREE"], Literal["BITMAP"], Literal["LABEL_LIST"]],
index_type: Union[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document the INVERTED index? E.g. somewhere around the paragraph starting with There are three types of scalar indices available today...

Comment on lines +109 to +110
/// The maximum number of results to return
pub limit: Option<i64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have any python bindings for limit (which is ok if intentional).

Also, how does the limit here compare to the limit that the scanner already has? I think it might be confusing to have two limits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used to limit the number of results the index returns for performance, it's not exposed and will always be set to the limit that scanner has.
For what you mentioned that we should have a FTS plan node, i think this would be removed once we have that.

rust/lance-index/src/scalar/inverted.rs Outdated Show resolved Hide resolved
rust/lance-index/src/scalar/inverted.rs Outdated Show resolved Hide resolved
rust/lance/src/dataset/scanner.rs Outdated Show resolved Hide resolved
rust/lance/src/dataset/scanner.rs Outdated Show resolved Hide resolved
python/python/lance/dataset.py Outdated Show resolved Hide resolved
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal
Copy link
Contributor Author

@westonpace @wjones127
Just moved the fts to a new FtsExec node, and block being with vector search for now

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
use snafu::{location, Location};

#[derive(Debug, Clone)]
pub enum PreFilterSource {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just moved from knn.rs

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor suggestions but overall approve.

@@ -296,6 +297,15 @@ def scanner(
number of rows (or be empty) if the rows closest to the query do not
match the filter. It's generally good when the filter is not very
selective.
full_text_query: str or dict, optional
query string to search for, the results will be ranked by BM25.
e.g. "hello world", would match documents contains "hello" or "world".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
e.g. "hello world", would match documents contains "hello" or "world".
e.g. "hello world", would match documents containing "hello" or "world".

}
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you check if self.limit is set here? What happens if user does scanner().limit(...).full_text_search(...)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need to check it, moved setting limit to the stage of creating plan, also removed setting it from limit()

Comment on lines 934 to 936
if let Some(query) = &self.full_text_query {
self.fts(&filter_plan, query).await?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if let Some(query) = &self.full_text_query {
self.fts(&filter_plan, query).await?;
}

I don't think this is used? It looks like you call self.fts on line 953.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal merged commit e571229 into lancedb:main Jul 23, 2024
21 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants