-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: integrate inverted index into lance index APIs #2577
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2577 +/- ##
==========================================
- Coverage 79.35% 78.99% -0.36%
==========================================
Files 213 215 +2
Lines 62521 62904 +383
Branches 62521 62904 +383
==========================================
+ Hits 49614 49693 +79
- Misses 9996 10302 +306
+ Partials 2911 2909 -2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
rust/lance/src/dataset/scanner.rs
Outdated
@@ -355,6 +361,15 @@ impl Scanner { | |||
Ok(self) | |||
} | |||
|
|||
pub fn full_text_search(&mut self, column: &str, query: &str) -> Result<&mut Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets have a docstring for the format of the query?
IIRC, tantivy does not need to provide column. It covers all columns in the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tantivy requires to specify the index when search, and have to specify the schema when creating index, so the index know which fields to search
btw, should we change this to columns: IntoIter<AsStr>
?
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
- with this we don't need to sort all the matched docs - also optimizes the BM25 calculation Signed-off-by: BubbleCal <bubble-cal@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there we are mixing two different concepts in an odd way.
The two concepts are "scan filter" and "nearest search".
A "scan filter" fully scans the dataset and returns all rows that satisfy some kind of filter.
A "nearest search" uses an index to determine the K closest row ids and then takes those row ids from the data.
We currently have full text search implemented as some kind of combination between the two. It is a "scan filter" but the filter is "the row is within the top K matches of the query".
Everything sort of works but if you specify either a vector search (nearest
) or a standard filter (filter
) then I think they will combine in unexpected ways.
If a vector search is specified then the full text search will happen first and the vector search will only search the row ids that match the full text search. We should probably either return an error or do some kind of (configurable) reranking (this would be quite complicated).
If a filter is specified then the filter will be a post-filter. In other words, if the user searches for "full_text_query='hello', filter='year>2020',limit=10" then it will first find the 10 best matches for the full text search and then filter those by year>2020
. We might miss matches that include "hello" but score lower than other matches that come before 2020
(false negatives are a common problem with prefiltering). So, for today, we might want to just return an error if a filter
and full_text_query
are present.
Long term, I think we need to treat "full text search" as a completely independent query plan node.
python/python/lance/dataset.py
Outdated
if not pa.types.is_large_string(field.type): | ||
raise TypeError( | ||
f"INVERTED index column {column} must be a large string" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we only support FTS on large string and not string? That seems a significant limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible to support string
, support large_string
first cause it can cover more cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should make supporting string
high priority. (We should probably also support string view as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added string
python/python/lance/dataset.py
Outdated
@@ -418,6 +425,8 @@ def to_table( | |||
Return row ID. | |||
use_stats: bool, default True | |||
Use stats pushdown during filters. | |||
full_text_query: (str, str), optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This documentation is a little confusing. It seems to imply the input can be a tuple (str, str)
but in the code it looks like you accept str
or dict
. However, later in the rust code, it looks like you only really accept dict
.
If that's the case we should change this to dict
and be clear what the keys are and what each one means. I think, from inspection, you want a query
key (the query string) and a columns
key (the columns to search).
However, it's also not clear what kind of search is performed. Is this a "contains" query? Is it searching for all strings that contain the given query string? Can the query string have spaces? If it does are these treated as multiple words or a combined string? For example, will the query "a b" match the string "b a"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed that doc, it would accept str
or dict
it would search for documents that contains any words of query
string, and rank them with BM25. Say "a b" can match docs contain "a" or "b" or "b a"
builder = builder.full_text_search(full_text_query) | ||
else: | ||
builder = builder.full_text_search(**full_text_query) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, from looking at the rust code, the full_text_search
method only accepts a dict
. In that case I don't think you need to pass in **full_text_query
and can just pass in full_text_query
.
Also, if the user provides a str
won't the call on line 343 fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The builder's method is:
def full_text_search(
self,
query: str,
columns: Optional[List[str]] = None,
) -> ScannerBuilder:
"""
Filter rows by full text searching. *Experimental API*,
may remove it after we support to do this within `filter` SQL-like expression
Must create inverted index on the given column before searching,
"""
self._full_text_query = {"query": query, "columns": columns}
return self
so it won't fail.
@@ -1105,7 +1117,12 @@ def cleanup_old_versions( | |||
def create_scalar_index( | |||
self, | |||
column: str, | |||
index_type: Union[Literal["BTREE"], Literal["BITMAP"], Literal["LABEL_LIST"]], | |||
index_type: Union[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document the INVERTED
index? E.g. somewhere around the paragraph starting with There are three types of scalar indices available today
...
/// The maximum number of results to return | ||
pub limit: Option<i64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have any python bindings for limit
(which is ok if intentional).
Also, how does the limit
here compare to the limit
that the scanner already has? I think it might be confusing to have two limits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used to limit the number of results the index returns for performance, it's not exposed and will always be set to the limit
that scanner has.
For what you mentioned that we should have a FTS plan node, i think this would be removed once we have that.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@westonpace @wjones127 |
use snafu::{location, Location}; | ||
|
||
#[derive(Debug, Clone)] | ||
pub enum PreFilterSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are just moved from knn.rs
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor suggestions but overall approve.
python/python/lance/dataset.py
Outdated
@@ -296,6 +297,15 @@ def scanner( | |||
number of rows (or be empty) if the rows closest to the query do not | |||
match the filter. It's generally good when the filter is not very | |||
selective. | |||
full_text_query: str or dict, optional | |||
query string to search for, the results will be ranked by BM25. | |||
e.g. "hello world", would match documents contains "hello" or "world". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. "hello world", would match documents contains "hello" or "world". | |
e.g. "hello world", would match documents containing "hello" or "world". |
} | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you check if self.limit
is set here? What happens if user does scanner().limit(...).full_text_search(...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need to check it, moved setting limit to the stage of creating plan, also removed setting it from limit()
rust/lance/src/dataset/scanner.rs
Outdated
if let Some(query) = &self.full_text_query { | ||
self.fts(&filter_plan, query).await?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if let Some(query) = &self.full_text_query { | |
self.fts(&filter_plan, query).await?; | |
} |
I don't think this is used? It looks like you call self.fts
on line 953.
roadmap: Full text search (FTS) indices #1195