Implicit casting and resolving literal types for ARRAY when doing similarity search #3248

Closed
prrao87 opened this issue Apr 10, 2024 · 3 comments
Labels: frontend, usability

Comments

@prrao87
Member

prrao87 commented Apr 10, 2024

Consider this similarity search case where I want to return the node whose vector property is nearest to a given query vector provided by the user. I'll use the cosine similarity to demonstrate this.

import os
import shutil
import kuzu

if os.path.exists("./db"):
    shutil.rmtree("./db")

# Create database
db = kuzu.Database("./db")
conn = kuzu.Connection(db)

# Define schema
conn.execute("CREATE NODE TABLE Item(id UINT64, item STRING, price DOUBLE, vector DOUBLE[2], PRIMARY KEY (id))")

# Add data
conn.execute("MERGE (a:Item {id: 1, item: 'apple', price: 2.0, vector: cast([3.1, 4.1], 'DOUBLE[2]')})")
conn.execute("MERGE (b:Item {id: 2, item: 'banana', price: 1.0, vector: cast([5.9, 26.5], 'DOUBLE[2]')})")

# Run similarity search
res = conn.execute("MATCH (a:Item) RETURN a.item, a.price, array_cosine_similarity(a.vector, cast([6.0, 25.0], 'DOUBLE[2]')) AS sim ORDER BY sim DESC")
while res.has_next():
    row = res.get_next()
    print(row)

I want banana returned as the most similar node, since the query vector [6.0, 25.0] is closest to banana's vector. After some massaging, I got the example code above to work, and it returns this:

['banana', 1.0, 0.9998642653091405]
['apple', 2.0, 0.9163829638139936]
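The two scores above can be cross-checked outside the database. The following is a minimal pure-Python sketch of the standard cosine similarity formula, not Kùzu's implementation; the function name `cosine_similarity` is made up for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Return dot(u, v) / (|u| * |v|) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


query = [6.0, 25.0]
banana_sim = cosine_similarity([5.9, 26.5], query)
apple_sim = cosine_similarity([3.1, 4.1], query)

# banana should rank above apple, matching the query output above
print(banana_sim, apple_sim)
```

The values agree with the query result to within floating-point rounding, so banana ranks first as expected.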

Verbosity and scope for errors

The fixed-list/array type is currently a bit inconvenient and hard to use for similarity search: every vector literal needs an explicit cast, which is verbose and easy to get wrong.
It would be a lot easier if we could simply write this instead (without explicitly performing the cast):

# Add data
conn.execute("MERGE (a:Item {id: 1, item: 'apple', price: 2.0, vector: [3.1, 4.1]})")
conn.execute("MERGE (b:Item {id: 2, item: 'banana', price: 1.0, vector: [5.9, 26.5]})")

# Run similarity search
res = conn.execute("MATCH (a:Item) RETURN a.item, a.price, array_cosine_similarity(a.vector, [6.0, 25.0]) AS sim ORDER BY sim DESC")

Can this be incorporated without any breaking changes to other functionality?
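Until literals are resolved implicitly, one way to reduce the repetition on the client side is to generate the cast expression from a Python list. This is only a sketch: `array_literal` is a hypothetical helper name, not part of the kuzu package, and it simply builds the same `cast([...], 'DOUBLE[n]')` text used in the working example above.

```python
def array_literal(values):
    """Build a Cypher expression that casts a list literal to DOUBLE[n].

    Hypothetical helper: it only formats a string and does not call kuzu.
    """
    if not values:
        raise ValueError("a fixed-size array needs at least one element")
    # float() rejects non-numeric input, so arbitrary text cannot be
    # spliced into the query string through this helper.
    items = ", ".join(repr(float(v)) for v in values)
    return f"cast([{items}], 'DOUBLE[{len(values)}]')"


query_vec = array_literal([6.0, 25.0])
query = (
    "MATCH (a:Item) RETURN a.item, a.price, "
    f"array_cosine_similarity(a.vector, {query_vec}) AS sim ORDER BY sim DESC"
)
print(query)
```

The resulting string can be passed to `conn.execute(...)` exactly like the hand-written query in the first example.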

prrao87 added the feature, question, usability, and frontend labels on Apr 10, 2024
@mxwli
Collaborator

mxwli commented Apr 10, 2024

Probably the best way to incorporate this would be to resolve list literals to ARRAYS and allow implicit casting from ARRAYS to LISTS.

@andyfengHKU
Contributor

> Probably the best way to incorporate this would be to resolve list literals to ARRAYS and allow implicit casting from ARRAYS to LISTS.

Yeah, let's just implement this casting rule. It should be straightforward to do.
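As a rough illustration of the rule being discussed (list literals resolve to ARRAY, and ARRAY may be implicitly cast to LIST), here is a toy model in Python. It is not Kùzu's binder code; every class and function name below (`ListType`, `ArrayType`, `resolve_literal`, `can_implicit_cast`) is invented for this sketch, and the element type is assumed to be DOUBLE for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListType:
    child: str  # element type name, e.g. "DOUBLE"


@dataclass(frozen=True)
class ArrayType:
    child: str
    size: int  # fixed number of elements


def resolve_literal(values):
    """Resolve a literal such as [6.0, 25.0] to a fixed-size ARRAY type."""
    if not values:
        raise ValueError("cannot infer an ARRAY type from an empty literal")
    return ArrayType("DOUBLE", len(values))


def can_implicit_cast(src, dst):
    """Allow identity casts and ARRAY -> LIST with the same element type."""
    if src == dst:
        return True
    if isinstance(src, ArrayType) and isinstance(dst, ListType):
        return src.child == dst.child
    return False


literal_type = resolve_literal([6.0, 25.0])
print(literal_type)
print(can_implicit_cast(literal_type, ListType("DOUBLE")))
```

Under this model, the literal `[6.0, 25.0]` already has type DOUBLE[2] and matches a DOUBLE[2] column with no cast, while functions that expect a LIST still accept it through the ARRAY-to-LIST rule.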

prrao87 assigned mxwli and unassigned andyfengHKU and acquamarin on Apr 10, 2024
prrao87 removed the question and feature labels on Apr 10, 2024
@andyfengHKU
Contributor

Should be fixed in #3394
