Store content hash to avoid embedding duplicate data #217

Closed
simonw opened this issue Sep 3, 2023 · 5 comments
Labels: embeddings, enhancement

Comments

simonw commented Sep 3, 2023

Maybe there should be a mechanism where it can keep a hash of the content in the database table, such that it can avoid duplicate lookups of already-embedded content based on that hash?

This is a tiny bit more expensive in terms of storage and compute, but would save a LOT of money against paid embedding APIs.

I think a 16 byte md5 in a BLOB column would be fine. I'm already storing the embeddings themselves which are MUCH larger than that.

sha256 would be 32 bytes. I don't think there are any security concerns here for using MD5.

I'll make the hash calculation a method on the Collection class that people can subclass and override if they need to.
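
A minimal sketch of what such an overridable method might look like. The `Collection` class below is a hypothetical stand-in rather than the project's real class, and the `Sha256Collection` subclass is invented only to show the override; the single assumption carried over from the comment above is that the hash is stored as the raw 16-byte MD5 digest.

```python
import hashlib


class Collection:
    """Hypothetical stand-in for the real Collection class."""

    @staticmethod
    def content_hash(content):
        """Return the 16-byte MD5 digest of str or bytes content."""
        if isinstance(content, str):
            content = content.encode("utf-8")
        return hashlib.md5(content).digest()


class Sha256Collection(Collection):
    """Illustrative override that swaps in a 32-byte SHA-256 digest."""

    @staticmethod
    def content_hash(content):
        if isinstance(content, str):
            content = content.encode("utf-8")
        return hashlib.sha256(content).digest()
```

Storing `digest()` rather than `hexdigest()` is what keeps the value at 16 bytes in a BLOB column; the hex form would take 32 characters.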

Originally posted by @simonw in #215 (comment)

simonw added the enhancement and embeddings labels Sep 3, 2023
simonw added this to the 0.9 - embeddings milestone Sep 3, 2023

simonw commented Sep 3, 2023

Since it's only 16 bytes I'm going to store this for every embedding.

For the migration... how do I backfill existing rows? Since I haven't shipped the feature yet there should be very few of them out there, so I'm going to use a random hash value which will cause them to definitely be re-embedded next time it runs (since I likely didn't store the content).

simonw commented Sep 3, 2023

The embeddings table schema now includes a content_hash column, which contains the MD5 hash of the embedded content.

CREATE TABLE "embeddings" (
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT,
   [embedding] BLOB,
   [content] TEXT,
   [content_hash] BLOB,
   [metadata] TEXT,
   [updated] INTEGER,
   PRIMARY KEY ([collection_id], [id])
)

Plus this index:

CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);

I didn't make that index unique because the same piece of content might be stored in more than one collection, resulting in multiple rows in the table.

A migration backfills this for all existing rows, setting it to a random MD5 hash for rows that did not store content.
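
The backfill described above could be sketched roughly as follows using only the standard library `sqlite3` module. This is an illustration under stated assumptions, not the project's actual migration: the function name `add_content_hash` is made up, and it assumes that rows with stored content receive the real MD5 of that content while rows without content receive a random one.

```python
import hashlib
import os
import sqlite3


def add_content_hash(conn):
    """Add content_hash, backfill it, then index it (illustrative only)."""
    conn.execute("ALTER TABLE embeddings ADD COLUMN content_hash BLOB")
    rows = conn.execute("SELECT rowid, content FROM embeddings").fetchall()
    for rowid, content in rows:
        if content is not None:
            digest = hashlib.md5(content.encode("utf-8")).digest()
        else:
            # No stored content: a random hash will almost certainly not
            # match any real content, so the row gets re-embedded later.
            digest = hashlib.md5(os.urandom(16)).digest()
        conn.execute(
            "UPDATE embeddings SET content_hash = ? WHERE rowid = ?",
            (digest, rowid),
        )
    # Non-unique index: the same content may appear in several collections.
    conn.execute(
        "CREATE INDEX idx_embeddings_content_hash "
        "ON embeddings (content_hash)"
    )
    conn.commit()
```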

simonw commented Sep 3, 2023

Next I need to upgrade the various .embed() methods to skip embedding content that already exists.
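
One way the skip could work, sketched against the schema above. Everything here is hypothetical: `embed_if_new` is not a real method, `embed_fn` stands in for whatever calls the embedding model and is assumed to return bytes, and scoping the lookup to a single collection is an assumption rather than something this thread specifies.

```python
import hashlib
import sqlite3


def embed_if_new(conn, collection_id, id_, content, embed_fn):
    """Embed content unless this collection already holds the same hash.

    Returns True if embed_fn was called, False if the call was skipped.
    """
    digest = hashlib.md5(content.encode("utf-8")).digest()
    existing = conn.execute(
        "SELECT 1 FROM embeddings "
        "WHERE collection_id = ? AND content_hash = ?",
        (collection_id, digest),
    ).fetchone()
    if existing:
        return False  # already embedded, so no paid API call is needed
    conn.execute(
        "INSERT OR REPLACE INTO embeddings "
        "(collection_id, id, embedding, content_hash) VALUES (?, ?, ?, ?)",
        (collection_id, id_, embed_fn(content), digest),
    )
    return True
```

The lookup can use the `content_hash` index, so the check stays cheap compared with the embedding call it avoids.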

simonw commented Sep 3, 2023

I'll implement the next step as part of llm embed-multi in #215 because that's where the code that efficiently de-dupes things gets interesting (given the need to count stuff to be embedded for the progress bar).
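
For the batch case, a rough sketch of filtering before embedding so that the number of items left is known up front for the progress bar. The helper name `pending_items` and the approach of loading a collection's hashes into a set are assumptions made for illustration, not a description of how `llm embed-multi` does it.

```python
import hashlib
import sqlite3


def pending_items(conn, collection_id, items):
    """Return the (id, content) pairs whose content is not yet embedded."""
    known = {
        row[0]
        for row in conn.execute(
            "SELECT content_hash FROM embeddings WHERE collection_id = ?",
            (collection_id,),
        )
    }
    pending = []
    for id_, content in items:
        digest = hashlib.md5(content.encode("utf-8")).digest()
        if digest not in known:
            known.add(digest)  # also drops duplicates within the batch
            pending.append((id_, content))
    return pending
```

`len(pending_items(...))` then gives the progress bar total. Holding the hashes in memory costs 16 bytes of digest per existing row plus set overhead, which may or may not be acceptable for very large collections.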

simonw commented Sep 4, 2023

I was going to build that here but I guess I'll mostly retire that tool instead:

simonw added a commit that referenced this issue Sep 4, 2023
simonw added a commit that referenced this issue Sep 4, 2023
simonw added a commit that referenced this issue Sep 4, 2023