Store content hash to avoid embedding duplicate data #217

simonw · 2023-09-03T15:39:00Z

Maybe there should be a mechanism where it can keep a hash of the content in the database table, such that it can avoid duplicate lookups of already-embedded content based on that hash?

This is a tiny bit more expensive in terms of storage and compute, but would save a LOT of money against paid embedding APIs.

I think a 16 byte md5 in a BLOB column would be fine. I'm already storing the embeddings themselves which are MUCH larger than that.

sha256 would be 32 bytes. I don't think there are any security concerns here for using MD5.

I'll make the hash calculation a method on the Collection class that people can subclass and over-ride if they need to.

Originally posted by @simonw in #215 (comment)
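A minimal sketch of the overridable hash method described above. The class here is a hypothetical stand-in rather than the real `Collection` class, and the method name `content_hash` and the UTF-8 encoding step are assumptions made for illustration:

```python
import hashlib


class Collection:
    """Hypothetical stand-in for the real Collection class."""

    def content_hash(self, content: str) -> bytes:
        # MD5 produces a 16-byte digest, small enough for a BLOB column
        return hashlib.md5(content.encode("utf8")).digest()


class Sha256Collection(Collection):
    """Example subclass that swaps in a different hash function."""

    def content_hash(self, content: str) -> bytes:
        # SHA-256 produces a 32-byte digest
        return hashlib.sha256(content.encode("utf8")).digest()


md5_digest = Collection().content_hash("hello world")
sha_digest = Sha256Collection().content_hash("hello world")
```

Storing the raw `digest()` rather than `hexdigest()` is what keeps the MD5 value at 16 bytes; the hex form would be a 32-character string.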


simonw · 2023-09-03T15:39:43Z

Since it's only 16 bytes I'm going to store this for every embedding.

For the migration... how do I backfill existing rows? Since I haven't shipped the feature yet there should be very few of them out there, so I'm going to use a random hash value which will cause them to definitely be re-embedded next time it runs (since I likely didn't store the content).

Uses new migrations feature from simonw/sqlite-migrate#9

simonw · 2023-09-03T17:51:37Z

The embeddings table schema now includes a content_hash column, which contains the MD5 of the embedded content.

llm/docs/embeddings/python-api.md

Lines 165 to 174 in a5d6b58

    
           CREATE TABLE "embeddings" ( 
        
              [collection_id] INTEGER REFERENCES [collections]([id]), 
        
              [id] TEXT, 
        
              [embedding] BLOB, 
        
              [content] TEXT, 
        
              [content_hash] BLOB, 
        
              [metadata] TEXT, 
        
              [updated] INTEGER, 
        
              PRIMARY KEY ([collection_id], [id]) 
        
           )

Plus this index:

CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);

I didn't make that index unique because the same piece of content might be stored in more than one collection, resulting in multiple rows in the table.

A migration backfills this for all existing rows, setting it to a random MD5 hash for rows that did not store content.
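The backfill could look roughly like the sketch below. It uses only the standard library `sqlite3` module rather than the migration tooling referenced in this thread, and the function name `backfill_content_hash` and the trimmed-down starting table are assumptions for illustration:

```python
import hashlib
import os
import sqlite3


def backfill_content_hash(db: sqlite3.Connection) -> None:
    # Assumes an embeddings table that predates the content_hash column
    db.execute("ALTER TABLE embeddings ADD COLUMN content_hash BLOB")
    rows = db.execute("SELECT rowid, content FROM embeddings").fetchall()
    for rowid, content in rows:
        if content is not None:
            digest = hashlib.md5(content.encode("utf8")).digest()
        else:
            # No stored content: use the MD5 of random bytes, which will
            # almost certainly not match any real content hash, so the
            # row gets re-embedded the next time it is seen
            digest = hashlib.md5(os.urandom(16)).digest()
        db.execute(
            "UPDATE embeddings SET content_hash = ? WHERE rowid = ?",
            (digest, rowid),
        )
    # Non-unique index: the same content may appear in several collections
    db.execute(
        "CREATE INDEX idx_embeddings_content_hash ON embeddings (content_hash)"
    )
    db.commit()


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE embeddings (collection_id INTEGER, id TEXT, embedding BLOB, "
    "content TEXT, metadata TEXT, updated INTEGER, "
    "PRIMARY KEY (collection_id, id))"
)
db.execute("INSERT INTO embeddings (collection_id, id, content) VALUES (1, 'a', 'hello')")
db.execute("INSERT INTO embeddings (collection_id, id, content) VALUES (2, 'a', 'hello')")
db.execute("INSERT INTO embeddings (collection_id, id, content) VALUES (1, 'b', NULL)")
backfill_content_hash(db)
hashes = {
    (cid, rid): h
    for cid, rid, h in db.execute(
        "SELECT collection_id, id, content_hash FROM embeddings"
    )
}
```

The two rows holding `'hello'` in different collections end up with the same hash, which is the case a unique index would have rejected.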

simonw · 2023-09-03T17:56:59Z

Next I need to upgrade the various .embed() methods to skip embedding content that already exists.
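One way the skip could work, sketched against the schema above with the standard library `sqlite3` module. This is not the library's actual `.embed()` code: `embed_if_new`, `fake_embed`, and the little-endian float packing of the vector are all assumptions for illustration.

```python
import hashlib
import sqlite3
import struct
import time

SCHEMA = """
CREATE TABLE embeddings (
    collection_id INTEGER,
    id TEXT,
    embedding BLOB,
    content TEXT,
    content_hash BLOB,
    metadata TEXT,
    updated INTEGER,
    PRIMARY KEY (collection_id, id)
);
CREATE INDEX idx_embeddings_content_hash ON embeddings (content_hash);
"""


def embed_if_new(db, collection_id, item_id, content, embed_fn):
    """Embed and store content unless this collection already has its hash."""
    digest = hashlib.md5(content.encode("utf8")).digest()
    exists = db.execute(
        "SELECT 1 FROM embeddings "
        "WHERE content_hash = ? AND collection_id = ? LIMIT 1",
        (digest, collection_id),
    ).fetchone()
    if exists:
        return False  # already embedded, so skip the (possibly paid) API call
    vector = embed_fn(content)
    blob = struct.pack("<" + "f" * len(vector), *vector)
    db.execute(
        "INSERT OR REPLACE INTO embeddings "
        "(collection_id, id, embedding, content, content_hash, updated) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (collection_id, item_id, blob, content, digest, int(time.time())),
    )
    return True


calls = []


def fake_embed(text):
    # Stand-in for a real embedding model; records each call
    calls.append(text)
    return [float(len(text)), 0.0]


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
first = embed_if_new(db, 1, "doc-1", "hello world", fake_embed)
second = embed_if_new(db, 1, "doc-1", "hello world", fake_embed)
other = embed_if_new(db, 2, "doc-1", "hello world", fake_embed)
```

The lookup is scoped to a single collection, so the same content in a second collection is still embedded once there.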

Also refs simonw/sqlite-utils#589

simonw · 2023-09-03T19:57:54Z

I'll implement the next step as part of llm embed-multi in #215 because that's where the code that efficiently de-dupes things gets interesting (given the need to count stuff to be embedded for the progress bar).
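For the batch case, a rough sketch of filtering out already-embedded items while still counting them, so a progress bar can use the full batch size as its total. The helper name `split_new_and_seen` and the trimmed table are hypothetical, and loading every hash for the collection into memory is a simplification that may not suit large collections:

```python
import hashlib
import sqlite3


def split_new_and_seen(db, collection_id, items):
    """items is a list of (id, content); returns (to_embed, skipped_count)."""
    existing = {
        row[0]
        for row in db.execute(
            "SELECT content_hash FROM embeddings WHERE collection_id = ?",
            (collection_id,),
        )
    }
    to_embed = []
    skipped = 0
    for item_id, content in items:
        digest = hashlib.md5(content.encode("utf8")).digest()
        if digest in existing:
            skipped += 1  # still counts towards progress
        else:
            to_embed.append((item_id, content, digest))
    return to_embed, skipped


db = sqlite3.connect(":memory:")
# Trimmed table with only the columns this sketch needs
db.execute(
    "CREATE TABLE embeddings (collection_id INTEGER, id TEXT, "
    "content_hash BLOB, PRIMARY KEY (collection_id, id))"
)
db.execute(
    "INSERT INTO embeddings VALUES (1, 'a', ?)",
    (hashlib.md5(b"alpha").digest(),),
)
items = [("a", "alpha"), ("b", "beta"), ("c", "gamma")]
to_embed, skipped = split_new_and_seen(db, 1, items)
new_ids = [item_id for item_id, _, _ in to_embed]
```

Here the progress total would be `len(items)`, advanced by `skipped` up front and then once per item as each entry in `to_embed` is embedded.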

simonw · 2023-09-04T00:39:30Z

I was going to build that here but I guess I'll mostly retire that tool instead:

Mechanism for re-calculating embeddings only for changed text openai-to-sqlite#8


simonw added the enhancement and embeddings labels Sep 3, 2023

simonw added this to the 0.9 - embeddings milestone Sep 3, 2023

simonw mentioned this issue Sep 3, 2023

Ability to apply migrations up to a point simonw/sqlite-migrate#9

Closed

simonw added a commit that referenced this issue Sep 3, 2023

Store content_hash in embeddings table, refs #217

a5d6b58

Uses new migrations feature from simonw/sqlite-migrate#9

simonw added a commit that referenced this issue Sep 3, 2023

De-register functions after migration, refs #217

87af2dd

Also refs simonw/sqlite-utils#589

simonw added a commit that referenced this issue Sep 3, 2023

Populate content_hash with embed_multi, refs #217

156bed7

simonw closed this as completed in 3bf781f Sep 4, 2023

simonw mentioned this issue Sep 4, 2023

Mechanism for re-calculating embeddings only for changed text simonw/openai-to-sqlite#8

Closed

simonw added a commit that referenced this issue Sep 4, 2023

Fixed test I broke in #217

408297f

simonw added a commit that referenced this issue Sep 4, 2023

Release 0.9a1

c9f7299

Refs #192, #209, #211, #213, #215, #217, #218, #219, #222

simonw added a commit that referenced this issue Sep 4, 2023

Release 0.9

5efb300

Refs #192, #209, #211, #213, #215, #217, #218, #219, #222 Closes #205
