Partitioned HNSW Deeplake Side Changes. #2847

Merged
merged 24 commits into from
Jun 13, 2024

@sounakr sounakr commented May 8, 2024

🚀 🚀 Pull Request

This PR is the Deeplake-side implementation of Partitioned HNSW. With Partitioned HNSW, the HNSW index is divided into a number of partitions. This is done when the data is large and the index has to scale. HNSW on its own is not scalable, so partitioning is a way to accommodate large amounts of data.
Partitions are defined in the index params. For example, if we create 5 partitions and the dataset has 1,000,000 rows, each partition will have 200,000 rows.
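The row arithmetic above can be sketched as follows. This is only an illustration: the PR does not say how rows are assigned to partitions, so the contiguous, near-equal split and the helper name `partition_ranges` are assumptions, not Deeplake's implementation.

```python
from typing import List


def partition_ranges(num_rows: int, partitions: int) -> List[range]:
    """Split row indices 0..num_rows-1 into contiguous, near-equal ranges.

    Hypothetical helper: assumes an even, contiguous split, which the PR
    description implies but does not spell out.
    """
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    base, extra = divmod(num_rows, partitions)
    ranges = []
    start = 0
    for i in range(partitions):
        # The first `extra` partitions absorb one leftover row each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# The example from the description: 1,000,000 rows in 5 partitions.
ranges = partition_ranges(1_000_000, 5)
sizes = [len(r) for r in ranges]
print(sizes)
```

When the row count does not divide evenly, this sketch gives the leftover rows to the first partitions. The real index may balance them differently.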

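Through the VectorStore API:
from deeplake.core.vectorstore import VectorStore

# dest (dataset path) and TOKEN (API token) are assumed to be defined by the caller.
vs = VectorStore(
    path=dest,
    exec_option="compute_engine",
    index_params={
        "threshold": 1,
        "distance_metric": "COS",
        "additional_params": {
            "efConstruction": 200,
            "M": 16,
            "partitions": 5,
        },
    },
    token=TOKEN,
    verbose=True,
    overwrite=True,
)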

Through the Deeplake API:
ds = vs.dataset
params = {
    "efConstruction": 200,
    "M": 16,
    "partitions": 32,
}
ds.embedding.create_vdb_index("hnsw_1", distance="cosine_similarity", additional_params=params)

Querying is unchanged: the TQL query is sent to all partitions simultaneously, and the best matches are returned.
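A minimal sketch of this fan-out-and-merge idea is shown below. It is not the Deeplake or TQL implementation: each partition is searched by brute-force cosine similarity as a stand-in for an HNSW sub-index, and the names `search_partition` and `search_all_partitions` are hypothetical.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Partition = List[Tuple[int, Vector]]  # (row_id, embedding) pairs


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_partition(rows: Partition, query: Vector, k: int) -> List[Tuple[float, int]]:
    """Top-k (score, row_id) hits in one partition (brute force, not HNSW)."""
    scored = ((cosine_similarity(vec, query), row_id) for row_id, vec in rows)
    return heapq.nlargest(k, scored)


def search_all_partitions(partitions: List[Partition], query: Vector, k: int) -> List[Tuple[float, int]]:
    """Query every partition concurrently, then keep the global top-k."""
    with ThreadPoolExecutor() as pool:
        per_partition = pool.map(lambda rows: search_partition(rows, query, k), partitions)
        candidates = [hit for hits in per_partition for hit in hits]
    return heapq.nlargest(k, candidates)


partitions = [
    [(0, [1.0, 0.0]), (1, [0.0, 1.0])],
    [(2, [0.9, 0.1]), (3, [-1.0, 0.0])],
]
top = search_all_partitions(partitions, [1.0, 0.0], k=2)
print([row_id for _, row_id in top])
```

Each partition returns its own top-k, so the merge step only has to compare at most k hits per partition rather than every row.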

Incremental index maintenance is enabled for partitioned HNSW. When rows are added or updated, or when the topmost rows are removed, the partitioned HNSW index is maintained automatically.

To delete the partitioned HNSW index:
ds.embedding.delete_vdb_index("hnsw_1")

Impact

Partitioned indexes are much faster to create and have a high impact on recall. This feature is helpful whenever indexing has to be done at scale.

CLAassistant commented May 27, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 4 committers have signed the CLA.

✅ activesoull
✅ sounakr
✅ khustup2
❌ azat-manukyan

@sounakr sounakr marked this pull request as ready for review June 3, 2024 02:06
@sounakr sounakr requested a review from nvoxland-al June 3, 2024 02:06
@sounakr sounakr changed the title from "[WIP]Partitioned HNSW Deeplake Side Changes." to "Partitioned HNSW Deeplake Side Changes." Jun 5, 2024

sonarcloud bot commented Jun 13, 2024

Quality Gate failed

Failed conditions
7.4% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@khustup2 khustup2 merged commit 31959f2 into main Jun 13, 2024
7 of 10 checks passed
@khustup2 khustup2 deleted the partitioned_hnsw branch June 13, 2024 10:16
nvoxland-al pushed a commit that referenced this pull request Jun 18, 2024
5 participants