Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Regression in cohere-10m force merge latency after switching to NativeEngines990KnnVectorsWriter #2134

Open
shatejas opened this issue Sep 20, 2024 · 2 comments
Assignees
Labels
bug Something isn't working untriaged

Comments

@shatejas
Copy link
Contributor

What is the bug?

After the switch to NativeEngines990KnnVectorsWriter we saw force merge latencies increased approximately by 20% in nightly runs

The increase has been consistent

How can one reproduce the bug?
Running a benchmark for force-merge against 2.17 vs 2.16 with cohere 10m dataset takes 2000seconds (~30 mins) more

What is the expected behavior?
It should take the same time or less

What is your host/environment?
Nightlies dashboard

Do you have any screenshots?
NA

Do you have any additional context?
NA

@shatejas shatejas added bug Something isn't working untriaged labels Sep 20, 2024
@shatejas
Copy link
Contributor Author

shatejas commented Sep 20, 2024

To reproduce easily cohere 1m dataset was used for benchmarking for the below table
 

  number of index segments force merge time (minutes) force merge segments
2.17 code 75 15.68263 3
#2133 code 92 12.51252 3
2.16 code 88 15.68209 3

Estimated bottlenecks

KNNVectorValues Creation

KNNVectorValues are created 3 times currently, we cannot reuse the same object as there is no way we could reset the iterator and putting effort into logic for resetting the iterator might not result in latency improvements

  1. Computing totalLiveDocs
  2. Training for quantization
  3. Building index

Currently we are creating KNNVectorValues when quantization is not needed. Exp 2 in the above table shows some improvement in force merge time

TotalLiveDocs computes

There is a linear time complexity to compute total live docs. TotalLiveDocs value is currently needed to

  1. Mean calculations during quantization training
  2. Memory allocation computations while building graph for HNSW

Flush case

For flush we can avoid this calculation as there are no deleted docs involved and we can rely on KNNVectorValues or vectors in the field to give us the right result for totalLiveDocs

Merge case

Merge involves removing deleted docs, While merging the segments the deleted docs aren’t considered. To do that current code path is using APIs in MergedVectorValues to have an iterator that can iterate while skipping the deleted docs. The APIs here does not give an iterator which considered deleted docs in its size count. As a result even KNNVectorValues cannot return the right result as it relies on the iterator provided by the MergedVectorValues to compute total live docs

@navneet1v
Copy link
Collaborator

@shatejas one way to avoid the linear complexity for totalLives does when there are no deleted docs is we can write our custom FloatVectorValues merger. We already have something like this in BinaryDocValues: ref: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN80Codec/KNN80DocValuesReader.java#L38-L64

If we do this, we can remove the complexity for total live docs. We can also add some clone interfaces on those merged values which will help us remove complexity from code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working untriaged
Projects
None yet
Development

No branches or pull requests

2 participants