Update tile_size calculations to be based on the array feature type
#435
What
Previously in FLAT and IVF_FLAT indexes, we would set the IDs array `tile_size = TILE_SIZE_BYTES / np.dtype(vector_type).itemsize / dimensions`. This resulted in very large tile sizes when using `uint8` or `int8` feature vectors. To work around this, we change the calculation so that the tile size for the IDs array (which is just a vector) is based on the data type it stores, which today is hard-coded to `uint64`.

Testing Python changes
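Concretely, the change can be sketched like this (a sketch only, not the actual diff: the function names are placeholders, and `TILE_SIZE_BYTES = 128_000_000` is inferred from the tile sizes quoted below rather than stated anywhere in this PR):

```python
import numpy as np

TILE_SIZE_BYTES = 128_000_000  # inferred from the tile sizes quoted below

def old_ids_tile_size(vector_type, dimensions):
    # Before: the IDs tile size was derived from the *feature* vector type,
    # so 1-byte types like uint8 produced huge tiles.
    return int(TILE_SIZE_BYTES / np.dtype(vector_type).itemsize / dimensions)

def new_ids_tile_size(dimensions):
    # After: the IDs array stores uint64 IDs, so size its tile from uint64.
    return int(TILE_SIZE_BYTES / np.dtype(np.uint64).itemsize / dimensions)

print(old_ids_tile_size(np.uint8, 3))  # the 42,666,666-byte tile from the uint8 example below
print(new_ids_tile_size(3))            # the 5,333,333-byte tile after the change
```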
I ran this code before and after making these changes:
Here we're using `uint8` vectors with 3 dimensions. Before these changes our `shuffled_vector_ids` fragment was 253KB (this is because we had `np.dtype(vector_type = uint8).itemsize = 1`, and so our tile size was 42,666,666 bytes = 40.69 MB). After these changes our `shuffled_vector_ids` fragment was 31KB (this is because we had `np.dtype(vector_type = uint64).itemsize = 8`, and so our tile size was 5,333,333 bytes = 5.0 MB).

I also tested on siftsmall (which has `float32` vectors with 128 dimensions). Before these changes our `shuffled_vector_ids` fragment was 12KB, with `np.dtype(vector_type).itemsize = 4` and `tile_size = 250,000 bytes = 0.23 MB`. After these changes our `shuffled_vector_ids` fragment was also 12KB, even though we have `np.dtype(vector_type).itemsize = 8` and `tile_size = 125,000 bytes = 0.119 MB`.

I will say that our tile size seems too small when we have vectors with many dimensions, so perhaps there is a better approach than `big number / np.dtype(vector_type).itemsize / dimensions`. If you have suggestions, please do let me know.

Testing C++ changes
Here we build a Vamana index on SIFT (1 million 128-dimension `float32` vectors):

We see the same size with and without this change. The difference in the tile_size is only 2x (`float32` to `uint64`) versus 8x (`uint8` to `uint64`), and the dataset is larger, so whatever data overflows beyond the last tile looks to be negligible.
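As a sanity check on the arithmetic above, all four quoted tile sizes are consistent with a tile budget of 128,000,000 bytes (again, that constant is an inference from the numbers, not a value stated in this PR):

```python
TILE_SIZE_BYTES = 128_000_000  # inferred from the quoted tile sizes, not stated in this PR

def tile_size(itemsize, dimensions):
    # Reproduces TILE_SIZE_BYTES / itemsize / dimensions, truncated to whole bytes.
    return int(TILE_SIZE_BYTES / itemsize / dimensions)

# uint8 vectors, 3 dims: before (feature itemsize = 1) vs after (uint64 itemsize = 8)
assert tile_size(1, 3) == 42_666_666  # 40.69 MiB, as quoted
assert tile_size(8, 3) == 5_333_333   # ~5.0 MB, as quoted

# float32 vectors, 128 dims: before (itemsize = 4) vs after (itemsize = 8)
assert tile_size(4, 128) == 250_000   # ~0.23 MB, as quoted
assert tile_size(8, 128) == 125_000   # 0.119 MB, as quoted
```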