
Update tile_size calculations to be based on the array feature type #435

Closed (wanted to merge 2 commits)

Conversation

jparismorgan (Collaborator)

What

Previously, in FLAT and IVF_FLAT indexes, we set the IDs array tile_size = TILE_SIZE_BYTES / np.dtype(vector_type).itemsize / dimensions. This resulted in very large tile sizes when using uint8 or int8 feature vectors. To fix this, we now calculate the tile size for the IDs array (which is just a one-dimensional array of IDs) based on the data type it stores, which today is hard-coded to uint64.
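The before/after formulas can be sketched as follows. Note that TILE_SIZE_BYTES = 128,000,000 is an assumption inferred from the tile sizes quoted later in this description (the real constant lives in the library source), and the function names here are made up for illustration:

```python
import numpy as np

# Assumed constant: 128,000,000 is consistent with the tile sizes quoted
# in this PR description, but the real value lives in the library source.
TILE_SIZE_BYTES = 128_000_000

def old_ids_tile_extent(vector_type, dimensions):
    # Old behavior: the IDs tile extent was driven by the *vector* dtype,
    # so tiny itemsizes (uint8/int8) produced huge tiles.
    return int(TILE_SIZE_BYTES / np.dtype(vector_type).itemsize / dimensions)

def new_ids_tile_extent(dimensions):
    # New behavior: driven by the dtype the IDs array actually stores,
    # which today is hard-coded to uint64.
    return int(TILE_SIZE_BYTES / np.dtype(np.uint64).itemsize / dimensions)

print(old_ids_tile_extent(np.uint8, 3))  # 42666666
print(new_ids_tile_extent(3))            # 5333333
```

With uint8 vectors and 3 dimensions, the old formula divides by an itemsize of 1 and yields a tile roughly 8x larger than the new uint64-based one.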

Testing Python changes

I ran this code before and after making these changes:

import os
import shutil

import numpy as np
from tiledb.vector_search import flat_index

def test_flat_index(tmp_path):
    uri = "/tmp/flat_index"
    if os.path.exists(uri):
        shutil.rmtree(uri)
    vector_type = np.dtype(np.uint8)
    index = flat_index.create(uri=uri, dimensions=3, vector_type=vector_type)

    update_vectors = np.empty([5], dtype=object)
    update_vectors[0] = np.array([0, 0, 0], dtype=vector_type)
    update_vectors[1] = np.array([1, 1, 1], dtype=vector_type)
    update_vectors[2] = np.array([2, 2, 2], dtype=vector_type)
    update_vectors[3] = np.array([3, 3, 3], dtype=vector_type)
    update_vectors[4] = np.array([4, 4, 4], dtype=vector_type)
    index.update_batch(vectors=update_vectors, external_ids=np.array([0, 1, 2, 3, 4]))
    # query_and_check is a helper defined elsewhere in the test suite
    query_and_check(index, np.array([[2, 2, 2]], dtype=np.float32), 3, {1, 2, 3})

    index = index.consolidate_updates()

Here we're using uint8 vectors with 3 dimensions.

Before these changes our shuffled_vector_ids fragment was 253 KB (because np.dtype(np.uint8).itemsize = 1, our tile size was 42,666,666 bytes = 40.69 MB):

[Screenshot (2024-07-05): shuffled_vector_ids fragment listing, before]

After these changes our shuffled_vector_ids fragment was 31 KB (because np.dtype(np.uint64).itemsize = 8, our tile size was 5,333,333 bytes = 5.0 MB):

[Screenshot (2024-07-05): shuffled_vector_ids fragment listing, after]

I also tested on siftsmall (which has float32 vectors with 128 dimensions):

def test_ingestion_fvec(tmp_path):
    # Fixtures (siftsmall_* URIs, INDEXES, INDEX_CLASSES, load_fvecs, etc.)
    # come from the surrounding test suite.
    vfs = tiledb.VFS()

    source_uri = siftsmall_inputs_file
    queries_uri = siftsmall_query_file
    gt_uri = siftsmall_groundtruth_file
    k = 100
    partitions = 100
    nqueries = 100
    nprobe = 20
    num_subspaces = 64
    queries = load_fvecs(queries_uri)
    gt_i, gt_d = get_groundtruth_ivec(gt_uri, k=k, nqueries=nqueries)

    for index_type, index_class in zip(INDEXES, INDEX_CLASSES):
        minimum_accuracy = (
            MINIMUM_ACCURACY_IVF_PQ if index_type == "IVF_PQ" else MINIMUM_ACCURACY
        )
        index_uri = f"/tmp/array_{index_type}"
        index = ingest(
            index_type=index_type,
            index_uri=index_uri,
            source_uri=source_uri,
            partitions=partitions,
            num_subspaces=num_subspaces,
        )
        continue  # remainder of the original test omitted for this size comparison

Before these changes our shuffled_vector_ids fragment was 12 KB (np.dtype(np.float32).itemsize = 4, so tile_size = 250,000 bytes = 0.23 MB):

[Screenshot (2024-07-05): siftsmall shuffled_vector_ids fragment listing, before]

After these changes our shuffled_vector_ids fragment was also 12 KB, even though we now divide by np.dtype(np.uint64).itemsize = 8, so tile_size = 125,000 bytes = 0.119 MB:

[Screenshot (2024-07-05): siftsmall shuffled_vector_ids fragment listing, after]

That said, the tile size seems too small when vectors have many dimensions, so perhaps there is a better approach than TILE_SIZE_BYTES / np.dtype(vector_type).itemsize / dimensions. If you have suggestions, please let me know.
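To make the concern concrete, here is a sketch of how the IDs tile shrinks as dimensions grow. The function name and the TILE_SIZE_BYTES = 128,000,000 constant are assumptions (the constant is inferred from the numbers above), and d = 1536 is just a hypothetical high-dimensional embedding size:

```python
import numpy as np

TILE_SIZE_BYTES = 128_000_000  # assumed; consistent with the numbers above

def ids_tile_extent(dimensions):
    # IDs tile extent after this change (IDs are stored as uint64).
    return int(TILE_SIZE_BYTES / np.dtype(np.uint64).itemsize / dimensions)

for d in (3, 128, 1536):
    extent = ids_tile_extent(d)
    # extent * 8 is the actual bytes of uint64 IDs held by one tile
    print(f"d={d}: extent={extent} rows, {extent * 8} bytes per tile")
```

At d = 128 a tile holds only about 1 MB of IDs, and at d = 1536 roughly 81 KB, which is why the tiles feel too small for high-dimensional vectors under the current formula.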

Testing C++ changes

Here we build a Vamana index on SIFT (1 million 128-dimensional float32 vectors):

def test_vamana_ingest():
    sift_base_uri = "/Users/parismorgan/Documents/tiledb/sift/sift_base.fvecs"
    data = load_fvecs(sift_base_uri)
    index_uri = "/Users/parismorgan/Documents/tiledb/vamana/vamana_index_new"
    ingest(
        index_type="VAMANA",
        index_uri=index_uri,
        input_vectors=data,
        config=tiledb.cloud.Config().dict(),
    )

We get the same fragment size with and without this change: the difference in the tile_size is only 2x (float32 to uint64) versus 8x (uint8 to uint64), the dataset is larger, and whatever data overflows beyond the last tile appears to be negligible.
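The 2x-versus-8x claim is just the ratio of dtype itemsizes; a quick sanity check (this snippet is illustrative only, not from the PR itself):

```python
import numpy as np

id_size = np.dtype(np.uint64).itemsize  # IDs are stored as uint64 (8 bytes)

# How much the tile extent shifts when the divisor changes from the
# vector dtype's itemsize to uint64's itemsize:
print(id_size // np.dtype(np.float32).itemsize)  # 2 -> the SIFT/Vamana case
print(id_size // np.dtype(np.uint8).itemsize)    # 8 -> the uint8 case above
```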

[Screenshot (2024-07-05): Vamana index fragment listing]

jparismorgan marked this pull request as ready for review on July 7, 2024 at 17:52.
@@ -196,7 +195,7 @@ def create(
     ids_array_rows_dim = tiledb.Dim(
         name="rows",
         domain=(0, MAX_INT32),
-        tile=tile_size,
+        tile=int(TILE_SIZE_BYTES / np.dtype(np.uint64).itemsize / dimensions),
Collaborator (reviewer):

I think this should be

int(TILE_SIZE_BYTES / np.dtype(np.uint64).itemsize)

since IDs are not vectors but just one value per vector.

However, the main point of having the same tile_size between the vectors and IDs arrays is that we always read these arrays with the same slicing queries, which mainly filter different IVF partitions. Having many more values inside ID tiles can save some disk space, but it might lead to suboptimal read filtering by always over-fetching data on the ID array reads.
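A rough sketch of the over-fetching concern, using entirely made-up tile extents and a simplified model in which a read fetches every whole tile a row range intersects:

```python
def tiles_touched(tile_extent, lo, hi):
    # Number of whole tiles intersected by the closed row range [lo, hi].
    return hi // tile_extent - lo // tile_extent + 1

# Hypothetical extents: the vectors array keeps a small tile, while the
# IDs array uses an 8x larger tile (i.e. not dividing by dimensions).
vec_tile = 125_000
ids_tile = 1_000_000

# A partition slice that fits exactly inside one vectors tile:
lo, hi = 0, 124_999

vec_rows_read = tiles_touched(vec_tile, lo, hi) * vec_tile
ids_rows_read = tiles_touched(ids_tile, lo, hi) * ids_tile
print(vec_rows_read, ids_rows_read)  # 125000 1000000 -> 8x over-fetch on IDs
```

Under this model, the same partition slice forces reading 8x more rows from the IDs array than from the vectors array, which is the read-filtering downside the comment describes.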

jparismorgan (Author):

Okay, awesome. We talked offline; I understand now and don't think we need to make this change.

2 participants