
Update tile_size calculations to be based on the array feature type #435

Closed (wanted to merge 2 commits)

Conversation

jparismorgan (Collaborator)

What

Previously, in FLAT and IVF_FLAT indexes, we set the IDs array tile_size = TILE_SIZE_BYTES / np.dtype(vector_type).itemsize / dimensions. This resulted in very large tile sizes when using uint8 or int8 feature vectors. To fix this, we now calculate the tile size for the IDs array (which is just a one-dimensional array of IDs) based on the data type it stores, which today is hard-coded to uint64.
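The before/after formulas can be sketched as follows. Note that TILE_SIZE_BYTES = 128,000,000 is an assumption inferred from the tile sizes quoted later in this description (the real constant lives in the library source), and the function names here are made up for illustration:

```python
import numpy as np

# Assumed constant: 128,000,000 is consistent with the tile sizes quoted
# in this PR description, but the real value lives in the library source.
TILE_SIZE_BYTES = 128_000_000

def old_ids_tile_extent(vector_type, dimensions):
    # Old behavior: the IDs tile extent was driven by the *vector* dtype,
    # so tiny itemsizes (uint8/int8) produced huge tiles.
    return int(TILE_SIZE_BYTES / np.dtype(vector_type).itemsize / dimensions)

def new_ids_tile_extent(dimensions):
    # New behavior: driven by the dtype the IDs array actually stores,
    # which today is hard-coded to uint64.
    return int(TILE_SIZE_BYTES / np.dtype(np.uint64).itemsize / dimensions)

print(old_ids_tile_extent(np.uint8, 3))  # 42666666
print(new_ids_tile_extent(3))            # 5333333
```

With uint8 vectors and 3 dimensions, the old formula divides by an itemsize of 1 and yields a tile roughly 8x larger than the new uint64-based one.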

Testing Python changes

I ran this code before and after making these changes:

import os
import shutil

import numpy as np
from tiledb.vector_search import flat_index

def test_flat_index(tmp_path):
    uri = "/tmp/flat_index"
    if os.path.exists(uri):
        shutil.rmtree(uri)
    vector_type = np.dtype(np.uint8)
    index = flat_index.create(uri=uri, dimensions=3, vector_type=vector_type)

    update_vectors = np.empty([5], dtype=object)
    update_vectors[0] = np.array([0, 0, 0], dtype=vector_type)
    update_vectors[1] = np.array([1, 1, 1], dtype=vector_type)
    update_vectors[2] = np.array([2, 2, 2], dtype=vector_type)
    update_vectors[3] = np.array([3, 3, 3], dtype=vector_type)
    update_vectors[4] = np.array([4, 4, 4], dtype=vector_type)
    index.update_batch(vectors=update_vectors, external_ids=np.array([0, 1, 2, 3, 4]))
    # query_and_check is a helper defined elsewhere in the test suite
    query_and_check(index, np.array([[2, 2, 2]], dtype=np.float32), 3, {1, 2, 3})

    index = index.consolidate_updates()

Here we're using uint8 vectors with 3 dimensions.

Before these changes our shuffled_vector_ids fragment was 253 KB (because np.dtype(np.uint8).itemsize = 1, our tile size was 42,666,666 bytes = 40.69 MB):

[Screenshot (2024-07-05): shuffled_vector_ids fragment listing, before]

After these changes our shuffled_vector_ids fragment was 31 KB (because np.dtype(np.uint64).itemsize = 8, our tile size was 5,333,333 bytes = 5.0 MB):

[Screenshot (2024-07-05): shuffled_vector_ids fragment listing, after]

I also tested on siftsmall (which has float32 vectors with 128 dimensions):

def test_ingestion_fvec(tmp_path):
    # Fixtures (siftsmall_* URIs, INDEXES, INDEX_CLASSES, load_fvecs, etc.)
    # come from the surrounding test suite.
    vfs = tiledb.VFS()

    source_uri = siftsmall_inputs_file
    queries_uri = siftsmall_query_file
    gt_uri = siftsmall_groundtruth_file
    k = 100
    partitions = 100
    nqueries = 100
    nprobe = 20
    num_subspaces = 64
    queries = load_fvecs(queries_uri)
    gt_i, gt_d = get_groundtruth_ivec(gt_uri, k=k, nqueries=nqueries)

    for index_type, index_class in zip(INDEXES, INDEX_CLASSES):
        minimum_accuracy = (
            MINIMUM_ACCURACY_IVF_PQ if index_type == "IVF_PQ" else MINIMUM_ACCURACY
        )
        index_uri = f"/tmp/array_{index_type}"
        index = ingest(
            index_type=index_type,
            index_uri=index_uri,
            source_uri=source_uri,
            partitions=partitions,
            num_subspaces=num_subspaces,
        )
        continue  # remainder of the original test omitted for this size comparison

Before these changes our shuffled_vector_ids fragment was 12 KB (np.dtype(np.float32).itemsize = 4, so tile_size = 250,000 bytes = 0.23 MB):

[Screenshot (2024-07-05): siftsmall shuffled_vector_ids fragment listing, before]

After these changes our shuffled_vector_ids fragment was also 12 KB, even though we now divide by np.dtype(np.uint64).itemsize = 8, so tile_size = 125,000 bytes = 0.119 MB:

[Screenshot (2024-07-05): siftsmall shuffled_vector_ids fragment listing, after]

That said, the tile size seems too small when vectors have many dimensions, so perhaps there is a better approach than TILE_SIZE_BYTES / np.dtype(vector_type).itemsize / dimensions. If you have suggestions, please let me know.
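To make the concern concrete, here is a sketch of how the IDs tile shrinks as dimensions grow. The function name and the TILE_SIZE_BYTES = 128,000,000 constant are assumptions (the constant is inferred from the numbers above), and d = 1536 is just a hypothetical high-dimensional embedding size:

```python
import numpy as np

TILE_SIZE_BYTES = 128_000_000  # assumed; consistent with the numbers above

def ids_tile_extent(dimensions):
    # IDs tile extent after this change (IDs are stored as uint64).
    return int(TILE_SIZE_BYTES / np.dtype(np.uint64).itemsize / dimensions)

for d in (3, 128, 1536):
    extent = ids_tile_extent(d)
    # extent * 8 is the actual bytes of uint64 IDs held by one tile
    print(f"d={d}: extent={extent} rows, {extent * 8} bytes per tile")
```

At d = 128 a tile holds only about 1 MB of IDs, and at d = 1536 roughly 81 KB, which is why the tiles feel too small for high-dimensional vectors under the current formula.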

Testing C++ changes

Here we build a Vamana index on SIFT (1 million 128-dimensional float32 vectors):

def test_vamana_ingest():
    sift_base_uri = "/Users/parismorgan/Documents/tiledb/sift/sift_base.fvecs"
    data = load_fvecs(sift_base_uri)
    index_uri = "/Users/parismorgan/Documents/tiledb/vamana/vamana_index_new"
    ingest(
        index_type="VAMANA",
        index_uri=index_uri,
        input_vectors=data,
        config=tiledb.cloud.Config().dict(),
    )

We get the same fragment size with and without this change: the difference in the tile_size is only 2x (float32 to uint64) versus 8x (uint8 to uint64), the dataset is larger, and whatever data overflows beyond the last tile appears to be negligible.
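The 2x-versus-8x claim is just the ratio of dtype itemsizes; a quick sanity check (this snippet is illustrative only, not from the PR itself):

```python
import numpy as np

id_size = np.dtype(np.uint64).itemsize  # IDs are stored as uint64 (8 bytes)

# How much the tile extent shifts when the divisor changes from the
# vector dtype's itemsize to uint64's itemsize:
print(id_size // np.dtype(np.float32).itemsize)  # 2 -> the SIFT/Vamana case
print(id_size // np.dtype(np.uint8).itemsize)    # 8 -> the uint8 case above
```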

[Screenshot (2024-07-05): Vamana index fragment listing]

jparismorgan marked this pull request as ready for review on July 7, 2024 at 17:52.
@@ -196,7 +195,7 @@ def create(
     ids_array_rows_dim = tiledb.Dim(
         name="rows",
         domain=(0, MAX_INT32),
-        tile=tile_size,
+        tile=int(TILE_SIZE_BYTES / np.dtype(np.uint64).itemsize / dimensions),
Collaborator (reviewer):

I think this should be

int(TILE_SIZE_BYTES / np.dtype(np.uint64).itemsize)

since IDs are not vectors but just one value per vector.

However, the main point of having the same tile_size between the vectors and IDs arrays is that we always read these arrays with the same slicing queries, which mainly filter different IVF partitions. Having many more values inside ID tiles can save some disk space, but it might lead to suboptimal read filtering by always over-fetching data on the ID array reads.
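A rough sketch of the over-fetching concern, using entirely made-up tile extents and a simplified model in which a read fetches every whole tile a row range intersects:

```python
def tiles_touched(tile_extent, lo, hi):
    # Number of whole tiles intersected by the closed row range [lo, hi].
    return hi // tile_extent - lo // tile_extent + 1

# Hypothetical extents: the vectors array keeps a small tile, while the
# IDs array uses an 8x larger tile (i.e. not dividing by dimensions).
vec_tile = 125_000
ids_tile = 1_000_000

# A partition slice that fits exactly inside one vectors tile:
lo, hi = 0, 124_999

vec_rows_read = tiles_touched(vec_tile, lo, hi) * vec_tile
ids_rows_read = tiles_touched(ids_tile, lo, hi) * ids_tile
print(vec_rows_read, ids_rows_read)  # 125000 1000000 -> 8x over-fetch on IDs
```

Under this model, the same partition slice forces reading 8x more rows from the IDs array than from the vectors array, which is the read-filtering downside the comment describes.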

jparismorgan (Author):

Okay, awesome. We talked offline; I understand now and don't think we need to make this change.

2 participants