Fix bug where we did not set compression filters when creating TileDB Arrays in C++ #436

Merged
jparismorgan merged 6 commits into main from jparismorgan/filters on Jul 8, 2024

Conversation

jparismorgan
Collaborator

What

Currently, when we create TileDB arrays with create_vector() or create_matrix(), we pass in a compression filter but never apply it when creating the array. This leads to arrays (and indexes) that are far larger than they need to be. We fix that here, as sketched below.
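Roughly, the fix attaches the filter list to the attribute before the schema is used to create the array. A minimal sketch, assuming a ZSTD filter and illustrative dimension/attribute names (this is not the repo's actual create_vector() signature):

#include <tiledb/tiledb>

// Illustrative sketch of the fix: attach the compression filter list to
// the attribute before creating the array. The names and the ZSTD choice
// are assumptions for the example.
void create_vector_sketch(
    const tiledb::Context& ctx,
    const std::string& uri,
    tiledb_filter_type_t filter_type = TILEDB_FILTER_ZSTD) {
  tiledb::Domain domain(ctx);
  domain.add_dimension(
      tiledb::Dimension::create<int32_t>(ctx, "rows", {{0, 999}}, 100));

  tiledb::FilterList filter_list(ctx);
  filter_list.add_filter(tiledb::Filter(ctx, filter_type));

  auto attribute = tiledb::Attribute::create<float>(ctx, "values");
  // The bug: this call was missing, so the filter passed to
  // create_vector()/create_matrix() was silently dropped.
  attribute.set_filter_list(filter_list);

  tiledb::ArraySchema schema(ctx, TILEDB_DENSE);
  schema.set_domain(domain);
  schema.add_attribute(attribute);
  tiledb::Array::create(uri, schema);
}

The essential difference is the set_filter_list() call; without it, the attribute keeps its default (empty) filter list and the data is stored uncompressed.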

Results with small datasets

Before this change, when we created a Vamana index of 4 float vectors with 3 dimensions each, using this code:

TEST_CASE("write and load index with timestamps", "[api_vamana_index]") {
  auto ctx = tiledb::Context{};
  using feature_type_type = float;
  using id_type_type = uint64_t;
  auto feature_type = "float32";
  auto id_type = "uint64";
  // l_build and r_max_degree come from elsewhere in the test file; the
  // values here are representative so the snippet compiles on its own.
  uint32_t l_build = 100;
  uint32_t r_max_degree = 64;

  std::string index_uri = "/tmp/api_vamana_index_test";
  tiledb::VFS vfs(ctx);
  if (vfs.is_dir(index_uri)) {
    vfs.remove_dir(index_uri);
  }

  auto index = IndexVamana(std::make_optional<IndexOptions>(
      {{"feature_type", feature_type},
       {"id_type", id_type},
       {"l_build", std::to_string(l_build)},
       {"r_max_degree", std::to_string(r_max_degree)}}));

  auto training = ColMajorMatrixWithIds<feature_type_type, id_type_type>{
      {{1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4}}, {1, 2, 3, 4}};

  auto training_vector_array = FeatureVectorArray(training);
  index.train(training_vector_array);
  index.add(training_vector_array);
  index.write_index(ctx, index_uri, TemporalPolicy(TimeTravel, 99));
}
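As a hypothetical sanity check (not part of this PR), you can open one of the arrays the index writes and confirm its attributes now carry a non-empty filter list:

#include <tiledb/tiledb>
#include <iostream>

// Print how many filters each attribute of an array carries. After this
// fix the count should be non-zero for the compressed attributes.
void print_attribute_filters(
    const tiledb::Context& ctx, const std::string& array_uri) {
  tiledb::ArraySchema schema(ctx, array_uri);
  for (const auto& [name, attribute] : schema.attributes()) {
    std::cout << name << ": " << attribute.filter_list().nfilters()
              << " filter(s)\n";
  }
}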

We'd end up with a 314.5 MB index, where each individual array containing vectors was 89.5 MB:

[Screenshot: array and index sizes before the change]

After this change we end up with a 261 KB index, and an individual vector array takes up 66 KB:

[Screenshot: array and index sizes after the change]

Results on SIFT

The improvement is even greater with SIFT (1 million 128-dimensional float32 vectors). When we create a Vamana index with this code:

# load_fvecs and ingest come from TileDB-Vector-Search; the exact import
# paths below are assumptions, not verified against this repo's layout.
import tiledb.cloud
from tiledb.vector_search.ingestion import ingest
from tiledb.vector_search.utils import load_fvecs


def test_vamana_ingest():
    sift_base_uri = "/Users/parismorgan/Documents/tiledb/sift/sift_base.fvecs"
    data = load_fvecs(sift_base_uri)
    index_uri = "/Users/parismorgan/Documents/tiledb/vamana/vamana_index_new"
    ingest(
        index_type="VAMANA",
        index_uri=index_uri,
        input_vectors=data,
        config=tiledb.cloud.Config().dict(),
    )

Before we would end up with a 23.88 GB index where adjacency_ids alone takes up 15.55 GB:

[Screenshot: SIFT index sizes before the change]

After this change, our index is 761 MB and adjacency_ids takes up 281 MB:

[Screenshot: SIFT index sizes after the change]
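For scale, the raw SIFT vectors alone are 1,000,000 × 128 × 4 bytes ≈ 512 MB, so a 761 MB index is now on the order of the input data itself, whereas the 23.88 GB index before the fix was a roughly 31× blow-up.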

@jparismorgan jparismorgan marked this pull request as ready for review July 7, 2024 17:51
@jparismorgan jparismorgan merged commit a6b7b23 into main Jul 8, 2024
6 checks passed
@jparismorgan jparismorgan deleted the jparismorgan/filters branch July 8, 2024 08:43
cainamisir pushed a commit that referenced this pull request Jul 23, 2024