Fix bug where we did not set compression filters when creating TileDB Arrays in C++ #436

Merged
jparismorgan merged 6 commits into main from jparismorgan/filters on Jul 8, 2024

Conversation

jparismorgan
Collaborator

What

Currently, when we create TileDB arrays with create_vector() or create_matrix(), we pass in a compression filter but never apply it when creating the array. This leads to arrays (and indexes) that are far larger than they need to be. We fix that here, as sketched below.
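Roughly, the fix attaches the filter list to the attribute before the schema is used to create the array. A minimal sketch, assuming a ZSTD filter and illustrative dimension/attribute names (this is not the repo's actual create_vector() signature):

#include <tiledb/tiledb>

// Illustrative sketch of the fix: attach the compression filter list to
// the attribute before creating the array. The names and the ZSTD choice
// are assumptions for the example.
void create_vector_sketch(
    const tiledb::Context& ctx,
    const std::string& uri,
    tiledb_filter_type_t filter_type = TILEDB_FILTER_ZSTD) {
  tiledb::Domain domain(ctx);
  domain.add_dimension(
      tiledb::Dimension::create<int32_t>(ctx, "rows", {{0, 999}}, 100));

  tiledb::FilterList filter_list(ctx);
  filter_list.add_filter(tiledb::Filter(ctx, filter_type));

  auto attribute = tiledb::Attribute::create<float>(ctx, "values");
  // The bug: this call was missing, so the filter passed to
  // create_vector()/create_matrix() was silently dropped.
  attribute.set_filter_list(filter_list);

  tiledb::ArraySchema schema(ctx, TILEDB_DENSE);
  schema.set_domain(domain);
  schema.add_attribute(attribute);
  tiledb::Array::create(uri, schema);
}

The essential difference is the set_filter_list() call; without it, the attribute keeps its default (empty) filter list and the data is stored uncompressed.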

Results with small datasets

Before this change, when we created a Vamana index of 4 float vectors with 3 dimensions each, using this code:

TEST_CASE("write and load index with timestamps", "[api_vamana_index]") {
  auto ctx = tiledb::Context{};
  using feature_type_type = float;
  using id_type_type = uint64_t;
  auto feature_type = "float32";
  auto id_type = "uint64";
  // l_build and r_max_degree come from elsewhere in the test file; the
  // values here are representative so the snippet compiles on its own.
  uint32_t l_build = 100;
  uint32_t r_max_degree = 64;

  std::string index_uri = "/tmp/api_vamana_index_test";
  tiledb::VFS vfs(ctx);
  if (vfs.is_dir(index_uri)) {
    vfs.remove_dir(index_uri);
  }

  auto index = IndexVamana(std::make_optional<IndexOptions>(
      {{"feature_type", feature_type},
       {"id_type", id_type},
       {"l_build", std::to_string(l_build)},
       {"r_max_degree", std::to_string(r_max_degree)}}));

  auto training = ColMajorMatrixWithIds<feature_type_type, id_type_type>{
      {{1, 1, 1}, {2, 2, 2}, {3, 3, 3}, {4, 4, 4}}, {1, 2, 3, 4}};

  auto training_vector_array = FeatureVectorArray(training);
  index.train(training_vector_array);
  index.add(training_vector_array);
  index.write_index(ctx, index_uri, TemporalPolicy(TimeTravel, 99));
}
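As a hypothetical sanity check (not part of this PR), you can open one of the arrays the index writes and confirm its attributes now carry a non-empty filter list:

#include <tiledb/tiledb>
#include <iostream>

// Print how many filters each attribute of an array carries. After this
// fix the count should be non-zero for the compressed attributes.
void print_attribute_filters(
    const tiledb::Context& ctx, const std::string& array_uri) {
  tiledb::ArraySchema schema(ctx, array_uri);
  for (const auto& [name, attribute] : schema.attributes()) {
    std::cout << name << ": " << attribute.filter_list().nfilters()
              << " filter(s)\n";
  }
}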

We'd end up with a 314.5 MB index, where each individual array containing vectors was 89.5 MB:

[Screenshot: array and index sizes before the change]

After this change we end up with a 261 KB index, and an individual vector array takes up 66 KB:

[Screenshot: array and index sizes after the change]

Results on SIFT

The improvement is even greater with SIFT (1 million 128-dimensional float32 vectors). When we create a Vamana index with this code:

# load_fvecs and ingest come from TileDB-Vector-Search; the exact import
# paths below are assumptions, not verified against this repo's layout.
import tiledb.cloud
from tiledb.vector_search.ingestion import ingest
from tiledb.vector_search.utils import load_fvecs


def test_vamana_ingest():
    sift_base_uri = "/Users/parismorgan/Documents/tiledb/sift/sift_base.fvecs"
    data = load_fvecs(sift_base_uri)
    index_uri = "/Users/parismorgan/Documents/tiledb/vamana/vamana_index_new"
    ingest(
        index_type="VAMANA",
        index_uri=index_uri,
        input_vectors=data,
        config=tiledb.cloud.Config().dict(),
    )

Before we would end up with a 23.88 GB index where adjacency_ids alone takes up 15.55 GB:

[Screenshot: SIFT index sizes before the change]

After this change, our index is 761 MB and adjacency_ids takes up 281 MB:

[Screenshot: SIFT index sizes after the change]
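For scale, the raw SIFT vectors alone are 1,000,000 × 128 × 4 bytes ≈ 512 MB, so a 761 MB index is now on the order of the input data itself, whereas the 23.88 GB index before the fix was a roughly 31× blow-up.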

@jparismorgan jparismorgan marked this pull request as ready for review July 7, 2024 17:51
@jparismorgan jparismorgan merged commit a6b7b23 into main Jul 8, 2024
6 checks passed
@jparismorgan jparismorgan deleted the jparismorgan/filters branch July 8, 2024 08:43
cainamisir pushed a commit that referenced this pull request Jul 23, 2024