
Refactor IndexBuilder::AddIndexEntry #12867

Closed (wants to merge 3 commits)

Conversation

pdillinger (Contributor):

Summary: Something I am working on is going to expand usage of BlockBasedTableBuilder::Rep::last_key, but the existing code contract for IndexBuilder::AddIndexEntry makes that difficult because it modifies its last_key parameter to be the separator value recorded in the index, often something between the two boundary keys.

This change primarily changes the contract of that function and related functions to separate function inputs and outputs, without sacrificing efficiency. For efficiency, a reusable scratch string buffer is provided by the caller, which the callee can use (or not) in returning a result Slice. That should yield a performance improvement as we are reusing a buffer for keys rather than copying into a new one each time in the FindShort* functions, without any additional string copies or conditional branches.
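The new contract can be pictured with a simplified sketch. The parameter names below mirror the PR (`last_key_in_current_block`, `first_key_in_next_block`, `separator_scratch`), but the body is an illustrative stand-in for FindShortestSeparator-style shortening, not RocksDB's actual implementation, and `std::string_view` stands in for `rocksdb::Slice`:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

// Sketch only: the callee never modifies its inputs. If it can compute a
// shorter separator, it writes it into the caller-owned scratch buffer and
// returns a view of the scratch; otherwise it returns a view of the input.
std::string_view MakeSeparator(std::string_view last_key_in_current_block,
                               const std::string_view* first_key_in_next_block,
                               std::string* separator_scratch) {
  if (first_key_in_next_block == nullptr) {
    return last_key_in_current_block;  // last block: nothing to separate
  }
  // Find the first byte position where the two boundary keys differ.
  size_t limit = std::min(last_key_in_current_block.size(),
                          first_key_in_next_block->size());
  size_t i = 0;
  while (i < limit &&
         last_key_in_current_block[i] == (*first_key_in_next_block)[i]) {
    ++i;
  }
  if (i < limit) {
    unsigned char b = static_cast<unsigned char>(last_key_in_current_block[i]);
    unsigned char next_b =
        static_cast<unsigned char>((*first_key_in_next_block)[i]);
    if (b < 0xff && b + 1 < next_b) {
      // Reuse the caller's scratch buffer instead of allocating a fresh
      // string: copy the shared prefix, then bump the diverging byte.
      separator_scratch->assign(last_key_in_current_block.data(), i);
      separator_scratch->push_back(static_cast<char>(b + 1));
      return *separator_scratch;
    }
  }
  return last_key_in_current_block;  // no shorter separator found
}
```

Because the scratch buffer is owned and reused by the caller, each call avoids a fresh allocation, and the inputs stay valid for the caller to use again (e.g. as `last_key`).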

Additional improvements in PartitionedIndexBuilder specifically:

  • Reduce string copies by eliminating sub_index_last_key_ and instead tracking the key for the next partition in a placeholder Entry.
  • Simplify code and improve code quality by changing sub_index_builder_ to unique_ptr.
  • Eliminate unnecessary NewFlushBlockPolicy call/object.

Test Plan: existing tests, crash test. Will validate performance along with the change this is setting up.

@facebook-github-bot (Contributor):

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```cpp
    first_key_in_next_block, block_handle);
auto sep = sub_index_builder_->AddIndexEntry(
    last_key_in_current_block, first_key_in_next_block, block_handle,
    separator_scratch);
if (!seperator_is_key_plus_seq_ &&
    sub_index_builder_->seperator_is_key_plus_seq_) {
  // then we need to apply it to all sub-index builders and reset
  // flush_policy to point to Block Builder of sub_index_builder_ that store
```
Review comment (Contributor):
Comment needs to be updated.

```cpp
// To allow further optimization, we provide `last_key_in_current_block` and
// `first_key_in_next_block`, based on which the specific implementation can
// determine the best index key to be used for the index block.
// Called before the OnKeyAdded() call for first_key_in_next_block.
// @last_key_in_current_block: this parameter maybe overridden with the value
// "substitute key".
// @last_key_in_current_block: TODO lifetime details
```
Review comment (Contributor):
Fix the TODO?

@facebook-github-bot (Contributor):

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor):

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@pdillinger merged this pull request in f456a72.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Aug 12, 2024
Summary: This is in part a refactoring / simplification to set up for
"decoupled" partitioned filters and in part to fix an intentional
regression for a correctness fix in facebook#12872. Basically, we are taking out
some complexity of the filter block builders, and pushing part of it
(simultaneous de-duplication of prefixes and whole keys) into the filter
bits builders, where it is more efficient by operating on hashes (rather
than copied keys).

Previously, the FullFilterBlockBuilder had a somewhat fragile and
confusing set of conditions under which it would keep a copy of the most
recent prefix and most recent whole key, along with some other state
that is essentially redundant. Now we just track (always) the previous
prefix in the PartitionedFilterBlockBuilder, to deal with the boundary
prefix Seek filtering problem. (Btw, the next PR will optimize this
away since BlockBasedTableReader already tracks the previous key.)
And to deal with the problem of de-duplicating both whole keys and
prefixes going into a single filter, we add a new function to
FilterBitsBuilder that has that extra de-duplication capability, which
is relatively efficient because we only have to cache an extra 64-bit
hash, not a copied key or prefix. (The API of this new function is
somewhat awkward to avoid a small CPU regression in some cases.)
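The hash-caching idea can be sketched as a toy. This is illustrative only, not RocksDB's FilterBitsBuilder API: FNV-1a stands in for the real hash function, and the class and method names are invented. The point is that de-duplicating an adjacent repeat costs a 64-bit compare against a cached hash, not a key copy and byte-wise compare:

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// FNV-1a, standing in for RocksDB's real 64-bit hash.
static uint64_t Hash64(std::string_view s) {
  uint64_t h = 1469598103934665603ULL;  // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// Toy builder: collects entry hashes, skipping an entry whose hash matches
// the immediately preceding one (e.g. the same prefix produced by
// consecutive keys). Only the previous 64-bit hash is cached, never a
// copied key or prefix.
class HashDedupBuilder {
 public:
  void AddEntry(std::string_view entry) {
    uint64_t h = Hash64(entry);
    if (has_last_ && h == last_hash_) {
      return;  // adjacent duplicate: nothing to add
    }
    hashes_.push_back(h);
    last_hash_ = h;
    has_last_ = true;
  }
  size_t NumEntries() const { return hashes_.size(); }

 private:
  std::vector<uint64_t> hashes_;
  uint64_t last_hash_ = 0;
  bool has_last_ = false;
};
```

As in the PR's description, ignoring hash collisions is acceptable here because a duplicate entry only wastes a probe's worth of filter space; it never causes a false negative.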

Also previously, there was awkward logic split between
FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal
with some things specific to partitioning. And confusing names like Add
vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now.

The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock
into DecideCutAFilterBlock and CutAFilterBlock is to address what would
have been a slight performance regression in some cases. The split
allows for more instruction-level parallelism by reducing unnecessary
control dependencies.
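The shape of that split can be illustrated with a toy. The function names echo the commit, but the bodies are invented for illustration: the decision is a pure predicate over cheap state, so the common no-cut path carries no data dependency on the mutations performed by the cut:

```cpp
#include <cstddef>
#include <vector>

// Pure predicate, no side effects: cheap to evaluate and easy for the
// compiler/CPU to overlap with surrounding work.
bool DecideCutAFilterBlock(size_t keys_in_partition, size_t target_keys) {
  return keys_in_partition >= target_keys;
}

// The expensive/mutating part runs only when the decision says so.
void CutAFilterBlock(std::vector<size_t>* finished_partitions,
                     size_t* keys_in_partition) {
  finished_partitions->push_back(*keys_in_partition);
  *keys_in_partition = 0;
}

// Caller loop: decide first, act only when needed. Returns the number of
// keys left in the final, unfinished partition.
size_t BuildPartitions(size_t num_keys, size_t target_keys,
                       std::vector<size_t>* out) {
  size_t in_partition = 0;
  for (size_t i = 0; i < num_keys; ++i) {
    ++in_partition;
    if (DecideCutAFilterBlock(in_partition, target_keys)) {
      CutAFilterBlock(out, &in_partition);
    }
  }
  return in_partition;
}
```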

Test Plan: existing tests (with some minor updates)

Also manually ported over the pre-broken regression test described in
 facebook#12870 and ran it (passed).

Performance:
Here we validate that an entire series of recent related PRs are a net
improvement in aggregate. "Before" is with these PRs reverted: facebook#12872
 facebook#12911 facebook#12874 facebook#12867 facebook#12903 facebook#12904. "After" includes this PR (and all
of those, with base revision 16c21af). Simultaneous test script designed
to maximally depend on SST construction efficiency:

```
for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```

Showing average ops/sec of "after" vs. "before"

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```

As one would predict, the most improvement is seen in cases where we
have optimized away copying the whole key.
facebook-github-bot pushed a commit that referenced this pull request Aug 14, 2024
Pull Request resolved: #12931

Reviewed By: jowlyzhang

Differential Revision: D61138271

Pulled By: pdillinger

fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043