
Refactor IndexBuilder::AddIndexEntry #12867

Closed (wants to merge 3 commits)

Conversation

pdillinger (Contributor):

Summary: Something I am working on is going to expand usage of BlockBasedTableBuilder::Rep::last_key, but the existing code contract for IndexBuilder::AddIndexEntry makes that difficult because it modifies its last_key parameter to be the separator value recorded in the index, often something between the two boundary keys.

This change primarily changes the contract of that function and related functions to separate function inputs and outputs, without sacrificing efficiency. For efficiency, a reusable scratch string buffer is provided by the caller, which the callee can use (or not) in returning a result Slice. That should yield a performance improvement as we are reusing a buffer for keys rather than copying into a new one each time in the FindShort* functions, without any additional string copies or conditional branches.
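The new contract can be pictured with a simplified sketch. The parameter names below mirror the PR (`last_key_in_current_block`, `first_key_in_next_block`, `separator_scratch`), but the body is an illustrative stand-in for FindShortestSeparator-style shortening, not RocksDB's actual implementation, and `std::string_view` stands in for `rocksdb::Slice`:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

// Sketch only: the callee never modifies its inputs. If it can compute a
// shorter separator, it writes it into the caller-owned scratch buffer and
// returns a view of the scratch; otherwise it returns a view of the input.
std::string_view MakeSeparator(std::string_view last_key_in_current_block,
                               const std::string_view* first_key_in_next_block,
                               std::string* separator_scratch) {
  if (first_key_in_next_block == nullptr) {
    return last_key_in_current_block;  // last block: nothing to separate
  }
  // Find the first byte position where the two boundary keys differ.
  size_t limit = std::min(last_key_in_current_block.size(),
                          first_key_in_next_block->size());
  size_t i = 0;
  while (i < limit &&
         last_key_in_current_block[i] == (*first_key_in_next_block)[i]) {
    ++i;
  }
  if (i < limit) {
    unsigned char b = static_cast<unsigned char>(last_key_in_current_block[i]);
    unsigned char next_b =
        static_cast<unsigned char>((*first_key_in_next_block)[i]);
    if (b < 0xff && b + 1 < next_b) {
      // Reuse the caller's scratch buffer instead of allocating a fresh
      // string: copy the shared prefix, then bump the diverging byte.
      separator_scratch->assign(last_key_in_current_block.data(), i);
      separator_scratch->push_back(static_cast<char>(b + 1));
      return *separator_scratch;
    }
  }
  return last_key_in_current_block;  // no shorter separator found
}
```

Because the scratch buffer is owned and reused by the caller, each call avoids a fresh allocation, and the inputs stay valid for the caller to use again (e.g. as `last_key`).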

Additional improvements in PartitionedIndexBuilder specifically:

  • Reduce string copies by eliminating sub_index_last_key_ and instead tracking the key for the next partition in a placeholder Entry.
  • Simplify code and improve code quality by changing sub_index_builder_ to unique_ptr.
  • Eliminate unnecessary NewFlushBlockPolicy call/object.

Test Plan: existing tests, crash test. Will validate performance along with the change this is setting up.

@facebook-github-bot (Contributor):

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```cpp
    first_key_in_next_block, block_handle);
auto sep = sub_index_builder_->AddIndexEntry(
    last_key_in_current_block, first_key_in_next_block, block_handle,
    separator_scratch);
if (!seperator_is_key_plus_seq_ &&
    sub_index_builder_->seperator_is_key_plus_seq_) {
  // then we need to apply it to all sub-index builders and reset
  // flush_policy to point to Block Builder of sub_index_builder_ that store
```
Review comment (Contributor):
Comment needs to be updated.

```cpp
// To allow further optimization, we provide `last_key_in_current_block` and
// `first_key_in_next_block`, based on which the specific implementation can
// determine the best index key to be used for the index block.
// Called before the OnKeyAdded() call for first_key_in_next_block.
// @last_key_in_current_block: this parameter maybe overridden with the value
// "substitute key".
// @last_key_in_current_block: TODO lifetime details
```
Review comment (Contributor):
Fix the TODO?

@facebook-github-bot (Contributor):

@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor):

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@pdillinger merged this pull request in f456a72.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Aug 12, 2024
Summary: This is in part a refactoring / simplification to set up for
"decoupled" partitioned filters and in part to fix an intentional
regression for a correctness fix in facebook#12872. Basically, we are taking out
some complexity of the filter block builders, and pushing part of it
(simultaneous de-duplication of prefixes and whole keys) into the filter
bits builders, where it is more efficient by operating on hashes (rather
than copied keys).

Previously, the FullFilterBlockBuilder had a somewhat fragile and
confusing set of conditions under which it would keep a copy of the most
recent prefix and most recent whole key, along with some other state
that is essentially redundant. Now we just track (always) the previous
prefix in the PartitionedFilterBlockBuilder, to deal with the boundary
prefix Seek filtering problem. (Btw, the next PR will optimize this
away since BlockBasedTableReader already tracks the previous key.)
And to deal with the problem of de-duplicating both whole keys and
prefixes going into a single filter, we add a new function to
FilterBitsBuilder that has that extra de-duplication capability, which
is relatively efficient because we only have to cache an extra 64-bit
hash, not a copied key or prefix. (The API of this new function is
somewhat awkward to avoid a small CPU regression in some cases.)
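The hash-caching idea can be sketched as a toy. This is illustrative only, not RocksDB's FilterBitsBuilder API: FNV-1a stands in for the real hash function, and the class and method names are invented. The point is that de-duplicating an adjacent repeat costs a 64-bit compare against a cached hash, not a key copy and byte-wise compare:

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// FNV-1a, standing in for RocksDB's real 64-bit hash.
static uint64_t Hash64(std::string_view s) {
  uint64_t h = 1469598103934665603ULL;  // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// Toy builder: collects entry hashes, skipping an entry whose hash matches
// the immediately preceding one (e.g. the same prefix produced by
// consecutive keys). Only the previous 64-bit hash is cached, never a
// copied key or prefix.
class HashDedupBuilder {
 public:
  void AddEntry(std::string_view entry) {
    uint64_t h = Hash64(entry);
    if (has_last_ && h == last_hash_) {
      return;  // adjacent duplicate: nothing to add
    }
    hashes_.push_back(h);
    last_hash_ = h;
    has_last_ = true;
  }
  size_t NumEntries() const { return hashes_.size(); }

 private:
  std::vector<uint64_t> hashes_;
  uint64_t last_hash_ = 0;
  bool has_last_ = false;
};
```

As in the PR's description, ignoring hash collisions is acceptable here because a duplicate entry only wastes a probe's worth of filter space; it never causes a false negative.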

Also previously, there was awkward logic split between
FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal
with some things specific to partitioning. And confusing names like Add
vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now.

The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock
into DecideCutAFilterBlock and CutAFilterBlock is to address what would
have been a slight performance regression in some cases. The split
allows for more instruction-level parallelism by reducing unnecessary
control dependencies.
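The shape of that split can be illustrated with a toy. The function names echo the commit, but the bodies are invented for illustration: the decision is a pure predicate over cheap state, so the common no-cut path carries no data dependency on the mutations performed by the cut:

```cpp
#include <cstddef>
#include <vector>

// Pure predicate, no side effects: cheap to evaluate and easy for the
// compiler/CPU to overlap with surrounding work.
bool DecideCutAFilterBlock(size_t keys_in_partition, size_t target_keys) {
  return keys_in_partition >= target_keys;
}

// The expensive/mutating part runs only when the decision says so.
void CutAFilterBlock(std::vector<size_t>* finished_partitions,
                     size_t* keys_in_partition) {
  finished_partitions->push_back(*keys_in_partition);
  *keys_in_partition = 0;
}

// Caller loop: decide first, act only when needed. Returns the number of
// keys left in the final, unfinished partition.
size_t BuildPartitions(size_t num_keys, size_t target_keys,
                       std::vector<size_t>* out) {
  size_t in_partition = 0;
  for (size_t i = 0; i < num_keys; ++i) {
    ++in_partition;
    if (DecideCutAFilterBlock(in_partition, target_keys)) {
      CutAFilterBlock(out, &in_partition);
    }
  }
  return in_partition;
}
```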

Test Plan: existing tests (with some minor updates)

Also manually ported over the pre-broken regression test described in
 facebook#12870 and ran it (passed).

Performance:
Here we validate that an entire series of recent related PRs are a net
improvement in aggregate. "Before" is with these PRs reverted: facebook#12872
 facebook#12911 facebook#12874 facebook#12867 facebook#12903 facebook#12904. "After" includes this PR (and all
of those, with base revision 16c21af). Simultaneous test script designed
to maximally depend on SST construction efficiency:

```
for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```

Showing average ops/sec of "after" vs. "before"

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```

As one would predict, the most improvement is seen in cases where we
have optimized away copying the whole key.
facebook-github-bot pushed a commit that referenced this pull request Aug 14, 2024
Pull Request resolved: #12931

Reviewed By: jowlyzhang

Differential Revision: D61138271

Pulled By: pdillinger

fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043