Optimize, simplify filter block building (fix regression) #12931


pdillinger
Contributor

Summary: This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in #12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys).

Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capability, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.)
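
For intuition, here is a minimal, hedged sketch of the hash-based de-duplication idea. The struct, the Hash64 stand-in, and the exact conditions are illustrative assumptions, not RocksDB's actual FilterBitsBuilder code; the point is simply that only two 64-bit hashes are cached between calls rather than copies of the previous key and prefix.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch only: de-duplicate a key and its "alt" (e.g. prefix)
// entry by caching the previous 64-bit hashes instead of copied keys.
struct DedupFilterBuilderSketch {
  std::vector<uint64_t> hash_entries;  // hashes destined for the filter
  uint64_t prev_key_hash = 0;
  uint64_t prev_alt_hash = 0;
  bool any_added = false;

  static uint64_t Hash64(const std::string& s) {
    return std::hash<std::string>{}(s);  // stand-in; RocksDB uses an XXH3-based hash
  }

  // Add a whole key plus an alternate entry (prefix) from the same key,
  // skipping entries whose hash matches the immediately preceding key/alt.
  void AddKeyAndAlt(const std::string& key, const std::string& alt) {
    const uint64_t key_hash = Hash64(key);
    const uint64_t alt_hash = Hash64(alt);
    if (!any_added || key_hash != prev_key_hash) {
      hash_entries.push_back(key_hash);
    }
    if (alt_hash != key_hash &&
        (!any_added || (alt_hash != prev_alt_hash && alt_hash != prev_key_hash))) {
      hash_entries.push_back(alt_hash);
    }
    // Overwrite prev_alt_hash even when alt_hash == key_hash; see the review
    // discussion below for why a later key's alt may still need to match it.
    prev_key_hash = key_hash;
    prev_alt_hash = alt_hash;
    any_added = true;
  }
};
```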

Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now.

The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more instruction-level parallelism by reducing unnecessary control dependencies.
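
As a rough sketch of why the split helps (hypothetical members and size heuristic, not the actual PartitionedFilterBlockBuilder logic): the decision is cheap and side-effect free, so the common no-cut path does not have to serialize behind the expensive cut.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of the Decide/Cut split named above.
class PartitionCutSketch {
 public:
  // Cheap, branch-predictable check with no side effects; callers can
  // evaluate it early without gating unrelated work (e.g. hashing the
  // next key) on the expensive cut below.
  bool DecideCutAFilterBlock() const {
    return pending_filter_bytes_ >= target_partition_bytes_;
  }

  // The expensive part, executed only when the decision says "cut".
  void CutAFilterBlock() {
    finished_partition_sizes_.push_back(pending_filter_bytes_);
    pending_filter_bytes_ = 0;
  }

  void AddEntry(std::size_t approx_bytes) { pending_filter_bytes_ += approx_bytes; }

 private:
  std::size_t pending_filter_bytes_ = 0;
  std::size_t target_partition_bytes_ = 4096;  // e.g. like metadata_block_size
  std::vector<std::size_t> finished_partition_sizes_;
};

// Caller pattern: decide first, cut only if needed.
//   if (b.DecideCutAFilterBlock()) { b.CutAFilterBlock(); }
//   b.AddEntry(approx_entry_bytes);
```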

Test Plan: existing tests (with some minor updates)

Also manually ported over the pre-broken regression test described in
#12870 and ran it (passed).

Performance:
Here we validate that an entire series of recent related PRs is a net improvement in aggregate. "Before" is with these PRs reverted: #12872 #12911 #12874 #12867 #12903 #12904. "After" includes this PR (and all of those, with base revision 16c21af). Simultaneous test script designed to maximally depend on SST construction efficiency:

```
(for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```

Showing average ops/sec of "after" vs. "before"

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```

As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key.

@facebook-github-bot

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang (Contributor) left a comment

LGTM! Thanks for this brilliant improvement. I just have some questions for my own learning.

```
alt_hash != prev_key_hash) {
  AddHash(alt_hash);
}
// Overwrite prev_alt_hash for cases like alt_hash == key_hash
```
jowlyzhang (Contributor):

nit: this overwrites prev_alt_hash regardless of whether alt_hash and key_hash are the same, right?

pdillinger (Contributor, Author):

Yes. It's possible for a prefix to equal the key it comes from. E.g.

```
foo,foo     // <- add "foo" as key but also as prev_alt
foo123,foo  // <- "foo" is still prev_alt
foo456,foo  // <- depends on prev_alt having been updated despite earlier matching of prev_key or key
```

```
// is extra state for de-duplicating successive `alt` entries, as well
// as successive `key` entries. And there is usually de-duplication between
// `key` and `alt` entries, such that it's typically OK for an `alt` without
// a corresponding `key` to use AddKey().
```
jowlyzhang (Contributor):

Just for my own learning: if an alt directly uses AddKey, it will be de-duplicated against the last element in the queue rather than the second-to-last element, so why is that OK for an alt without a corresponding key?

pdillinger (Contributor, Author):

I'm going to change this language to say it's OK to use AddKey on an "alt" as the first and/or last addition to a filter, which is what we need for solving the Seek correctness problem with partitioned filters. As the first or last entry, it doesn't interfere with the de-duplication chain of regular keys. :) Thanks!

@facebook-github-bot
@pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@pdillinger merged this pull request in f63428b.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Aug 16, 2024
facebook-github-bot pushed a commit that referenced this pull request Aug 16, 2024
Summary:
Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating the tail latencies and thrash conditions that can arise with the large metadata blocks (sometimes megabytes each) that come with large SST files. In general, the cost of partitioned metadata is more CPU in accesses (especially for filters, where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads.

However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed.

Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair: despite its proportionally smaller cost of keeping in block cache, most of the cost of a miss to re-load it (random IO) is not proportional to its size (similar latency etc. up to ~32KB).

## This change

Adds a new table option, decouple_partitioned_filters, that allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key in order to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder.
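
As a hedged usage sketch (the decouple_partitioned_filters option name is taken from this commit message; the surrounding BlockBasedTableOptions settings are just one plausible partitioned configuration, not a recommendation):

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch: partitioned index/filters with decoupled filter partition cuts.
rocksdb::Options MakeDecoupledPartitionedOptions() {
  rocksdb::BlockBasedTableOptions bbto;
  bbto.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  bbto.index_type = rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
  bbto.partition_filters = true;
  bbto.metadata_block_size = 4096;
  bbto.decouple_partitioned_filters = true;  // option described above
  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
  return options;
}
```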

This change includes refactoring to pass that previous key to the filter builder when available, with the filter builder caching the previous key itself when it is unavailable, such as during compression dictionary training and in some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases and possibly a small regression for some non-prefix cases (in both cases regardless of coupling), but still an overall improvement, especially combined with #12931.
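
A minimal sketch of that arrangement, with hypothetical names (not the actual BlockBasedTableBuilder / filter block builder interfaces): the caller supplies the previous key when it already tracks one, and the builder only pays for its own copy when it cannot.

```cpp
#include <string>

// Illustrative only: fall back to a locally cached previous key when the
// caller (which usually already tracks it) cannot supply one.
class PrevKeyPlumbingSketch {
 public:
  // prev_key == nullptr means the caller does not know the previous key,
  // e.g. the compression dictionary training path mentioned above.
  void Add(const std::string& key, const std::string* prev_key) {
    const std::string& prev = prev_key ? *prev_key : own_prev_copy_;
    MaybeCutPartitionBetween(prev, key);
    if (prev_key == nullptr) {
      own_prev_copy_ = key;  // only copy when the caller can't help
    }
  }

 private:
  void MaybeCutPartitionBetween(const std::string& /*prev*/,
                                const std::string& /*next*/) {
    // would choose a partition index separator between prev and next here
  }
  std::string own_prev_copy_;
};
```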

Suggested follow-up:
* Update confusing use of "last key" to refer to "previous key"
* Expand unit test coverage with parallel compression and dictionary training
* Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut.

Pull Request resolved: #12939

Test Plan:
unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature.

## SST write performance (CPU)

Using the same testing setup as in #12931, but with -decouple_partitioned_filters=1 in the "after" configuration; benchmarking shows the option itself makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR:
```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
923691 vs. 924851 (-0.13%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
921398 vs. 922973 (-0.17%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
902259 vs. 908756 (-0.71%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
917932 vs. 916901 (+0.11%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
912755 vs. 907298 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
899754 vs. 892433 (+0.82%)
```
I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations.

## Read performance

Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify the LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration using a larger metadata_block_size setting (8k instead of 4k). Basic setup:

```
(for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 | tee results-$CS; done)
```

And read ops/s results:

```CSV
Cache size MB,After/decoupled/4k,Before/4k,Before/8k
3,15593,15158,12826
6,16295,16693,14134
10,20427,20813,18459
20,27035,26836,27384
30,33250,31810,33846
60,35518,32585,35329
100,36612,31805,35292
300,35780,31492,35481
1000,34145,31551,35411
1100,35219,31380,34302
1200,35060,31037,34322
```

If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration.

Reviewed By: jowlyzhang

Differential Revision: D61376772

Pulled By: pdillinger

fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f