Handle injected write error after successful WAL write in crash test + misc #12838

hx235 · 2024-07-04T16:48:45Z

Context/Summary:
We recently discovered the following false positive in our crash test:
(1) PUT() writes the k/v to the WAL but then fails in ApplyWALToManifest(). The k/v is in the WAL.
(2) The current stress test logic rolls back the expected state of such a k/v because PUT() failed.
(3) If the DB crashes before recovery finishes and then reopens, the WAL is replayed, so the k/v is in the DB while its expected state has been rolled back.

We decided to leave such expected states pending until a loop-write of the same key succeeds.
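To make the false positive concrete, here is a minimal toy model of steps (1)-(3). None of these types exist in RocksDB or db_stress; `ToyDb`, `ToyExpected` and `VerificationPassesAfterCrash` are hypothetical names used only for illustration. It is a sketch under those assumptions, not the actual ExpectedState logic.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: these types are illustrative, not RocksDB or db_stress code.
enum class WriteOutcome { kOk, kFailedBeforeWal, kFailedAfterWal };

struct ToyDb {
  std::vector<std::pair<std::string, std::string>> wal;  // survives a crash
  std::map<std::string, std::string> visible;            // readable state

  WriteOutcome Put(const std::string& k, const std::string& v,
                   WriteOutcome injected) {
    if (injected == WriteOutcome::kFailedBeforeWal) return injected;
    wal.emplace_back(k, v);  // the k/v reaches the WAL ...
    if (injected == WriteOutcome::kFailedAfterWal) return injected;  // ... then a later step fails
    visible[k] = v;
    return WriteOutcome::kOk;
  }

  // Crash + reopen: every WAL record is replayed into the readable state.
  void CrashAndReopen() {
    for (const auto& rec : wal) visible[rec.first] = rec.second;
  }

  // Returns nullptr when the key is not found.
  const std::string* Get(const std::string& k) const {
    auto it = visible.find(k);
    return it == visible.end() ? nullptr : &it->second;
  }
};

struct ToyExpected {
  std::map<std::string, std::string> committed;
  std::map<std::string, std::string> pending;

  void Prepare(const std::string& k, const std::string& v) { pending[k] = v; }
  void Commit(const std::string& k) {
    committed[k] = pending[k];
    pending.erase(k);
  }
  void Rollback(const std::string& k) { pending.erase(k); }

  // A key with a pending write may hold either the old or the pending value.
  bool Accepts(const std::string& k, const std::string* actual) const {
    auto p = pending.find(k);
    if (p != pending.end() && actual != nullptr && *actual == p->second) {
      return true;
    }
    auto c = committed.find(k);
    if (c == committed.end()) return actual == nullptr;
    return actual != nullptr && *actual == c->second;
  }
};

// Replays the scenario and reports whether post-crash verification passes.
bool VerificationPassesAfterCrash(bool leave_pending_after_wal_write) {
  ToyDb db;
  ToyExpected expected;
  expected.Prepare("k", "v");
  const WriteOutcome out = db.Put("k", "v", WriteOutcome::kFailedAfterWal);
  if (out == WriteOutcome::kOk) {
    expected.Commit("k");
  } else if (out == WriteOutcome::kFailedAfterWal &&
             leave_pending_after_wal_write) {
    // Leave the expected state pending: the WAL may still deliver the k/v.
  } else {
    expected.Rollback("k");
  }
  db.CrashAndReopen();
  return expected.Accepts("k", db.Get("k"));
}
```

With the rollback behavior (`false`) the toy verification fails after the crash, while leaving the state pending (`true`) lets it pass.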

Bonus: Having realized that a failed write to the manifest can also fail the user write, which poses a problem similar to #12797, I decided to disable fault injection on user writes per thread (instead of globally) when tracing is needed for prefix recovery. This also includes some refactoring.

Test:
Rehearsal CI
Run the command below (with sync_fault_injection set to 1 and to 0, to verify ExpectedState behavior) for a while to ensure crash recovery validation works fine

python3 tools/db_crashtest.py --simple blackbox --interval=30 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=20 --delrangepercent=20 --destroy_db_initially=0 --detect_filter_construct_corruption=0 
--disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 
--memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=8 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=8 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=10 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 
--use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=8 --writepercent=40

facebook-github-bot · 2024-07-04T16:49:05Z

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

hx235 · 2024-07-04T17:39:19Z

The failure is irrelevant to this PR.

cbi42

Would it work if we just keep the expected state at pending state? Ideally only when injected error is from writing to MANIFEST. If the failure is from writing to WAL, we can just treat it as failed.

cbi42 · 2024-07-05T20:03:28Z

db_stress_tool/db_stress_test_base.cc

@@ -1378,16 +1384,76 @@ void StressTest::OperateDb(ThreadState* thread) {
      } else if (prob_op < write_bound) {
        assert(prefix_bound <= prob_op);
        // OPERATION write
+        if (disable_fault_injection_during_put_delete) {
+          fault_fs_guard->DisableThreadLocalErrorInjection(


This enable/disable code is repeated in several places; maybe move it into a helper function.
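One possible shape for such a helper is a small RAII guard. The sketch below is only an illustration: `ToyFaultInjector` is a hypothetical stand-in with argument-free methods (the real fault injection API in the diff above is truncated and may take parameters), and `ScopedInjectionDisabler` is an assumed name, not something from this PR.

```cpp
#include <cassert>

// Hypothetical stand-in for the fault injection file system; the real
// Enable/Disable methods may have different signatures.
struct ToyFaultInjector {
  // One flag per thread, so disabling affects only the calling thread.
  static bool& EnabledForThisThread() {
    thread_local bool enabled = true;
    return enabled;
  }
  void EnableThreadLocalErrorInjection() { EnabledForThisThread() = true; }
  void DisableThreadLocalErrorInjection() { EnabledForThisThread() = false; }
  bool ShouldInject() const { return EnabledForThisThread(); }
};

// Disables injection for the current thread on construction (if requested and
// currently enabled) and restores it on destruction, including early returns.
class ScopedInjectionDisabler {
 public:
  ScopedInjectionDisabler(ToyFaultInjector* fs, bool disable)
      : fs_(fs), active_(disable && fs != nullptr && fs->ShouldInject()) {
    if (active_) fs_->DisableThreadLocalErrorInjection();
  }
  ~ScopedInjectionDisabler() {
    if (active_) fs_->EnableThreadLocalErrorInjection();
  }
  ScopedInjectionDisabler(const ScopedInjectionDisabler&) = delete;
  ScopedInjectionDisabler& operator=(const ScopedInjectionDisabler&) = delete;

 private:
  ToyFaultInjector* fs_;
  bool active_;
};

// Reports whether injection is active while a guard is alive.
bool InjectionActiveInsideGuard(bool disable) {
  ToyFaultInjector fs;
  ScopedInjectionDisabler guard(&fs, disable);
  return fs.ShouldInject();
}

// Reports whether injection is active again once the guard has gone away.
bool InjectionActiveAfterGuard() {
  ToyFaultInjector fs;
  { ScopedInjectionDisabler guard(&fs, true); }
  return fs.ShouldInject();
}
```

A guard like this would replace each paired disable/enable call site with a single declaration at the top of the write branch.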

Would it work if we just keep the expected state at pending state? I

Yeah, but that would require too many changes to ExpectedState, which would take some time to reason about, since the pending state was designed to be an intermediate state held under a lock. So I lean toward not doing so. I will address the rest of the TODOs.

Fixed the refactoring and the expected state

hx235 · 2024-07-08T18:44:57Z

The discussion with @cbi42 also covered leaving the expected value in the pending state upon an injected error. Still thinking about it.

facebook-github-bot · 2024-07-10T18:02:34Z

@hx235 has updated the pull request. You must reimport the pull request before landing.

hx235 · 2024-07-10T18:07:54Z

Would it work if we just keep the expected state at pending state? Ideally only when injected error is from writing to MANIFEST. If the failure is from writing to WAL, we can just treat it as failed.

I decided to treat a write-to-WAL failure the same as the manifest case for two reasons: both trigger auto-recovery, and it is hard to handle the WAL case separately in a thorough way right now, because SyncDir can fail for the WAL under metadata write error injection.

facebook-github-bot · 2024-07-10T18:08:13Z

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

hx235 · 2024-07-10T21:36:26Z

Irrelevant crash test failures

cbi42 · 2024-07-11T00:29:49Z

db_stress_tool/no_batched_ops_stress.cc

-        shared->PreparePut(rand_column_family, rand_key);
+        shared->PreparePut(rand_column_family, rand_key, &prepared);
+    if (!prepared) {
+      pending_expected_value.PermitUnclosedPendingState();


Should we continue the write here instead of returning early?

This early return avoids writing to a key whose earlier write is still being recovered.

I was thinking we can still write to them? If recovery is not done, then the injected error will be returned. If recovery is done, then the write can succeed and we can commit the expected value update.

Discussed offline

facebook-github-bot · 2024-07-13T00:41:37Z

@hx235 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2024-07-13T00:42:09Z

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

hx235 · 2024-07-13T00:47:05Z

After further discussion with @cbi42: for a write that fails after the WAL write (e.g., an ApplyWALToManifest failure), its expected state is left pending. For other writes, we either roll back or commit the expected state. To close those pending expected states, the initial thought was to loop-write the same key until it succeeds. But that surfaced an internal bug that is not easy to fix (being discussed internally). So for now, this PR closes those expected states naturally as part of the restart (VerifyOrSyncValue()). The downside is that up to n keys, where n is the number of crash test threads (default 32), are left in the pending state until the next DB restart (instead of until error recovery finishes during the same session) and won't accept any writes.

Summary:
**Context/Summary:** WAL sync currently happens after the memtable write. This is inconvenient in the stress test, because we can't simply roll back the ExpectedState when a write fails due to an injected WAL sync error, so something complicated like #12838 might be needed. After moving WAL sync before memtable insertion, there should be no injected IO error after memtable insertion, so we can keep the current simple way of handling a failed write in the stress test with an ExpectedState rollback.

Pull Request resolved: #12869

Test Plan:
1. Below command failed with `iterator has key 0000000000000207000000000000012B0000000000000013, but expected state does not.` before this PR and passes after
``` ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 
--compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=50 --key_len_percent_dist=1,30,69 
--key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=0 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 
--snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=50 Reviewed By: jowlyzhang Differential Revision: D59825730 Pulled By: hx235 fbshipit-source-id: 7d77aaf177ded2f99bf1ce19f5a4bd0783b9ca92

facebook-github-bot · 2024-07-19T23:27:10Z

@hx235 has updated the pull request. You must reimport the pull request before landing.

hx235 · 2024-07-19T23:30:41Z

To close those pending expected states, the initial thought was to loop-write the same key until it succeeds. But that surfaced an internal bug that is not easy to fix (being discussed internally). So for now, this PR closes those expected states naturally as part of the restart (VerifyOrSyncValue()).

@cbi42 Now that the internal bug is fixed in #12873, this PR closes those pending expected states by loop-writing the same key until the write succeeds.
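For readers following along, the loop-write idea can be sketched roughly as below. This is only an illustration: `LoopWriteUntilSuccess`, `RetryResult` and the `max_attempts` bound are hypothetical and exist here so the sketch terminates; the PR text only says the same key is written in a loop until it succeeds.

```cpp
#include <cassert>
#include <functional>

struct RetryResult {
  bool committed;  // true once a write attempt succeeded
  int attempts;    // number of write attempts made
};

// Retries `write_once` until it succeeds, calling `wait_for_recovery` between
// attempts. The `max_attempts` bound is an assumption of this sketch.
RetryResult LoopWriteUntilSuccess(
    const std::function<bool()>& write_once,
    const std::function<void()>& wait_for_recovery, int max_attempts) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (write_once()) {
      // Success: the caller can now commit the pending expected value.
      return RetryResult{true, attempt};
    }
    // Injected error still in effect: give error recovery time to finish.
    wait_for_recovery();
  }
  return RetryResult{false, max_attempts};
}

// Helper for experimentation: the write fails `injected_failures` times and
// then succeeds. Returns the attempt count, or -1 if the bound was reached.
int AttemptsNeeded(int injected_failures) {
  int remaining = injected_failures;
  RetryResult r = LoopWriteUntilSuccess(
      [&remaining]() { return remaining-- <= 0; }, []() {}, 100);
  return r.committed ? r.attempts : -1;
}
```

In the stress test the wait step would correspond to sleeping while error recovery runs; here it is a no-op so the sketch stays deterministic.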

cbi42 · 2024-07-23T21:15:08Z

db_stress_tool/db_stress_shared_state.h

-  // write or delete to the same `key`.
-  PendingExpectedValue PreparePut(int cf, int64_t key) {
-    return expected_state_manager_->PreparePut(cf, key);
+  // It can fail and `*prepared` will be set to false if the previous write or


Do we still need these changes with the loop-retry approach?

I intend to keep this so that we don't have to decide what value a read is expected to return when we see both a pending write and a pending delete on the same key. That situation is possible because we release the lock during the sleep before retrying the write to the same key. I sense it would add too much reasoning complexity to ExpectedState for an edge case.
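A rough sketch of the kind of check being discussed, assuming a simplified per-key state: a prepare call refuses to start while an earlier write or delete on the same key is still pending, and reports that through `*prepared`. `ToyExpectedState` and `ToyState` are hypothetical and far simpler than the real ExpectedState.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified per-key state.
enum class ToyState { kCommitted, kPendingPut, kPendingDelete };

class ToyExpectedState {
 public:
  // Sets *prepared to false and changes nothing when an earlier write or
  // delete to `key` is still pending.
  void PreparePut(int64_t key, bool* prepared) {
    *prepared = Prepare(key, ToyState::kPendingPut);
  }
  void PrepareDelete(int64_t key, bool* prepared) {
    *prepared = Prepare(key, ToyState::kPendingDelete);
  }
  // Commit or rollback: either way the key is no longer pending.
  void Close(int64_t key) { states_[key] = ToyState::kCommitted; }

 private:
  bool Prepare(int64_t key, ToyState next) {
    auto it = states_.find(key);
    if (it != states_.end() && it->second != ToyState::kCommitted) {
      return false;  // another operation on this key is still pending
    }
    states_[key] = next;
    return true;
  }

  std::map<int64_t, ToyState> states_;
};

// Prepares a put, optionally closes it, then tries to prepare a delete.
bool SecondPrepareSucceeds(bool close_first) {
  ToyExpectedState state;
  bool prepared = false;
  state.PreparePut(7, &prepared);
  if (close_first) state.Close(7);
  state.PrepareDelete(7, &prepared);
  return prepared;
}
```

With this check a key can never be in "pending put" and "pending delete" at once, which is the ambiguity the comment above wants to avoid.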

cbi42 · 2024-07-25T18:22:42Z

db/db_impl/db_impl_write.cc

+
+#ifndef NDEBUG
+  if (io_s.ok()) {
+    SET_WAL_WRITE_SUCCEED();


The writer thread may not become the leader of a write group, so it may never enter this function and set wal_write_succeed. Maybe we can augment the error message for WAL-related errors instead.

ahhh good catch ...

Fixed by relying on the error message, with some minor refactoring of how injected error messages are generated
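As a rough illustration of the message-based approach (not the actual strings or helpers in this PR), the injected error message can carry a marker saying the failure came from the WAL, and the stress test can inspect it. The tags `injected` and `[WAL]` and all function names below are assumptions made for this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical markers; the real strings used by the fault injection code may
// differ.
static const char kInjectedTag[] = "injected";
static const char kWalTag[] = "[WAL]";

// Builds an injected error message, tagging it when the failing file is a WAL.
std::string MakeInjectedErrorMessage(const std::string& op,
                                     bool target_is_wal) {
  std::string msg = std::string(kInjectedTag) + " " + op + " error";
  if (target_is_wal) msg += " " + std::string(kWalTag);
  return msg;
}

bool IsInjectedError(const std::string& msg) {
  return msg.find(kInjectedTag) != std::string::npos;
}

// True when the failure was injected somewhere other than the WAL write, so
// the record may already be in the WAL and a rollback would be unsafe.
bool ShouldLeaveExpectedStatePending(const std::string& msg) {
  return IsInjectedError(msg) && msg.find(kWalTag) == std::string::npos;
}
```

Unlike a flag set inside the leader-only write path, the message travels with the returned status, so a follower thread in the write group sees it too.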

facebook-github-bot · 2024-07-27T23:12:39Z

@hx235 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2024-07-27T23:13:55Z

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cbi42

LGTM

facebook-github-bot · 2024-07-29T19:35:01Z

@hx235 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2024-07-29T19:36:06Z

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-07-29T20:54:50Z

@hx235 merged this pull request in 408e8d4.

Summary:
**Context:** #12838 allows a write thread that encountered certain injected errors to release the lock and sleep before retrying the write, in order to reduce the performance cost. This requires adding checks like [this](https://github.com/facebook/rocksdb/blob/b26b395e0a15255d322be08110db551976188745/db_stress_tool/expected_value.cc#L29-L31) to prevent writes to the same key from another thread. The added check causes a false-positive failure when delete range + file ingestion + backup are used together. Consider the following scenario:

(1) Issue a delete range covering some keys that do not exist and one key that does exist (call it k1). k1 will have the "pending delete" state, while the keys that do not exist keep whatever state they already have, since we don't delete a key that does not already exist.

(2) After #12838, `PrepareDeleteRange(... &prepared)` will return `prepared = false`. So the logic below is executed, and k1's "pending delete" is neither rolled back nor committed.
```
std::vector<PendingExpectedValue> pending_expected_values =
    shared->PrepareDeleteRange(rand_column_family, rand_key,
                               rand_key + FLAGS_range_deletion_width,
                               &prepared);
if (!prepared) {
  for (PendingExpectedValue& pending_expected_value :
       pending_expected_values) {
    pending_expected_value.PermitUnclosedPendingState();
  }
  return s;
}
```

(3) Issue a file ingestion covering k1 and another key k2. Similar to (2), `shared->PreparePut(column_family, key, &prepared)` returns `prepared = false` for k1, while k2 will have a "pending put" state. So the logic below is executed, and k2's "pending put" state is neither rolled back nor committed.
```
for (int64_t key = key_base;
     s.ok() && key < shared->GetMaxKey() &&
     static_cast<int32_t>(keys.size()) < FLAGS_ingest_external_file_width;
     ++key) {
  PendingExpectedValue pending_expected_value =
      shared->PreparePut(column_family, key, &prepared);
  if (!prepared) {
    pending_expected_value.PermitUnclosedPendingState();
    for (PendingExpectedValue& pev : pending_expected_values) {
      pev.PermitUnclosedPendingState();
    }
    return;
  }
}
```

(4) Issue a backup and verify on k2. The logic below decides that k2 should exist in the restored DB since it has a pending write state, while k2 was never ingested into the original DB because (3) returns early.
```
bool Exists() const { return PendingPut() || !IsDeleted(); }

TestBackupRestore() {
  ...
  Status get_status = restored_db->Get(
      read_opts, restored_cf_handles[rand_column_families[i]], key,
      &restored_value);
  bool exists = thread->shared->Exists(rand_column_families[i], rand_keys[0]);
  if (get_status.ok()) {
    if (!exists && from_latest && ShouldAcquireMutexOnKey()) {
      std::ostringstream oss;
      oss << "0x" << key.ToString(true)
          << " exists in restore but not in original db";
      s = Status::Corruption(oss.str());
    }
  } else if (get_status.IsNotFound()) {
    if (exists && from_latest && ShouldAcquireMutexOnKey()) {
      std::ostringstream oss;
      oss << "0x" << key.ToString(true)
          << " exists in original db but not in restore";
      s = Status::Corruption(oss.str());
    }
  }
  ...
}
```

So we see a false-positive corruption like `Failure in a backup/restore operation with: Corruption: 0x000000000000017B0000000000000073787878 exists in original db but not in restore`

A simple fix is to remove `PendingPut()` from `bool Exists()`, since it's called under a lock and should never see a pending write. However, for "under a lock and should never see a pending write" to be true, we need to remove the logic that releases the lock during the sleep in the write thread, which exposes the pending write to other threads that can call Exists(), such as the backup thread.
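To illustrate why dropping `PendingPut()` matters for k2, here is a minimal sketch using two plain booleans in place of the real expected value encoding. The function names `ExistsBeforeFix` and `ExistsAfterFix` are hypothetical and only mirror the `Exists()` expression quoted above.

```cpp
#include <cassert>

// Toy model: two booleans stand in for the real expected value encoding.
// Mirrors `return PendingPut() || !IsDeleted();`
bool ExistsBeforeFix(bool is_deleted, bool pending_put) {
  return pending_put || !is_deleted;
}

// Mirrors the proposed fix, `return !IsDeleted();`. It is only valid if
// callers can never observe a pending put, i.e. the lock is held throughout.
bool ExistsAfterFix(bool is_deleted, bool pending_put) {
  (void)pending_put;  // intentionally ignored after the fix
  return !is_deleted;
}
```

In the scenario above, k2 is absent from the original DB but carries an unclosed "pending put", which corresponds to `is_deleted = true, pending_put = true`: the old expression claims the key exists, the new one does not.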
The downside of holding the lock during the sleep is that it blocks other write threads targeting the same key, because they need to wait for the lock. This should happen rarely, as each thread selects its key randomly in the crash test, as shown below.
```
void StressTest::OperateDb(ThreadState* thread) {
  for (uint64_t i = 0; i < ops_per_open; i++) {
    ...
    int64_t rand_key = GenerateOneKey(thread, i);
    ...
  }
}
```

**Summary:**
- Removed the "lock release" part and related checks
- Printed recovery time if the write thread waited more than 10 seconds
- Reverted regression in testing coverage when deleting a non-existent key

Pull Request resolved: #12917

Test Plan: Below command repro-ed frequently before the fix and not after.
``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --blob_cache_size=8388608 --blob_compaction_readahead_size=1048576 --blob_compression_type=none --blob_file_size=1073741824 --blob_file_starting_level=1 --blob_garbage_collection_age_cutoff=0.0 --blob_garbage_collection_force_threshold=0.75 --block_align=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=16.216959977115277 --bottommost_compression_type=xpress --bottommost_file_compaction_delay=600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 
--check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=1000000 --checksum_type=kXXH3 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=0 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=10 --compress_format_version=2 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=2097151 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=0 --delrangepercent=5 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_blob_files=0 --enable_blob_garbage_collection=1 --enable_checksum_handoff=1 --enable_compaction_filter=1 --enable_custom_split_merge=1 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=2 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 
--get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=2097152 --high_pri_pool_ratio=0.5 --index_block_restart_interval=1 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=1000 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=1 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=0 --min_blob_size=16 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=10000 --periodic_compaction_seconds=10 --prefix_size=8 --prefixpercent=0 --prepopulate_blob_cache=1 --prepopulate_block_cache=1 
--preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=60 --recycle_log_file_num=1 --reopen=20 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=118 --universal_max_read_amp=-1 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_blob_cache=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D60890580 Pulled By: hx235 fbshipit-source-id: 401f90d6d351c7ee11088cad06fb00e54062d416

facebook-github-bot added the CLA Signed label Jul 4, 2024

hx235 requested a review from cbi42 July 5, 2024 18:14

cbi42 reviewed Jul 5, 2024

hx235 force-pushed the fix_iter branch from d14130b to 4644e36 Compare July 10, 2024 18:02

hx235 changed the title ~~Retry write on injected error in crash test + misc~~ Handle write with injected error in crash test + misc Jul 10, 2024

hx235 requested a review from cbi42 July 10, 2024 21:36

cbi42 reviewed Jul 11, 2024

hx235 changed the title ~~Handle write with injected error in crash test + misc~~ Handle injected write error after successful WAL write in crash test + misc Jul 11, 2024

hx235 force-pushed the fix_iter branch from 4644e36 to 475aaa1 Compare July 13, 2024 00:41

hx235 changed the title ~~Handle injected write error after successful WAL write in crash test + misc~~ [WIP] Handle injected write error after successful WAL write in crash test + misc Jul 14, 2024

hx235 mentioned this pull request Jul 16, 2024

Move ApplyWALToManifest before memtable insertion #12869

Closed

hx235 force-pushed the fix_iter branch from 475aaa1 to 3e1f10d Compare July 19, 2024 23:27

hx235 changed the title ~~[WIP] Handle injected write error after successful WAL write in crash test + misc~~ Handle injected write error after successful WAL write in crash test + misc Jul 19, 2024

hx235 requested review from cbi42 and removed request for cbi42 July 22, 2024 19:54

hx235 changed the title ~~Handle injected write error after successful WAL write in crash test + misc~~ [WIP] Handle injected write error after successful WAL write in crash test + misc Jul 23, 2024

cbi42 reviewed Jul 23, 2024

hx235 changed the title ~~[WIP] Handle injected write error after successful WAL write in crash test + misc~~ Handle injected write error after successful WAL write in crash test + misc Jul 25, 2024

cbi42 reviewed Jul 25, 2024

fix

565fc5c

hx235 force-pushed the fix_iter branch from 3e1f10d to 31b41bc Compare July 27, 2024 23:12

hx235 requested a review from cbi42 July 28, 2024 01:08

cbi42 approved these changes Jul 29, 2024

Address comment

6c91319

hx235 force-pushed the fix_iter branch from 31b41bc to 6c91319 Compare July 29, 2024 19:34

facebook-github-bot closed this in 408e8d4 Jul 29, 2024

facebook-github-bot added the Merged label Jul 29, 2024

hx235 mentioned this pull request Aug 7, 2024

Fix false-positive TestBackupRestore corruption #12917

Closed


