Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Data loss when docdb_filter_intent_ht is enabled #23890

Closed
1 task done
es1024 opened this issue Sep 11, 2024 · 0 comments
Closed
1 task done

[DocDB] Data loss when docdb_filter_intent_ht is enabled #23890

es1024 opened this issue Sep 11, 2024 · 0 comments
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@es1024
Copy link
Contributor

es1024 commented Sep 11, 2024

Jira Link: DB-12794

Description

When docdb_filter_intent_ht is enabled, we attempt to filter out intent SST files by min_running_ht during transaction loading step of bootstrap, but this is too aggressive and causes data loss. This has been turned off (#23184) and needs to be fixed.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@es1024 es1024 added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Sep 11, 2024
@es1024 es1024 self-assigned this Sep 11, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Sep 11, 2024
@rthallamko3 rthallamko3 removed the status/awaiting-triage Issue awaiting triage label Sep 11, 2024
es1024 added a commit that referenced this issue Sep 17, 2024
… min_replay_txn_start_ht

Summary:
This diff adds filtering for intent SST files during the transaction loading step of tablet bootstrap, using the newly introduced `min_replay_txn_start_ht`.

When CDC is enabled, we persist intent SST files longer than they are otherwise needed, until CDC has streamed all transactions in the SST file and moved the retention barrier far enough ahead. This can lead to a large buildup of intent SST files which are not actually needed at bootstrap time.

8b23a4e / D35639 added changes to save `min_running_ht` periodically, and add intent SST file filtering during bootstrap time based on periodically saved values of `min_running_ht`. This can lead to data loss, if there is a transaction T such that the following is true:
- T has been applied (APPLIED written to WALs)
- T has intents that have been flushed to disk (this is rare but possible when CDC is disabled since in the ideal non-CDC case we never flush intents)
- Changes made by T on regulardb have not been flushed
- The metadata record for T is on an intent SST file whose max HT is less than min_running_ht after T apply (i.e. intentsdb flush happened between T writes and apply)
- Tablet bootstrap state has been saved after T has committed

These conditions will result in a `min_running_ht > T.start_time` being written to disk, and loaded during tablet bootstrap. Since regulardb changes have not been flushed, WAL replay will start from a point that includes T. However, transaction loader will not load T, because its metadata record has been excluded due to the SST file filter. This results in changes made by T being dropped, even though it has successfully committed.

This change introduces a new `min_replay_txn_start_ht` and changes the intent SST file filter to be based off of periodically saved values of this new `min_replay_txn_start_ht`.

`min_replay_txn_start_ht` is defined as the minimum of:
- `min_running_ht`
- `start_ht` of any transaction which may be read during WAL replay

WAL replay begins at `bootstrap_start_op_id = min(intentsdb flushed_op_id, rocksdb flushed_op_id, retryable requests last_flushed_op_id)`. We calculate `min_replay_txn_start_ht` by maintaining a set of `(applied_op_id, start_ht)` for recently applied transactions. Transactions are added into this set when they are applied and cleaned from memory (removed from `transactions_`) and are removed when `bootstrap_start_op_id` is increased past `applied_op_id`. `min_replay_txn_start_ht` is then the minimum of `start_ht` of this set and `min_running_ht`.

Since `replay_start_op_id` is only updated after flushes to disk, this ensures that any transaction whose metadata record is filtered out by the intent SST file filter will not be incorrectly loaded during WAL replay, since such a transaction would have `apply_op_id < replay_start_op_id` (the `replay_start_op_id`  calculated at bootstrap time), so none of its records are read by WAL replay.

**Upgrade/Rollback safety:**

The `min_running_ht` field in `TabletBootstrapStatePB` was retired and a new `min_replay_txn_start_ht` field was added. There are no autoflags added because `min_replay_txn_start_ht` is only used for an optimization (intent SST file filtering) so the lack of its presence post-upgrade does not change correctness, and its presence post-rollback is simply ignored. `min_running_ht` was only used for a incorrect implementation of the optimization which was off by default, so the lack of its presence post-rollback does not harm correctness (and actually improves it if optimization was turned on) and its presence after upgrade is ignored.

A different field was used for this change to ensure that values of `min_running_ht` set before upgrade are not used, since it is unsafe to use it.
Jira: DB-12794

Test Plan:
Added test case to reproduce the data loss scenario when filter was using `min_running_ht`:
```
./yb_build.sh --cxx_test pgwrapper_pg_mini-test --gtest_filter PgMiniTestSingleNode.TestBootstrapOnAppliedTransactionWithIntents
```

Also confirmed that CDC stress tests stop failing after these changes.

Reviewers: sergei, qhu

Reviewed By: sergei

Subscribers: rthallam, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D37792
jasonyb pushed a commit that referenced this issue Sep 17, 2024
Summary:
 f97d7d5 [#23770] [#23797] [#23837] YSQL: Fix most non-regress tests when run with Connection Manager
 ee2b108 [docs] reverted PR 23909 changes (#23941)
 bd80f4e [#23924] DocDB: Address recent flakiness of PgGetLockStatusTest.TestGetWaitStart
 Excluded: f0a5db7 [#20908] YSQL: Introduce interface to optimize index non-key column updates
 6556498 [#23897] xClusterDDLRepl: Support create table with partition by primary key
 e2b1d28 [#23363] Changing the default gflags of YSQL memory configuration.
 e717f43 [#23925] DocDB: Address recent regression of test PgWaitQueuesTest.MultiTabletFairness
 09b7702 [#23940] YSQL: Replace copyright string from YugaByteDB to YugabyteDB
 add83ef [#23947] Update callhome URL to use https
 69db717 Update faq page (#23704)
 Excluded: 063dbe5 [#23786] YSQL: yb_make_next_ddl_statement_nonincrementing
 Excluded: a913524 [#23859] YSQL: Remove redundant Bitmap Scan filters on partial indexes
 41f5afd [PLAT-15097]: Delete System platform DB table while disabling ysql
 d6bbf59 [#23890] docdb: Add filtering for bootstrap intent iterators based on min_replay_txn_start_ht
 0b37479 [#23770] YSQL: Stabalize TestPgExplainAnalyze#testExplainAnalyzeOptions the test with ysql connection manager

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D38123
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

3 participants