Consume fewer XIDs when restarting primary #8290

Open
hlinnaka wants to merge 1 commit into main
Conversation

hlinnaka
Contributor

@hlinnaka hlinnaka commented Jul 5, 2024

The pageserver tracks the latest XID seen in the WAL, in the nextXid field of the "checkpoint" key-value pair. To reduce churn on that single storage key, it is not tracked exactly. Rather, when we advance it, we always advance it to the next multiple of 1024 XIDs. That way, we only need to insert a new checkpoint value into storage every 1024 transactions.
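
For illustration, the rounding boils down to something like the following minimal sketch. XID_CHECKPOINT_INTERVAL is the constant this PR changes; the helper name and the omitted wraparound handling are simplifications, not the actual pageserver code:

const XID_CHECKPOINT_INTERVAL: u32 = 1024;

// Round nextXid up to the next multiple of the interval, so the "checkpoint"
// key only needs a new version once per interval's worth of transactions.
// (Sketch only: XID wraparound and special XIDs are ignored here.)
fn rounded_next_xid(latest_xid_in_wal: u32) -> u32 {
    let next = latest_xid_in_wal + 1;
    next.div_ceil(XID_CHECKPOINT_INTERVAL) * XID_CHECKPOINT_INTERVAL
}

With the interval at 1024, up to 1023 XIDs between the last real XID and the rounded-up nextXid are "skipped"; those skipped XIDs are what a replica has to treat as in-progress, as described next.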

However, read-only replicas now scan the WAL at startup to find any XIDs that haven't been explicitly aborted or committed, and treat them as still in-progress (PR #7288). When we bump up the nextXid counter by 1024, all those skipped XIDs look like in-progress XIDs to a read replica. There's a limited amount of space for tracking in-progress XIDs, so there's more cost to skipping XIDs now. We had a case in production where a read replica did not start up, because the primary had gone through many restart cycles without writing any running-xacts or checkpoint WAL records, and each restart added almost 1024 "orphaned" XIDs that had to be tracked as in-progress in the replica. As soon as the primary writes a running-xacts or checkpoint record, the orphaned XIDs can be removed from the in-progress XIDs list and the problem resolves, but if those records are not written, the orphaned XIDs just accumulate.
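
To put the limited space in perspective (rough numbers from stock Postgres sizing, not anything in this PR): a standby tracks in-progress XIDs in the KnownAssignedXids array, which holds on the order of (PGPROC_MAX_CACHED_SUBXIDS + 1) * (MaxBackends + max_prepared_xacts) entries, i.e. roughly 65 * 100 ≈ 6500 slots for a typical MaxBackends. At almost 1024 orphaned XIDs per restart, a handful of restarts without a running-xacts or checkpoint record is enough to exhaust that; at 128 it takes about eight times as many restarts.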

We should work harder to make sure that a running-xacts or checkpoint record is written at primary startup or shutdown. But at the same time, we can simply make XID_CHECKPOINT_INTERVAL smaller, to consume fewer XIDs in such scenarios. That means we will generate more versions of the checkpoint key-value pair in storage, but we haven't seen any problems with that, so it's probably fine to go from 1024 to 128.
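
In concrete terms, the tradeoff is: with an interval of 1024 we write one new checkpoint value per ~1000 transactions and can skip up to 1023 XIDs per restart; with 128 we write roughly 8x more checkpoint versions but skip at most 127 XIDs per restart, so the orphaned-XID accumulation described above is 8x slower.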

@hlinnaka hlinnaka requested review from a team and problame and removed request for a team July 5, 2024 16:45
@hlinnaka hlinnaka requested review from a team as code owners July 5, 2024 16:45
@hlinnaka
Contributor Author

hlinnaka commented Jul 5, 2024

Storage team: any concerns from generating more churn on the single "checkpoint" key-value pair?


github-actions bot commented Jul 5, 2024

3042 tests run: 2927 passed, 0 failed, 115 skipped (full report)


Flaky tests (2)

Postgres 16

Code coverage* (full report)

  • functions: 32.6% (6934 of 21275 functions)
  • lines: 50.0% (54490 of 108968 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
62b1e07 at 2024-07-05T17:32:17.337Z :recycle:

Member

@skyzh skyzh left a comment


Based on my understanding, this parameter affects the frequency of writes to CHECKPOINT_KEY. 1024 -> 128 means 8 times more writes. Given that the checkpoint value is small, I don't think it would be a huge concern for pageserver storage.

@jcsp
Contributor

jcsp commented Jul 8, 2024

Storage team: any concerns from generating more churn on the single "checkpoint" key-value pair?

Can we get a test on this PR that reproduces the many-transactions case that the change fixes? That would be very useful to let us instrument this case (in future we might add some key-level limit on delta depths).

I'm not sure if there are tests elsewhere for RO replicas that cover this kind of thing: feels like the original issue was probably severe enough to warrant a reproducer.

Contributor

@knizhnik knizhnik left a comment


I wonder if we can avoid the alignment at all.
It doesn't show any noticeable impact on a pgbench workload in my tests...

@problame
Contributor

problame commented Jul 9, 2024

We had a case in production where a read replica did not start up, because the primary had gone through many restart cycles without writing any running-xacts or checkpoint WAL records, and each restart added almost 1024 "orphaned" XIDs that had to be tracked as in-progress in the replica.

Just so I understand correctly: the workload that caused this production problem would take 8 times as long to hit this bug after we merge this PR, correct? And the hope is that within that 8 times longer time window, the primary will write a running-xacts or checkpoint record.

Doesn't seem like a systematic fix to me, but I don't have big concerns with respect to storage.


Running the following query repeatedly will now produce CheckPoint records at an 8x faster rate, right? Because even readonly txns consume XIDs (?)

BEGIN;
-- readonly workload
COMMIT;

Edit: evidently readonly txns don't consume xids.


The struct CheckPoint is 88 bytes.
That puts it on the lower end of the Value size spectrum.

An InMemoryLayerInner::index for an InMemoryLayer filled with only CheckPoint records will require ~100MiB of DRAM for our 256MiB checkpoint_distance.
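
Rough math behind that estimate, with my own assumption of a few tens of bytes of index bookkeeping (key, LSN, offset) per value: 256 MiB / 88 B ≈ 3 million checkpoint images per layer, and 3 million entries * ~34 B ≈ 100 MiB of index.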


Maybe a stupid idea, but why don't we allow basebackup requests only for checkpoint LSNs?

@knizhnik
Contributor

Just so I understand correctly: the workload that caused this production problem would take 8 times as long to hit this bug after we merge this PR, correct? And the hope is that within that 8 times longer time window, the primary will write a running-xacts or checkpoint record.

Actually, we now perform a fast shutdown and therefore write a shutdown checkpoint with the proper oldestActiveXid.
In most cases (if there are no prepared transactions), oldestActiveXid will be quite close to the end (nextXid).
So we do not need to traverse a large part of the CLOG.

So it is not correct to say that to reproduce the bug we need to perform 8 times more restarts. Most likely it will not be reproduced at all.

@knizhnik
Contributor

knizhnik commented Jul 11, 2024

The struct CheckPoint is 88 bytes.

The size of a transaction commit record is usually larger, so adding a checkpoint record doesn't significantly increase storage consumption, especially when it is done only every 128 transactions.
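
Amortized, that is 88 bytes per 128 transactions, i.e. well under one byte per transaction, compared to a commit WAL record of at least a few dozen bytes per transaction, so the relative overhead is small.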

@knizhnik
Contributor

Maybe a stupid idea, but why don't we allow basebackup requests only for checkpoint LSNs?

Sorry, I do not completely understand the idea.
You need to launch the node at the LSN where it finished (you cannot lose any transactions).
Right now the node performs a checkpoint on each normal shutdown.
But in case of abnormal termination, there will be no checkpoint.

Actually, there is another solution; I wonder if the @neondatabase/storage team will want to implement it.
Right now there is no interaction between walingest (which advances nextXid) and basebackup except through our KV storage: at every 128th XID, walingest advances nextXid and writes the checkpoint record, and basebackup extracts it from storage. If walingest also updated the current nextXid in memory (yet another Mutex in Timeline), then basebackup could take this precise value without accessing storage. Certainly it would work only when we create a basebackup for the primary node at the most recent LSN. In that case there would be no XID gaps in the CLOG at all, and we could use a larger XID alignment (write checkpoints rarely) without the risk of overflowing the known-xacts list.

But certainly it would complicate the pageserver.
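
To make the proposal above concrete, a rough sketch of the idea; the field and method names are hypothetical rather than the actual Timeline API, and XID wraparound/special-XID handling is simplified:

use std::sync::Mutex;

pub struct Timeline {
    // ...existing fields...
    // Hypothetical: precise nextXid as observed by walingest, updated on
    // every XID rather than only every XID_CHECKPOINT_INTERVAL-th one.
    precise_next_xid: Mutex<u32>,
}

impl Timeline {
    // Called from walingest for each XID seen in the WAL.
    pub fn advance_next_xid(&self, xid: u32) {
        let mut next = self.precise_next_xid.lock().unwrap();
        // Treat `xid` as newer if it is at most 2^31 ahead (modulo wraparound).
        if xid.wrapping_sub(*next) < (1u32 << 31) {
            *next = xid.wrapping_add(1);
        }
    }

    // Used by basebackup when serving the primary at the most recent LSN:
    // return the exact value instead of the rounded-up one from storage.
    pub fn get_precise_next_xid(&self) -> u32 {
        *self.precise_next_xid.lock().unwrap()
    }
}

An AtomicU32 would avoid lock contention here; either way, as noted above, this only helps basebackups taken at the latest LSN of the primary.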

@problame
Contributor

Sorry, I do not completely understand the idea.
You need to launch the node at the LSN where it finished (you cannot lose any transactions).

Let's take this off GitHub => https://neondb.slack.com/archives/C03QLRH7PPD/p1720792077031629


For the record I have no objections to this PR, and I don't expect anything to break.

@problame problame removed their request for review July 12, 2024 13:49