
Restore running xacts from CLOG on replica startup #7288

Merged
merged 2 commits into main from restore_running_xids on Jul 1, 2024

Conversation

knizhnik
Contributor

@knizhnik knizhnik commented Apr 2, 2024

We have one pretty serious MVCC visibility bug with hot standby
replicas. We incorrectly treat any transactions that are in progress
in the primary, when the standby is started, as aborted. That can
break MVCC for queries running concurrently in the standby. It can
also lead to hint bits being set incorrectly, and that damage can last
until the replica is restarted.

The fundamental bug was that we treated any replica start as starting
from a shut down server. The fix for that is straightforward: we need
to set 'wasShutdown = false' in InitWalRecovery() (see changes in the
postgres repo).

However, that introduces a new problem: with wasShutdown = false, the
standby will not open up for queries until it receives a running-xacts
WAL record from the primary. That's correct, and that's how Postgres
hot standby always works. But it's a problem for Neon, because:

  • It changes the historical behavior for existing users. Currently,
    the standby immediately opens up for queries, so if they now need to
    wait, we can break existing use cases that were working fine
    (assuming you don't hit the MVCC issues).

  • The problem is much worse for Neon than it is for standalone
    PostgreSQL, because in Neon, we can start a replica from an
    arbitrary LSN. In standalone PostgreSQL, the replica always starts
    WAL replay from a checkpoint record, and the primary arranges things
    so that there is always a running-xacts record soon after each
    checkpoint record. You can still hit this issue with PostgreSQL if
    you have a transaction with lots of subtransactions running in the
    primary, but it's pretty rare in practice.

To mitigate that, we introduce another way to collect the
running-xacts information at startup, without waiting for the
running-xacts WAL record: we can scan the CLOG for XIDs that haven't been
marked as committed or aborted. It has limitations with
subtransactions too, but should mitigate the problem for most users.

See #7236.
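
To illustrate the idea, here is a simplified sketch using the stock CLOG and procarray APIs (not the actual RestoreRunningXactsFromClog added by this PR; error handling, locking and the overflow limit discussed below are omitted): scan the CLOG from the oldest active XID up to nextXid, keep every XID that is still marked in-progress, and hand the list to the standby machinery as if it had arrived in a running-xacts record.

#include "postgres.h"

#include "access/clog.h"
#include "access/transam.h"
#include "storage/procarray.h"
#include "storage/standby.h"

/* Sketch only: collect XIDs that CLOG still shows as in-progress. */
static bool
restore_running_xacts_sketch(TransactionId oldestActiveXid,
                             TransactionId nextXid,
                             TransactionId latestCompletedXid,
                             int max_xids)
{
	TransactionId *xids = palloc(max_xids * sizeof(TransactionId));
	int			nxids = 0;
	TransactionId xid = oldestActiveXid;
	RunningTransactionsData running;

	while (TransactionIdPrecedes(xid, nextXid))
	{
		XLogRecPtr	lsn;

		/* neither committed nor aborted in CLOG => treat it as still running */
		if (TransactionIdGetStatus(xid, &lsn) == TRANSACTION_STATUS_IN_PROGRESS)
		{
			if (nxids >= max_xids)
				return false;	/* too many: fall back to waiting for running-xacts */
			xids[nxids++] = xid;
		}
		TransactionIdAdvance(xid);
	}

	/* Apply the list as if it had come from a running-xacts WAL record. */
	running.xcnt = nxids;
	running.subxcnt = 0;
	running.subxid_overflow = false;
	running.nextXid = nextXid;
	running.oldestRunningXid = oldestActiveXid;
	running.latestCompletedXid = latestCompletedXid;
	running.xids = xids;
	ProcArrayApplyRecoveryInfo(&running);

	return true;
}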

@knizhnik knizhnik requested review from a team as code owners April 2, 2024 14:17
@hlinnaka
Contributor

hlinnaka commented Apr 2, 2024

That's a pleasantly short patch. Still highly work-in-progress obviously, but I think this is promising.

Issues / TODOs:

  • Can we do this in the neon extension instead of patching StartupXLog? I think so. You can do this after startup, and call ProcArrayApplyRecoveryInfo() with the list of xids.
  • What to do if the list of XIDs is too large? As this stands, you'll get a "too many KnownAssignedXids" error, which isn't great.

Unless we come up with some creative solution to the "too many KnownAssignedXids" problem, I think we can only use this in limited cases, when there are only a few in-progress (sub-)XIDs. But it might still be worth doing; that might cover 90% of the most common real-world scenarios, as long as we can fail gracefully or fall back to something else when it doesn't work out.


github-actions bot commented Apr 2, 2024

3000 tests run: 2885 passed, 0 failed, 115 skipped (full report)


Flaky tests (1)

Postgres 14

Code coverage* (full report)

  • functions: 32.7% (6914 of 21129 functions)
  • lines: 50.1% (54203 of 108195 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
9ce1930 at 2024-07-01T10:45:00.748Z :recycle:

@knizhnik
Contributor Author

knizhnik commented Apr 2, 2024

That's a pleasantly short patch. Still highly work-in-progress obviously, but I think this is promising.

Issues / TODOs:

  • Can we do this in the neon extension instead of patching StartupXLog? I think so. You can do this after startup, and call ProcArrayApplyRecoveryInfo() with the list of xids.

There are still some other Neon-specific changes in xlog.c/xlogrecovery.c related to startup, so I do not think that adding one more Neon-specific function fundamentally changes anything here.

I haven't yet considered moving this code to the neon extension; maybe it is possible. But it interacts tightly with the recovery state machine, so it may be non-trivial.

  • What to do if the list of XIDs is too large? As this stands, you'll get a "too many KnownAssignedXids" error, which isn't great.

Unless we come up with some creative solution to the "too many KnownAssignedXids" problem, I think we can only use this in limited cases, when there are only a few in-progress (sub-)XIDs. But it might still be worth doing; that might cover 90% of the most common real-world scenarios, as long as we can fail gracefully or fall back to something else when it doesn't work out.

I do not quite understand why this problem is specific to this particular way of propagating information about running xacts at replica startup.
If there are too many active (sub)transactions, then we should get "too many KnownAssignedXids" in RecordKnownAssignedTransactionIds anyway, shouldn't we?

I think we can only use this in limited cases, when there are only a few in-progress (sub-)XIDs.

I think it is more correct to say that we cannot use this mechanism only in a limited number of cases (thousands of subtransactions), because maxKnownAssignedXids = 64*MaxBackends. But as I mentioned above, if there is such a large number of active xids, then we get this error no matter which mechanism is used to solve the running-xacts problem.

Member

@koivunej koivunej left a comment


walingest changes look good, almost like a separate bugfix, but perhaps it does not have any effect without the accompanying postgres changes?

@knizhnik
Contributor Author

knizhnik commented Apr 4, 2024

walingest changes look good, almost like a separate bugfix, but perhaps it does not have any effect without the accompanying postgres changes?

Yes, an up-to-date value of oldestActiveXid is critical only in the context of this PR, where it is used to restore the running-xacts info.

@andreasscherbaum
Contributor

Discovered #7794 while working on this PR.

@hlinnaka
Contributor

I added two test cases to this, which are failing. They demonstrate two problems with the known-xid overflow:

  1. If there are a lot of (sub-)XIDs running in the primary when you start the standby, we only put up to TOTAL_MAX_CACHED_SUBXIDS/2 of them in the known-xids array. The rest are lost. This is a known issue; there's an XXX comment about it in RestoreRunningXactsFromClog(). This test demonstrates the consequences of that: you get incorrect query results in the standby.

  2. Even though we leave half of the known-xids array reserved for new XIDs that we will see during WAL replay, it might not be enough. If you run out of space in the array later, the standby crashes with FATAL: too many KnownAssignedXids.

@hlinnaka
Contributor

Move this code to extension together with some other Neon changes in xlog.c code?

I think you could call RestoreRunningXactsFromClog() from neon_rm_startup. That gets called at the beginning of WAL recovery, which happens to be called at the right time, even though this has nothing to do with initializing the neon RM.
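
For context, neon_rm_startup here is the rm_startup callback of the custom resource manager registered by the neon extension. A minimal sketch of that wiring, using the stock PostgreSQL 15 custom-rmgr API for illustration (RM_EXPERIMENTAL_ID and the RestoreRunningXactsFromClog signature are assumptions, and the other rmgr callbacks are omitted):

#include "postgres.h"

#include "access/xlog_internal.h"	/* RmgrData, RegisterCustomRmgr */
#include "fmgr.h"

PG_MODULE_MAGIC;

extern void RestoreRunningXactsFromClog(void);	/* signature assumed */

static void
neon_rm_startup(void)
{
	/* RmgrStartup() calls this when WAL redo begins on the standby. */
	RestoreRunningXactsFromClog();
}

static const RmgrData NeonRmgr = {
	.rm_name = "neon",
	.rm_startup = neon_rm_startup,
	/* rm_redo, rm_desc, rm_identify, ... omitted in this sketch */
};

void
_PG_init(void)
{
	RegisterCustomRmgr(RM_EXPERIMENTAL_ID, &NeonRmgr);
}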

@knizhnik
Contributor Author

Move this code to extension together with some other Neon changes in xlog.c code?

I think you could call RestoreRunningXactsFromClog() from neon_rm_startup. That gets called at the beginning of WAL recovery, which happens to be called at the right time, even though this has nothing to do with initializing the neon RM.

RestoreRunningXactsFromClog does nothing more than construct a list of TransactionIds, which is later passed to ProcArrayApplyRecoveryInfo. Should that also be called from the extension? It seems to be non-trivial, because the other information needed for this call (nextXid, oldestActiveXID, latestCompletedXid) is available only in xlog.c and it is not clear how to pass it to the extension. Also, ProcArrayApplyRecoveryInfo has to be called at the proper moment, and I am not sure that neon_rm_startup is such a moment.

So frankly speaking, I prefer to postpone refactoring xlog.c and extracting the Neon-specific changes from it. Right now it is more critical to fix the real problem with running xacts and RO replicas.

If we consider restoring this information from CLOG to be the proper solution (or at least a temporary workaround until the CSN patch is committed in vanilla), then let's decide what to do with known-xids overflow: just throw an error, restrict the number of restored subxacts, ... I am ok with any of these solutions.

As far as I understand, the vanilla approach to this problem is the following: if there is subxid overflow, then just do not produce a running-xacts WAL record and wait until there are fewer active subxacts. This approach should never cause "too many known xids" errors but can delay replica startup indefinitely. Not sure that we can do something similar in our case.

IMHO a large number (thousands) of subxacts is a very rare use case, and replication is also not needed by everybody. Combined, the probability that some customer ever hits this problem is close to zero. Certainly it would be nice to have a solution that works in 100% of cases, and CSN is such a solution. But trying to solve the problem within the existing model seems to be just a waste of time.

@hlinnaka
Contributor

If we consider restoring this information from CLOG to be the proper solution (or at least a temporary workaround until the CSN patch is committed in vanilla), then let's decide what to do with known-xids overflow: just throw an error, restrict the number of restored subxacts, ... I am ok with any of these solutions.

+1, let's nail down what the behavior will be.

As far as I understand, the vanilla approach to this problem is the following: if there is subxid overflow, then just do not produce a running-xacts WAL record and wait until there are fewer active subxacts.

It's a bit more subtle than that: Postgres always creates the running-xacts record, but if the subxid caches have overflowed, then the running-xacts record doesn't include the sub-XIDs that had overflowed. In that case, the standby needs to wait until all the overflowed XIDs have completed. When it sees another running-xacts record with a higher oldestActiveXid, it can start accepting queries.

This approach should never cause "too many known xids" errors but can delay replica startup indefinitely. Not sure that we can do something similar in our case.

Correct. To be clear, the infinite delay would happen in the case that you have one transaction open in the primary that you never commit. If you continuously start and end transactions, the standby will eventually start up, even if you have millions of transactions open all the time.

@hlinnaka
Contributor

If we consider restoring this information from CLOG to be the proper solution (or at least a temporary workaround until the CSN patch is committed in vanilla), then let's decide what to do with known-xids overflow: just throw an error, restrict the number of restored subxacts, ... I am ok with any of these solutions.

+1, let's nail down what the behavior will be.

I'm thinking:

If the list of XIDs scanned from CLOG fits in the known-assigned XIDs array, with enough free space left over that we can be sure we won't run out of space later during replay either, then start with that. Otherwise, bail out and wait for the next running-xacts record, like Postgres normally does.
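
In code terms, something like the gate below (the names and the exact headroom fraction are illustrative assumptions, not what the PR finally settled on):

/*
 * Sketch: only install the CLOG-derived list if it leaves enough headroom in
 * the KnownAssignedXids array for the XIDs that WAL replay will add later;
 * otherwise behave like vanilla hot standby and wait for the next
 * running-xacts record.
 */
static bool
clog_snapshot_fits(int nxids_from_clog, int max_known_assigned_xids)
{
	/* assumed headroom: keep at least half of the array free for replay */
	return nxids_from_clog <= max_known_assigned_xids / 2;
}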

@hlinnaka
Contributor

(moving this comment from #7855 to here, to keep the discussion in one place):

The tests that this adds are failing; that's expected. They demonstrate the two cases where this optimization fails: in one case the known-assigned xids array is overflowed, so we wait for the running-xacts record, and in the other case it overflows later, during WAL replay, so we crash with FATAL: too many KnownAssignedXids. We should adjust the test cases so that they expect those results and pass.

@hlinnaka
Contributor

There are a bunch of FIXMEs, TODOs, XXX comments here. Need to address them.

Two open items in particular:

  • Figure out why checkPoint.oldestActiveXid is sometimes InvalidTransactionId. There's currently a hack to set it to FirstNormalTransactionId if it's invalid, but that doesn't sound right.
  • Calculate what is the max number of XIDs we can put in the known-assigned XIDs array, and still be guaranteed that we won't hit the "too many KnownAssignedXids" error later during replay. Depending on how small that number is, we might want to make the known-assigned XIDs array larger in neon, to reserve some space for the XIDs we scan from CLOG. And decide if we'd rather risk getting the "too many KnownAssignedXids" error later, or wait for running-xacts record to arrive.

@knizhnik
Contributor Author

There are a bunch of FIXMEs, TODOs, XXX comments here. Need to address them.

  • Calculate what is the max number of XIDs we can put in the known-assigned XIDs array, and still be guaranteed that we won't hit the "too many KnownAssignedXids" error later during replay. Depending on how small that number is, we might want to make the known-assigned XIDs array larger in neon, to reserve some space for the XIDs we scan from CLOG. And decide if we'd rather risk getting the "too many KnownAssignedXids" error later, or wait for running-xacts record to arrive.

I looked into it some more. Right now the situation is the following:

  • The procarray contains "cached subxids"; their number is limited by PGPROC_MAX_CACHED_SUBXIDS=64.
  • If the number of subxacts exceeds PGPROC_MAX_CACHED_SUBXIDS, then the proc entry is marked as overflowed.
  • When GetRunningTransactionData scans the procarray to extract information about running xids, it sets suboverflowed=true if any of the entries is overflowed. In this case information about subxids is NOT included in the RUNNING_XACTS record.
  • The maximal size of known xids is defined as ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS). It means that if none of the transactions has more than 64 subxacts, then the information about running xacts should fit in known xids, so we should normally never get "too many KnownAssignedXids" with vanilla Postgres.
  • When a starting replica sees a RUNNING_XACTS record with subxid_overflow, it switches to state STANDBY_SNAPSHOT_PENDING and waits until either it receives a non-overflowed snapshot or oldestRunningXid is later than the nextXid from the initial snapshot.

If we restore running xids from CLOG, we can fill in more than PGPROC_MAX_CACHED_SUBXIDS entries. If, after the replica starts, we start multiple transactions with subxids, then we can easily cause an overflow of KnownAssignedXids.

So it looks like the only safe limit for restoring running xacts from CLOG is PGPROC_MAX_CACHED_SUBXIDS (64), which seems to be too small. I do not know what is better: use this safe limit, which can significantly slow down replica startup, or accept replica failure due to "too many KnownAssignedXids" in some very rare cases.

Also please note that all the arithmetic above takes into account only subxids. But KnownAssignedXids also includes normal (top-level) transaction XIDs, and their number can be larger than PGPROC_MAX_CACHED_SUBXIDS.
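
Putting illustrative numbers on that formula (MaxBackends = 100 is just an example value, and PROCARRAY_MAXPROCS is approximated by MaxBackends here):

/* Illustrative sizing only; these are stand-in values, not taken from the PR. */
#define EXAMPLE_PGPROC_MAX_CACHED_SUBXIDS 64	/* Postgres default */
#define EXAMPLE_PROCARRAY_MAXPROCS        100	/* assumed MaxBackends-like value */

/*
 * (64 + 1) * 100 = 6500 KnownAssignedXids slots in this example.  A single
 * transaction restored from CLOG with a few thousand subxids already consumes
 * a large share of that, which is why later WAL replay can hit the limit.
 */
#define EXAMPLE_MAX_KNOWN_ASSIGNED_XIDS \
	((EXAMPLE_PGPROC_MAX_CACHED_SUBXIDS + 1) * EXAMPLE_PROCARRAY_MAXPROCS)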

@knizhnik
Contributor Author

@hlinnaka what are the further steps with this PR?
The approach you proposed (restore from CLOG as much as we can and wait for running-xacts in case of overflow) seems to be correct and works as expected. But it is not compatible with our tests! In case of overflow, the replica will wait for a non-overflowed running-xacts record, and that can arrive only after a commit. So starting the replica and only then doing the commit is not possible.

I have changed my test to spawn a thread and perform the commit in that thread after some delay. I can't say that I am excited by this solution, because it depends on timing (sleeps), but I do not know a better one.

I had to disable your test_replication_start_subxid_overflow2 because it assumes that the commit is performed at some specific point.

Concerning the safe value for the number of restored running xids: as I mentioned above, it should be PGPROC_MAX_CACHED_SUBXIDS=64 to avoid any "too many KnownAssignedXids" errors. But this value is really too small; even without subxacts we can have more active transactions. Also, with this value your test_replication_start_subxid_overflow3 test doesn't pass, because the initial number of subxacts (2000) is larger than 64 and, once again, the replica fails to start. With the current limit of TOTAL_MAX_CACHED_SUBXIDS / 2 this test passes on my notebook, but I cannot exclude that it will fail with a "too many KnownAssignedXids" error in CI.

@knizhnik
Contributor Author

There seems to be one issue with your approach of restoring running xacts from CLOG in neon_rm_startup.
I looked at the failing test_2_replicas_start test. The test itself is very simple - it just spawns two replicas without inserting any data:

def test_2_replicas_start(neon_simple_env: NeonEnv):
    env = neon_simple_env

    with env.endpoints.create_start(
        branch_name="main",
        endpoint_id="primary",
    ) as primary:
        time.sleep(1)
        with env.endpoints.new_replica_start(
            origin=primary, endpoint_id="secondary1"
        ) as secondary1:
            with env.endpoints.new_replica_start(
                origin=primary, endpoint_id="secondary2"
            ) as secondary2:
                wait_replica_caughtup(primary, secondary1)
                wait_replica_caughtup(primary, secondary2)

If we look at the log of the first replica, we can see:

2024-05-24T12:19:43.934097Z  INFO logging and tracing started
...
PG:2024-05-24 12:19:43.987 GMT [28773] LOG:  consistent recovery state reached at 0/15142F8
...
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  neon_rm_startup
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  initialized known-assigned XIDs with 0 in-progress XIDs between 3 and 730 (max 3965)
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  redo starts at 0/15142F8

Please notice the 15-second pause before neon_rm_startup is invoked.
The reason is that RMGRs are initialised lazily, after receiving the first WAL record:

/*
 * Perform WAL recovery.
 *
 * If the system was shut down cleanly, this is never called.
 */
void
PerformWalRecovery(void)
{
       ...
	if (record != NULL)
	{
		TimestampTz xtime;
		PGRUsage	ru0;

		pg_rusage_init(&ru0);

		InRedo = true;

		RmgrStartup();

		ereport(LOG,
				(errmsg("redo starts at %X/%X",
						LSN_FORMAT_ARGS(xlogreader->ReadRecPtr))));

Since there are no updates and thus no WAL records, we have to wait 15 seconds until a running-xacts record is generated.

The situation with replica 2 is even worse. According to the log, recovery is not started on it at all (there is no "redo starts" record in the log), although the log is otherwise similar to the log of the first replica. The only difference is that they reach the consistent point at different LSNs:

First replica:

PG:2024-05-24 12:19:43.987 GMT [28773] LOG:  invalid record length at 0/15142F8: expected at least 24, got 0
PG:2024-05-24 12:19:43.987 GMT [28773] LOG:  consistent recovery state reached at 0/15142F8

Second replica:

PG:2024-05-24 12:20:04.082 GMT [28786] LOG:  invalid record length at 0/1514330: expected at least 24, got 0
PG:2024-05-24 12:20:04.082 GMT [28786] LOG:  consistent recovery state reached at 0/1514330

So we wait 15 seconds for the first replica to start up (until we get a running-xacts record).
But no further running-xacts record is generated, because nothing was added to the WAL:

 			/*
			 * Only log if enough time has passed and interesting records have
			 * been inserted since the last snapshot.  Have to compare with <=
			 * instead of < because GetLastImportantRecPtr() points at the
			 * start of a record, whereas last_snapshot_lsn points just past
			 * the end of the record.
			 */
			if (now >= timeout &&
				last_snapshot_lsn <= GetLastImportantRecPtr())
			{
				last_snapshot_lsn = LogStandbySnapshot();
				last_snapshot_ts = now;
			}

So my suggestion is to move RestoreRunningXactsFromClog back from the neon extension into xlog.c.
Yes, it requires changes to the Postgres kernel, but we are already making changes in this place.
And IMHO restoring the running-xacts information from CLOG looks much more natural there.
If you want to minimise these changes, we can register RestoreRunningXactsFromClog as a callback.

@hlinnaka
Contributor

If we look at the log of the first replica, we can see:

2024-05-24T12:19:43.934097Z  INFO logging and tracing started
...
PG:2024-05-24 12:19:43.987 GMT [28773] LOG:  consistent recovery state reached at 0/15142F8
...
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  neon_rm_startup
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  initialized known-assigned XIDs with 0 in-progress XIDs between 3 and 730 (max 3965)
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  redo starts at 0/15142F8

Please notice the 15-second pause before neon_rm_startup is invoked. The reason is that RMGRs are initialised lazily, after receiving the first WAL record:

/*
 * Perform WAL recovery.
 *
 * If the system was shut down cleanly, this is never called.
 */
void
PerformWalRecovery(void)
{
       ...
	if (record != NULL)
	{
		TimestampTz xtime;
		PGRUsage	ru0;

		pg_rusage_init(&ru0);

		InRedo = true;

		RmgrStartup();

		ereport(LOG,
				(errmsg("redo starts at %X/%X",
						LSN_FORMAT_ARGS(xlogreader->ReadRecPtr))));

Since there are no updates and thus no WAL records, we have to wait 15 seconds until a running-xacts record is generated.

Oh, darn. Ok, yeah, we need to call the function earlier then.

So my suggestion is to move RestoreRunningXactsFromClog back from the neon extension into xlog.c. Yes, it requires changes to the Postgres kernel, but we are already making changes in this place, and IMHO restoring the running-xacts information from CLOG looks much more natural there. If you want to minimise these changes, we can register RestoreRunningXactsFromClog as a callback.

Yeah, I don't see any good way to use the existing hooks for this. +1 for registering RestoreRunningXactsFromClog as a callback.
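
A rough shape of that callback wiring (all names here are hypothetical, just to illustrate the hook-style split between the patched Postgres and the extension; the actual hook added in the Neon fork may differ):

/* --- In the patched Postgres (e.g. xlog.c), hypothetical hook: --- */
typedef bool (*restore_running_xacts_callback_t) (void);
restore_running_xacts_callback_t restore_running_xacts_callback = NULL;

/*
 * Early in StartupXLOG(), before the standby would otherwise have to wait
 * for a running-xacts record:
 *
 *     if (restore_running_xacts_callback != NULL)
 *         (void) restore_running_xacts_callback();
 */

/* --- In the neon extension's _PG_init(): --- */
extern bool RestoreRunningXactsFromClog(void);	/* signature assumed */

void
_PG_init(void)
{
	restore_running_xacts_callback = RestoreRunningXactsFromClog;
}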

@knizhnik knizhnik requested a review from a team as a code owner May 26, 2024 18:47
hlinnaka added a commit that referenced this pull request Jun 30, 2024
This makes it much more convenient to use in the common case that you
want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the
argument doesn't work for the same reasons as explained in the
comments: we need to back off to the beginning of a page if the
previous record ended at page boundary.)

I plan to use this to fix the issue that Arseny Sher called out at
#7288 (comment)
hlinnaka added a commit that referenced this pull request Jun 30, 2024
The 'leftover' boolean was replaced with a semaphore in commit
f0e2bb7, but this initialization was missed. Remove it so that if a
test tries to access it, you get an error rather than always claiming
that the endpoint is not running.

Spotted by Arseny at
#7288 (comment)
hlinnaka added a commit that referenced this pull request Jun 30, 2024
The 'running' boolean was replaced with a semaphore in commit
f0e2bb7, but this initialization was missed. Remove it so that if a
test tries to access it, you get an error rather than always claiming
that the endpoint is not running.

Spotted by Arseny at
#7288 (comment)
hlinnaka and others added 2 commits July 1, 2024 12:58
@hlinnaka hlinnaka merged commit 9ce1930 into main Jul 1, 2024
57 checks passed
@hlinnaka hlinnaka deleted the restore_running_xids branch July 1, 2024 11:09
hlinnaka added a commit that referenced this pull request Jul 1, 2024
…8215)

arssher pushed a commit that referenced this pull request Jul 4, 2024
hlinnaka added a commit that referenced this pull request Jul 5, 2024
The pageserver tracks the latest XID seen in the WAL, in the nextXid
field in the "checkpoint" key-value pair. To reduce the churn on that
single storage key, it's not tracked exactly. Rather, when we advance
it, we always advance it to the next multiple of 1024 XIDs. That way,
we only need to insert a new checkpoint value to the storage every
1024 transactions.

However, read-only replicas now scan the CLOG at startup, to find any
XIDs that haven't been explicitly aborted or committed, and treat
them as still in-progress (PR #7288). When we bump up the nextXid
counter by 1024, all those skipped XIDs look like in-progress XIDs to a
read replica. There's a limited amount of space for tracking
in-progress XIDs, so there's more cost to skipping XIDs now. We had a
case in production where a read replica did not start up, because the
primary had gone through many restart cycles without writing any
running-xacts or checkpoint WAL records, and each restart added almost
1024 "orphaned" XIDs that had to be tracked as in-progress in the
replica. As soon as the primary writes a running-xacts or checkpoint
record, the orphaned XIDs can be removed from the in-progress XIDs
list and the problem resolves, but if those records are not written,
the orphaned XIDs just accumulate.

We should work harder to make sure that a running-xacts or checkpoint
record is written at primary startup or shutdown. But at the same
time, we can just make XID_CHECKPOINT_INTERVAL smaller, to consume
fewer XIDs in such scenarios. That means that we will generate more
versions of the checkpoint key-value pair in the storage, but we
haven't seen any problems with that so it's probably fine to go from
1024 to 128.
VladLazar pushed a commit that referenced this pull request Jul 8, 2024
…8215)
