
Restore running xacts from CLOG on replica startup #7288

Merged
merged 2 commits into main from restore_running_xids on Jul 1, 2024

Conversation

knizhnik
Contributor

@knizhnik knizhnik commented Apr 2, 2024

We have one pretty serious MVCC visibility bug with hot standby
replicas. We incorrectly treat any transactions that are in progress
in the primary, when the standby is started, as aborted. That can
break MVCC for queries running concurrently in the standby. It can
also lead to hint bits being set incorrectly, and that damage can last
until the replica is restarted.

The fundamental bug was that we treated any replica start as starting
from a shut down server. The fix for that is straightforward: we need
to set 'wasShutdown = false' in InitWalRecovery() (see changes in the
postgres repo).

However, that introduces a new problem: with wasShutdown = false, the
standby will not open up for queries until it receives a running-xacts
WAL record from the primary. That's correct, and that's how Postgres
hot standby always works. But it's a problem for Neon, because:

  • It changes the historical behavior for existing users. Currently,
    the standby immediately opens up for queries, so if they now need to
    wait, we can break existing use cases that were working fine
    (assuming you don't hit the MVCC issues).

  • The problem is much worse for Neon than it is for standalone
    PostgreSQL, because in Neon, we can start a replica from an
    arbitrary LSN. In standalone PostgreSQL, the replica always starts
    WAL replay from a checkpoint record, and the primary arranges things
    so that there is always a running-xacts record soon after each
    checkpoint record. You can still hit this issue with PostgreSQL if
    you have a transaction with lots of subtransactions running in the
    primary, but it's pretty rare in practice.

To mitigate that, we introduce another way to collect the
running-xacts information at startup, without waiting for the
running-xacts WAL record: we can scan the CLOG for XIDs that haven't been
marked as committed or aborted. It has limitations with
subtransactions too, but should mitigate the problem for most users.

See #7236.
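
To illustrate the idea, here is a simplified sketch using the stock CLOG and procarray APIs (not the actual RestoreRunningXactsFromClog added by this PR; error handling, locking and the overflow limit discussed below are omitted): scan the CLOG from the oldest active XID up to nextXid, keep every XID that is still marked in-progress, and hand the list to the standby machinery as if it had arrived in a running-xacts record.

#include "postgres.h"

#include "access/clog.h"
#include "access/transam.h"
#include "storage/procarray.h"
#include "storage/standby.h"

/* Sketch only: collect XIDs that CLOG still shows as in-progress. */
static bool
restore_running_xacts_sketch(TransactionId oldestActiveXid,
                             TransactionId nextXid,
                             TransactionId latestCompletedXid,
                             int max_xids)
{
	TransactionId *xids = palloc(max_xids * sizeof(TransactionId));
	int			nxids = 0;
	TransactionId xid = oldestActiveXid;
	RunningTransactionsData running;

	while (TransactionIdPrecedes(xid, nextXid))
	{
		XLogRecPtr	lsn;

		/* neither committed nor aborted in CLOG => treat it as still running */
		if (TransactionIdGetStatus(xid, &lsn) == TRANSACTION_STATUS_IN_PROGRESS)
		{
			if (nxids >= max_xids)
				return false;	/* too many: fall back to waiting for running-xacts */
			xids[nxids++] = xid;
		}
		TransactionIdAdvance(xid);
	}

	/* Apply the list as if it had come from a running-xacts WAL record. */
	running.xcnt = nxids;
	running.subxcnt = 0;
	running.subxid_overflow = false;
	running.nextXid = nextXid;
	running.oldestRunningXid = oldestActiveXid;
	running.latestCompletedXid = latestCompletedXid;
	running.xids = xids;
	ProcArrayApplyRecoveryInfo(&running);

	return true;
}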

@knizhnik knizhnik requested review from a team as code owners April 2, 2024 14:17
@hlinnaka
Contributor

hlinnaka commented Apr 2, 2024

That's a pleasantly short patch. Still highly work-in-progress obviously, but I think this is promising.

Issues / TODOs:

  • Can we do this in the neon extension instead of patching StartupXLog? I think so. You can do this after startup, and call ProcArrayApplyRecoveryInfo() with the list of xids.
  • What to do if the list of XIDs is too large? As this stands, you'll get a "too many KnownAssignedXids" error, which isn't great.

Unless we come up with some creative solution to the "too many KnownAssignedXids" problem, I think we can only use this in limited cases, when there are only a few in-progress (sub-)XIDs. But it might still be worth doing; that might cover 90% of the most common real-world scenarios, as long as we can fail gracefully or fall back to something else when it doesn't work out.


github-actions bot commented Apr 2, 2024

3000 tests run: 2885 passed, 0 failed, 115 skipped (full report)


Flaky tests (1)

Postgres 14

Code coverage* (full report)

  • functions: 32.7% (6914 of 21129 functions)
  • lines: 50.1% (54203 of 108195 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
9ce1930 at 2024-07-01T10:45:00.748Z :recycle:

@knizhnik
Contributor Author

knizhnik commented Apr 2, 2024

That's a pleasantly short patch. Still highly work-in-progress obviously, but I think this is promising.

Issues / TODOs:

  • Can we do this in the neon extension instead of patching StartupXLog? I think so. You can do this after startup, and call ProcArrayApplyRecoveryInfo() with the list of xids.

There are still some other Neon-specific changes in xlog.c/xlogrecovery.c related to startup, so I do not think that adding one more Neon-specific function fundamentally changes anything here.

I haven't yet considered moving this code to the neon extension; maybe it is possible. But it interacts tightly with the recovery state machine, so it may be non-trivial.

  • What to do if the list of XIDs is too large? As this stands, you'll get a "too many KnownAssignedXids" error, which isn't great.

Unless we come up with some creative solution to the "too many KnownAssignedXids" problem, I think we can only use this in limited cases, when there are only a few in-progress (sub-)XIDs. But it might still be worth doing; that might cover 90% of the most common real-world scenarios, as long as we can fail gracefully or fall back to something else when it doesn't work out.

I do not quite understand why this problem is specific to this particular way of propagating information about running xacts at replica startup.
If there are too many active (sub)transactions, then we should get "too many KnownAssignedXids" in RecordKnownAssignedTransactionIds anyway, shouldn't we?

I think we can only use this in limited cases, when there are only a few in-progress (sub-)XIDs.

I think it is more correct to say that we cannot use this mechanism only in a limited number of cases (thousands of subtransactions), because maxKnownAssignedXids = 64*MaxBackends. But as I mentioned above, if there is such a large number of active xids, then we get this error no matter which mechanism is used to solve the running-xacts problem.

Member

@koivunej koivunej left a comment


walingest changes look good, almost like a separate bugfix, but perhaps it does not have any effect without the accompanying postgres changes?

@knizhnik
Contributor Author

knizhnik commented Apr 4, 2024

walingest changes look good, almost like a separate bugfix, but perhaps it does not have any effect without the accompanying postgres changes?

Yes, an up-to-date value of oldestActiveXid is critical only in the context of this PR, where it is used to restore the running-xacts info.

@andreasscherbaum
Contributor

Discovered #7794 while working on this PR.

@hlinnaka
Contributor

I added two test cases to this, which are failing. They demonstrate two problems with the known-xid overflow:

  1. If there are a lot of (sub-)XIDs running in the primary when you start the standby, we only put up to TOTAL_MAX_CACHED_SUBXIDS/2 of them in the known-xids array. The rest are lost. This is a known issue; there's an XXX comment about it in RestoreRunningXactsFromClog(). This test demonstrates the consequences of that: you get incorrect query results in the standby.

  2. Even though we leave half of the known-xids array reserved for new XIDs that we will see during WAL replay, it might not be enough. If you run out of space in the array later, the standby crashes with FATAL: too many KnownAssignedXids.

@hlinnaka
Contributor

Move this code to extension together with some other Neon changes in xlog.c code?

I think you could call RestoreRunningXactsFromClog() from neon_rm_startup. That gets called at the beginning of WAL recovery, which happens to be called at the right time, even though this has nothing to do with initializing the neon RM.
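
For context, neon_rm_startup here is the rm_startup callback of the custom resource manager registered by the neon extension. A minimal sketch of that wiring, using the stock PostgreSQL 15 custom-rmgr API for illustration (RM_EXPERIMENTAL_ID and the RestoreRunningXactsFromClog signature are assumptions, and the other rmgr callbacks are omitted):

#include "postgres.h"

#include "access/xlog_internal.h"	/* RmgrData, RegisterCustomRmgr */
#include "fmgr.h"

PG_MODULE_MAGIC;

extern void RestoreRunningXactsFromClog(void);	/* signature assumed */

static void
neon_rm_startup(void)
{
	/* RmgrStartup() calls this when WAL redo begins on the standby. */
	RestoreRunningXactsFromClog();
}

static const RmgrData NeonRmgr = {
	.rm_name = "neon",
	.rm_startup = neon_rm_startup,
	/* rm_redo, rm_desc, rm_identify, ... omitted in this sketch */
};

void
_PG_init(void)
{
	RegisterCustomRmgr(RM_EXPERIMENTAL_ID, &NeonRmgr);
}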

@knizhnik
Contributor Author

Move this code to extension together with some other Neon changes in xlog.c code?

I think you could call RestoreRunningXactsFromClog() from neon_rm_startup. That gets called at the beginning of WAL recovery, which happens to be called at the right time, even though this has nothing to do with initializing the neon RM.

RestoreRunningXactsFromClog does nothing more than construct a list of TransactionIds, which is later passed to ProcArrayApplyRecoveryInfo. Should that also be called from the extension? It seems to be non-trivial, because the other information needed for this call (nextXid, oldestActiveXID, latestCompletedXid) is available only in xlog.c and it is not clear how to pass it to the extension. Also, ProcArrayApplyRecoveryInfo has to be called at the proper moment, and I am not sure that neon_rm_startup is such a moment.

So frankly speaking, I prefer to postpone refactoring xlog.c and extracting the Neon-specific changes from it. Right now it is more critical to fix the real problem with running xacts and RO replicas.

If we consider restoring this information from CLOG to be the proper solution (or at least a temporary workaround until the CSN patch is committed in vanilla), then let's decide what to do with known-xids overflow: just throw an error, restrict the number of restored subxacts, ... I am ok with any of these solutions.

As far as I understand, the vanilla approach to this problem is the following: if there is subxid overflow, then just do not produce a running-xacts WAL record and wait until there are fewer active subxacts. This approach should never cause "too many known xids" errors but can delay replica startup indefinitely. Not sure that we can do something similar in our case.

IMHO a large number (thousands) of subxacts is a very rare use case, and replication is also not needed by everybody. Combined, the probability that some customer ever hits this problem is close to zero. Certainly it would be nice to have a solution that works in 100% of cases, and CSN is such a solution. But trying to solve the problem within the existing model seems to be just a waste of time.

@hlinnaka
Contributor

If we consider restoring this information from CLOG to be the proper solution (or at least a temporary workaround until the CSN patch is committed in vanilla), then let's decide what to do with known-xids overflow: just throw an error, restrict the number of restored subxacts, ... I am ok with any of these solutions.

+1, let's nail down what the behavior will be.

As far as I understand, the vanilla approach to this problem is the following: if there is subxid overflow, then just do not produce a running-xacts WAL record and wait until there are fewer active subxacts.

It's a bit more subtle than that: Postgres always creates the running-xacts record, but if the subxid caches have overflowed, then the running-xacts record doesn't include the sub-XIDs that had overflowed. In that case, the standby needs to wait until all the overflowed XIDs have completed. When it sees another running-xacts record with a higher oldestActiveXid, it can start accepting queries.

This approach should never cause "too many known xids" errors but can delay replica startup indefinitely. Not sure that we can do something similar in our case.

Correct. To be clear, the infinite delay would happen in the case that you have one transaction open in the primary that you never commit. If you continuously start and end transactions, the standby will eventually start up, even if you have millions of transactions open all the time.

@hlinnaka
Contributor

If we consider restoring this information from CLOG to be the proper solution (or at least a temporary workaround until the CSN patch is committed in vanilla), then let's decide what to do with known-xids overflow: just throw an error, restrict the number of restored subxacts, ... I am ok with any of these solutions.

+1, let's nail down what the behavior will be.

I'm thinking:

If the list of XIDs scanned from CLOG fits in the known-assigned XIDs array, with enough free space left over that we can be sure we won't run out of space later during replay either, then start with that. Otherwise, bail out and wait for the next running-xacts record, like Postgres normally does.
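
In code terms, something like the gate below (the names and the exact headroom fraction are illustrative assumptions, not what the PR finally settled on):

/*
 * Sketch: only install the CLOG-derived list if it leaves enough headroom in
 * the KnownAssignedXids array for the XIDs that WAL replay will add later;
 * otherwise behave like vanilla hot standby and wait for the next
 * running-xacts record.
 */
static bool
clog_snapshot_fits(int nxids_from_clog, int max_known_assigned_xids)
{
	/* assumed headroom: keep at least half of the array free for replay */
	return nxids_from_clog <= max_known_assigned_xids / 2;
}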

@hlinnaka
Contributor

(moving this comment from #7855 to here, to keep the discussion in one place):

The tests that this adds are failing; that's expected. They demonstrate the two cases where this optimization fails: in one case the known-assigned xids array is overflowed, so we wait for the running-xacts record, and in the other case it overflows later, during WAL replay, so we crash with FATAL: too many KnownAssignedXids. We should adjust the test cases so that they expect those results and pass.

@hlinnaka
Contributor

There are a bunch of FIXMEs, TODOs, XXX comments here. Need to address them.

Two open items in particular:

  • Figure out why checkPoint.oldestActiveXid is sometimes InvalidTransactionId. There's currently a hack to set it to FirstNormalTransactionId if it's invalid, but that doesn't sound right.
  • Calculate what is the max number of XIDs we can put in the known-assigned XIDs array, and still be guaranteed that we won't hit the "too many KnownAssignedXids" error later during replay. Depending on how small that number is, we might want to make the known-assigned XIDs array larger in neon, to reserve some space for the XIDs we scan from CLOG. And decide if we'd rather risk getting the "too many KnownAssignedXids" error later, or wait for running-xacts record to arrive.

@knizhnik
Contributor Author

There are a bunch of FIXMEs, TODOs, XXX comments here. Need to address them.

  • Calculate what is the max number of XIDs we can put in the known-assigned XIDs array, and still be guaranteed that we won't hit the "too many KnownAssignedXids" error later during replay. Depending on how small that number is, we might want to make the known-assigned XIDs array larger in neon, to reserve some space for the XIDs we scan from CLOG. And decide if we'd rather risk getting the "too many KnownAssignedXids" error later, or wait for running-xacts record to arrive.

I looked into it some more. Right now the situation is the following:

  • The procarray contains "cached subxids"; their number is limited by PGPROC_MAX_CACHED_SUBXIDS=64.
  • If the number of subxacts exceeds PGPROC_MAX_CACHED_SUBXIDS, then the proc entry is marked as overflowed.
  • When GetRunningTransactionData scans the procarray to extract information about running xids, it sets suboverflowed=true if any of the entries is overflowed. In this case information about subxids is NOT included in the RUNNING_XACTS record.
  • The maximal size of known xids is defined as ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS). It means that if none of the transactions has more than 64 subxacts, then the information about running xacts should fit in known xids, so we should normally never get "too many KnownAssignedXids" with vanilla Postgres.
  • When a starting replica sees a RUNNING_XACTS record with subxid_overflow, it switches to state STANDBY_SNAPSHOT_PENDING and waits until either it receives a non-overflowed snapshot or oldestRunningXid is later than the nextXid from the initial snapshot.

If we restore running xids from CLOG, we can fill in more than PGPROC_MAX_CACHED_SUBXIDS entries. If, after the replica starts, we start multiple transactions with subxids, then we can easily cause an overflow of KnownAssignedXids.

So it looks like the only safe limit for restoring running xacts from CLOG is PGPROC_MAX_CACHED_SUBXIDS (64), which seems to be too small. I do not know what is better: use this safe limit, which can significantly slow down replica startup, or accept replica failure due to "too many KnownAssignedXids" in some very rare cases.

Also please note that all the arithmetic above takes into account only subxids. But KnownAssignedXids also includes normal (top-level) transaction XIDs, and their number can be larger than PGPROC_MAX_CACHED_SUBXIDS.
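
Putting illustrative numbers on that formula (MaxBackends = 100 is just an example value, and PROCARRAY_MAXPROCS is approximated by MaxBackends here):

/* Illustrative sizing only; these are stand-in values, not taken from the PR. */
#define EXAMPLE_PGPROC_MAX_CACHED_SUBXIDS 64	/* Postgres default */
#define EXAMPLE_PROCARRAY_MAXPROCS        100	/* assumed MaxBackends-like value */

/*
 * (64 + 1) * 100 = 6500 KnownAssignedXids slots in this example.  A single
 * transaction restored from CLOG with a few thousand subxids already consumes
 * a large share of that, which is why later WAL replay can hit the limit.
 */
#define EXAMPLE_MAX_KNOWN_ASSIGNED_XIDS \
	((EXAMPLE_PGPROC_MAX_CACHED_SUBXIDS + 1) * EXAMPLE_PROCARRAY_MAXPROCS)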

@knizhnik
Contributor Author

@hlinnaka what are the further steps with this PR?
The approach you proposed (restore from CLOG as much as we can and wait for running-xacts in case of overflow) seems to be correct and works as expected. But it is not compatible with our tests! In case of overflow, the replica will wait for a non-overflowed running-xacts record, and that can arrive only after a commit. So starting the replica and only then doing the commit is not possible.

I have changed my test to spawn a thread and perform the commit in that thread after some delay. I can't say that I am excited by this solution, because it depends on timing (sleeps), but I do not know a better one.

I had to disable your test_replication_start_subxid_overflow2 because it assumes that the commit is performed at some specific point.

Concerning the safe value for the number of restored running xids: as I mentioned above, it should be PGPROC_MAX_CACHED_SUBXIDS=64 to avoid any "too many KnownAssignedXids" errors. But this value is really too small; even without subxacts we can have more active transactions. Also, with this value your test_replication_start_subxid_overflow3 test doesn't pass, because the initial number of subxacts (2000) is larger than 64 and, once again, the replica fails to start. With the current limit of TOTAL_MAX_CACHED_SUBXIDS / 2 this test passes on my notebook, but I cannot exclude that it will fail with a "too many KnownAssignedXids" error in CI.

@knizhnik
Contributor Author

There seems to be one issue with your approach of restoring running xacts from CLOG in neon_rm_startup.
I looked at the failing test_2_replicas_start test. The test itself is very simple - it just spawns two replicas without inserting any data:

def test_2_replicas_start(neon_simple_env: NeonEnv):
    env = neon_simple_env

    with env.endpoints.create_start(
        branch_name="main",
        endpoint_id="primary",
    ) as primary:
        time.sleep(1)
        with env.endpoints.new_replica_start(
            origin=primary, endpoint_id="secondary1"
        ) as secondary1:
            with env.endpoints.new_replica_start(
                origin=primary, endpoint_id="secondary2"
            ) as secondary2:
                wait_replica_caughtup(primary, secondary1)
                wait_replica_caughtup(primary, secondary2)

If we look at the log of the first replica, we can see:

2024-05-24T12:19:43.934097Z  INFO logging and tracing started
...
PG:2024-05-24 12:19:43.987 GMT [28773] LOG:  consistent recovery state reached at 0/15142F8
...
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  neon_rm_startup
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  initialized known-assigned XIDs with 0 in-progress XIDs between 3 and 730 (max 3965)
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  redo starts at 0/15142F8

Please notice the 15-second pause before neon_rm_startup is invoked.
The reason is that RMGRs are initialised lazily, after receiving the first WAL record:

/*
 * Perform WAL recovery.
 *
 * If the system was shut down cleanly, this is never called.
 */
void
PerformWalRecovery(void)
{
       ...
	if (record != NULL)
	{
		TimestampTz xtime;
		PGRUsage	ru0;

		pg_rusage_init(&ru0);

		InRedo = true;

		RmgrStartup();

		ereport(LOG,
				(errmsg("redo starts at %X/%X",
						LSN_FORMAT_ARGS(xlogreader->ReadRecPtr))));

Since there are no updates and thus no WAL records, we have to wait 15 seconds until a running-xacts record is generated.

The situation with replica 2 is even worse. According to the log, recovery is not started on it at all (there is no "redo starts" record in the log), although the log is otherwise similar to the log of the first replica. The only difference is that they reach the consistent point at different LSNs:

First replica:

PG:2024-05-24 12:19:43.987 GMT [28773] LOG:  invalid record length at 0/15142F8: expected at least 24, got 0
PG:2024-05-24 12:19:43.987 GMT [28773] LOG:  consistent recovery state reached at 0/15142F8

Second replica:

PG:2024-05-24 12:20:04.082 GMT [28786] LOG:  invalid record length at 0/1514330: expected at least 24, got 0
PG:2024-05-24 12:20:04.082 GMT [28786] LOG:  consistent recovery state reached at 0/1514330

So we wait 15 seconds for the first replica to start up (until we get a running-xacts record).
But no further running-xacts record is generated, because nothing was added to the WAL:

 			/*
			 * Only log if enough time has passed and interesting records have
			 * been inserted since the last snapshot.  Have to compare with <=
			 * instead of < because GetLastImportantRecPtr() points at the
			 * start of a record, whereas last_snapshot_lsn points just past
			 * the end of the record.
			 */
			if (now >= timeout &&
				last_snapshot_lsn <= GetLastImportantRecPtr())
			{
				last_snapshot_lsn = LogStandbySnapshot();
				last_snapshot_ts = now;
			}

So my suggestion is to move RestoreRunningXactsFromClog back from the neon extension into xlog.c.
Yes, it requires changes to the Postgres kernel, but we are already making changes in this place.
And IMHO restoring the running-xacts information from CLOG looks much more natural there.
If you want to minimise these changes, we can register RestoreRunningXactsFromClog as a callback.

@hlinnaka
Contributor

If we look at the log of the first replica, we can see:

2024-05-24T12:19:43.934097Z  INFO logging and tracing started
...
PG:2024-05-24 12:19:43.987 GMT [28773] LOG:  consistent recovery state reached at 0/15142F8
...
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  neon_rm_startup
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  initialized known-assigned XIDs with 0 in-progress XIDs between 3 and 730 (max 3965)
PG:2024-05-24 12:20:03.854 GMT [28773] LOG:  redo starts at 0/15142F8

Please notice the 15-second pause before neon_rm_startup is invoked. The reason is that RMGRs are initialised lazily, after receiving the first WAL record:

/*
 * Perform WAL recovery.
 *
 * If the system was shut down cleanly, this is never called.
 */
void
PerformWalRecovery(void)
{
       ...
	if (record != NULL)
	{
		TimestampTz xtime;
		PGRUsage	ru0;

		pg_rusage_init(&ru0);

		InRedo = true;

		RmgrStartup();

		ereport(LOG,
				(errmsg("redo starts at %X/%X",
						LSN_FORMAT_ARGS(xlogreader->ReadRecPtr))));

Since there are no updates and thus no WAL records, we have to wait 15 seconds until a running-xacts record is generated.

Oh, darn. Ok, yeah, we need to call the function earlier then.

So my suggestion is to move RestoreRunningXactsFromClog back from the neon extension into xlog.c. Yes, it requires changes to the Postgres kernel, but we are already making changes in this place, and IMHO restoring the running-xacts information from CLOG looks much more natural there. If you want to minimise these changes, we can register RestoreRunningXactsFromClog as a callback.

Yeah, I don't see any good way to use the existing hooks for this. +1 for registering RestoreRunningXactsFromClog as a callback.
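
A rough shape of that callback wiring (all names here are hypothetical, just to illustrate the hook-style split between the patched Postgres and the extension; the actual hook added in the Neon fork may differ):

/* --- In the patched Postgres (e.g. xlog.c), hypothetical hook: --- */
typedef bool (*restore_running_xacts_callback_t) (void);
restore_running_xacts_callback_t restore_running_xacts_callback = NULL;

/*
 * Early in StartupXLOG(), before the standby would otherwise have to wait
 * for a running-xacts record:
 *
 *     if (restore_running_xacts_callback != NULL)
 *         (void) restore_running_xacts_callback();
 */

/* --- In the neon extension's _PG_init(): --- */
extern bool RestoreRunningXactsFromClog(void);	/* signature assumed */

void
_PG_init(void)
{
	restore_running_xacts_callback = RestoreRunningXactsFromClog;
}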

@knizhnik knizhnik requested a review from a team as a code owner May 26, 2024 18:47
hlinnaka added a commit that referenced this pull request Jun 30, 2024
This makes it much more convenient to use in the common case that you
want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the
argument doesn't work for the same reasons as explained in the
comments: we need to back off to the beginning of a page if the
previous record ended at page boundary.)

I plan to use this to fix the issue that Arseny Sher called out at
#7288 (comment)
hlinnaka added a commit that referenced this pull request Jun 30, 2024
The 'leftover' boolean was replaced with a semaphore in commit
f0e2bb7, but this initialization was missed. Remove it so that if a
test tries to access it, you get an error rather than always claiming
that the endpoint is not running.

Spotted by Arseny at
#7288 (comment)
hlinnaka added a commit that referenced this pull request Jun 30, 2024
The 'running' boolean was replaced with a semaphore in commit
f0e2bb7, but this initialization was missed. Remove it so that if a
test tries to access it, you get an error rather than always claiming
that the endpoint is not running.

Spotted by Arseny at
#7288 (comment)
hlinnaka and others added 2 commits July 1, 2024 12:58
@hlinnaka hlinnaka merged commit 9ce1930 into main Jul 1, 2024
57 checks passed
@hlinnaka hlinnaka deleted the restore_running_xids branch July 1, 2024 11:09
hlinnaka added a commit that referenced this pull request Jul 1, 2024
…8215)

arssher pushed a commit that referenced this pull request Jul 4, 2024
hlinnaka added a commit that referenced this pull request Jul 5, 2024
The pageserver tracks the latest XID seen in the WAL, in the nextXid
field in the "checkpoint" key-value pair. To reduce the churn on that
single storage key, it's not tracked exactly. Rather, when we advance
it, we always advance it to the next multiple of 1024 XIDs. That way,
we only need to insert a new checkpoint value to the storage every
1024 transactions.

However, read-only replicas now scan the CLOG at startup, to find any
XIDs that haven't been explicitly aborted or committed, and treat
them as still in-progress (PR #7288). When we bump up the nextXid
counter by 1024, all those skipped XIDs look like in-progress XIDs to a
read replica. There's a limited amount of space for tracking
in-progress XIDs, so there's more cost to skipping XIDs now. We had a
case in production where a read replica did not start up, because the
primary had gone through many restart cycles without writing any
running-xacts or checkpoint WAL records, and each restart added almost
1024 "orphaned" XIDs that had to be tracked as in-progress in the
replica. As soon as the primary writes a running-xacts or checkpoint
record, the orphaned XIDs can be removed from the in-progress XIDs
list and the problem resolves, but if those records are not written,
the orphaned XIDs just accumulate.

We should work harder to make sure that a running-xacts or checkpoint
record is written at primary startup or shutdown. But at the same
time, we can just make XID_CHECKPOINT_INTERVAL smaller, to consume
fewer XIDs in such scenarios. That means that we will generate more
versions of the checkpoint key-value pair in the storage, but we
haven't seen any problems with that so it's probably fine to go from
1024 to 128.
VladLazar pushed a commit that referenced this pull request Jul 8, 2024
…8215)
