Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tracking of the nextMulti in the pageserver's copy of CheckPoint #6528

Merged
merged 1 commit into from
Jun 30, 2024

Conversation

knizhnik
Copy link
Contributor

@knizhnik knizhnik commented Jan 30, 2024

Whenever we see an XLOG_MULTIXACT_CREATE_ID WAL record, we need to
update the nextMulti and NextMultiOffset fields in the pageserver's
copy of the CheckPoint struct, to cover the new multi-XID. In
PostgreSQL, this is done by updating an in-memory struct during WAL
replay, but because in Neon you can start a compute node at any LSN,
we need to have an up-to-date value pre-calculated in the pageserver
at all times. We do the same for nextXid.

However, we had a bug in WAL ingestion code that does that: the
multi-XIDs will wrap around at 2^32, just like XIDs, so we need to do
the comparisons in a wraparound-aware fashion.

Fix that, and add tests.

Fixes issue #6520

Co-authored-by: Konstantin Knizhnik knizhnik@neon.tech

@knizhnik knizhnik requested a review from a team as a code owner January 30, 2024 13:23
@knizhnik knizhnik requested review from jcsp and hlinnaka and removed request for a team and jcsp January 30, 2024 13:23
Copy link

github-actions bot commented Jan 30, 2024

2946 tests run: 2829 passed, 0 failed, 117 skipped (full report)


Flaky tests (2)

Postgres 15

  • test_wal_restore_initdb: debug
  • test_wal_restore: debug

Code coverage* (full report)

  • functions: 32.7% (6914 of 21129 functions)
  • lines: 50.1% (54205 of 108189 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
30027d9 at 2024-06-30T23:39:49.563Z :recycle:

Copy link
Contributor

@hlinnaka hlinnaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a test. Something similar to the test_import_at_2bil test should do the trick.

pageserver/src/walingest.rs Outdated Show resolved Hide resolved
pageserver/src/walingest.rs Outdated Show resolved Hide resolved
@hlinnaka
Copy link
Contributor

I did some work on this to get this to completion:

  • I refactored the code changes, by creating a new update_next_multixid() function similar to update_next_xid(). It's a lot simpler than update_next_xid(), so this is perhaps overkill, but it makes it also possible to write a unit test. That might also seem like overkill, but this wraparound behavior is very easy to get subtly wrong, so better safe than sorry. It's also nice that it's more symmetrical with update_next_xid().
  • I created a unit test for the update_next_multixid() function
  • I changed the code to not skip over multi-xid 0. Per discussion, Postgres also doesn't skip over it, which is admittedly weird, but I'd prefer to stick close to Postgres unless there's a reason to do something different.
  • I created a python test too. It uses pg_resetwal to get within spitting distance of multixid/offset wraparound, and then performs some real updates to get over the line

@hlinnaka hlinnaka changed the base branch from main to refactor-import-vanilla-pg-test June 13, 2024 22:32
pageserver/src/walingest.rs Show resolved Hide resolved
test_runner/regress/test_next_xid.py Show resolved Hide resolved
@arssher
Copy link
Contributor

arssher commented Jun 14, 2024

The CI v14 failure is interesting; TruncateMultiXact does enter critical section, and CompactCheckpointerRequestQueue does indeed pallocs. Not related to neon on the first glance.

@hlinnaka
Copy link
Contributor

The CI v14 failure is interesting; TruncateMultiXact does enter critical section, and CompactCheckpointerRequestQueue does indeed pallocs. Not related to neon on the first glance.

Yep, that's a PostgreSQL bug. I posted that to pgsql-hackers: https://www.postgresql.org/message-id/ccc66933-31c1-4f6a-bf4b-45fef0d4f22e%40iki.fi

@hlinnaka hlinnaka force-pushed the refactor-import-vanilla-pg-test branch from 988cd47 to 377db4a Compare June 18, 2024 11:27
@hlinnaka hlinnaka force-pushed the refactor-import-vanilla-pg-test branch 3 times, most recently from 6816144 to d9913e4 Compare June 26, 2024 21:11
Base automatically changed from refactor-import-vanilla-pg-test to main June 26, 2024 21:54
hlinnaka added a commit that referenced this pull request Jun 28, 2024
We hit that bug in a new test being added in PR #6528. We'd get the
fix from upstream with the next minor release anyway, but cherry-pick
it now to unblock PR #6528.

Upstream commit b1ffe3ff0b.
@hlinnaka hlinnaka changed the base branch from main to cherry-pick-upstream-TruncateMultiXact-fix June 28, 2024 09:13
@hlinnaka hlinnaka force-pushed the next_multi branch 2 times, most recently from f62e7b7 to 48c0450 Compare June 28, 2024 09:21
hlinnaka added a commit that referenced this pull request Jun 28, 2024
We hit that bug in a new test being added in PR #6528. We'd get the fix
from upstream with the next minor release anyway, but cherry-pick it now
to unblock PR #6528.

Upstream commit b1ffe3ff0b.

See
#6528 (comment)
Base automatically changed from cherry-pick-upstream-TruncateMultiXact-fix to main June 28, 2024 13:47
@hlinnaka hlinnaka force-pushed the next_multi branch 3 times, most recently from 0cfd780 to 4eb4a6a Compare June 30, 2024 22:47
@hlinnaka hlinnaka changed the title Fix calculation of nextMulti Fix tracking of the nextMulti in the pageserver's copy of CheckPoint Jun 30, 2024
…6528)

Whenever we see an XLOG_MULTIXACT_CREATE_ID WAL record, we need to
update the nextMulti and NextMultiOffset fields in the pageserver's
copy of the CheckPoint struct, to cover the new multi-XID. In
PostgreSQL, this is done by updating an in-memory struct during WAL
replay, but because in Neon you can start a compute node at any LSN,
we need to have an up-to-date value pre-calculated in the pageserver
at all times. We do the same for nextXid.

However, we had a bug in WAL ingestion code that does that: the
multi-XIDs will wrap around at 2^32, just like XIDs, so we need to do
the comparisons in a wraparound-aware fashion.

Fix that, and add tests.

Fixes issue #6520

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
@hlinnaka hlinnaka merged commit 30027d9 into main Jun 30, 2024
64 checks passed
@hlinnaka hlinnaka deleted the next_multi branch June 30, 2024 23:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants