Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add neon.running_xacts_overflow_policy to make it possible for RO replica to startup without primary even in case running xacts overflow #8323

Merged
merged 7 commits into from
Jul 15, 2024

Conversation

knizhnik
Copy link
Contributor

@knizhnik knizhnik commented Jul 9, 2024

Problem

Right now if there are too many running xacts to be restored from CLOG at replica startup,
then replica is not trying to restore them and wait for non-overflown running-xacs WAL record from primary.
But if primary is not active, then replica will not start at all.

Too many running xacts can be caused by transactions with large number of subtractions.
But right now it can be also cause by two reasons:

  • Lack of shutdown checkpoint which updates oldestRunningXid (because of immediate shutdown)
  • nextXid alignment on 1024 boundary (which cause loosing ~1k XIDs on each restart)

Both problems are somehow addressed now.
But we have existed customers with "sparse" CLOG and lack of checkpoints.
To be able to start RO replicas for such customers I suggest to add GUC which allows replica to start even in case of subxacts overflow.

Summary of changes

Add neon.running_xacts_overflow_policy with the following values:

  • ignore: restore from CLOG last N XIDs and accept connections
  • skip: do not restore any XIDs from CXLOGbut still accept connections
  • wait: wait non-overflown running xacts record from primary node

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

…lica to startup without primary even in case running xacts overflow
@knizhnik knizhnik requested review from a team as code owners July 9, 2024 13:33
@hlinnaka
Copy link
Contributor

hlinnaka commented Jul 9, 2024

Hmm, what's the plan on how this would be used? Setting GUCs for individual endpoints is not very convenient.

Alternatively, you can briefly start and stop the primary, so that it writes the shutdown checkpoint. That seems more convenient.

Copy link

github-actions bot commented Jul 9, 2024

3060 tests run: 2945 passed, 0 failed, 115 skipped (full report)


Flaky tests (4)

Postgres 16

  • test_sharding_split_compaction[compact-shard-ancestors-enqueued]: debug
  • test_s3_eviction[0.2-False]: debug

Postgres 14

  • test_secondary_background_downloads: debug
  • test_s3_eviction[0.2-False]: release

Code coverage* (full report)

  • functions: 32.7% (6984 of 21338 functions)
  • lines: 50.1% (55001 of 109719 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
b3f0388 at 2024-07-12T20:26:40.236Z :recycle:

pgxn/neon/neon.c Show resolved Hide resolved
pgxn/neon/neon.c Show resolved Hide resolved
pgxn/neon/neon.c Show resolved Hide resolved
pgxn/neon/neon.c Outdated Show resolved Hide resolved
test_runner/regress/test_replica_start.py Show resolved Hide resolved
@knizhnik knizhnik requested a review from ololobus July 13, 2024 07:41
Copy link
Member

@ololobus ololobus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@knizhnik
Copy link
Contributor Author

LGTM

You are still not a member of @neondatabase/compute so your review is not considered as review of code owner:)

@ololobus
Copy link
Member

You are still not a member of @neondatabase/compute so your review is not considered as review of code owner:)

Added myself to the group, so I think it's good to go, unless you want me to ping someone else for review :)

@knizhnik knizhnik merged commit 8a8b83d into main Jul 15, 2024
65 checks passed
@knizhnik knizhnik deleted the ignore_running_xacts_overflow branch July 15, 2024 12:52
skyzh pushed a commit that referenced this pull request Jul 15, 2024
…lica to startup without primary even in case running xacts overflow (#8323)

## Problem

Right now if there are too many running xacts to be restored from CLOG
at replica startup,
then replica is not trying to restore them and wait for non-overflown
running-xacs WAL record from primary.
But if primary is not active, then replica will not start at all.

Too many running xacts can be caused by transactions with large number
of subtractions.
But right now it can be also cause by two reasons:
- Lack of shutdown checkpoint which updates `oldestRunningXid` (because
of immediate shutdown)
- nextXid alignment on 1024 boundary (which cause loosing ~1k XIDs on
each restart)

Both problems are somehow addressed now.
But we have existed customers with "sparse" CLOG and lack of
checkpoints.
To be able to start RO replicas for such customers I suggest to add GUC
which allows replica to start even in case of subxacts overflow.

## Summary of changes

Add `neon.running_xacts_overflow_policy` with the following values:
- ignore: restore from CLOG last N XIDs and accept connections
- skip: do not restore any XIDs from CXLOGbut still accept connections
- wait: wait non-overflown running xacts record from primary node

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants