
Logical Replication transactions applied multiple times #6977

Closed
Tracked by #1632
save-buffer opened this issue Feb 29, 2024 · 4 comments
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug


@save-buffer
Contributor

save-buffer commented Feb 29, 2024

Steps to reproduce

Make Neon a subscriber to some workload that produces several inserts/sec on a table with a primary key. Then restart it. Replication will fail because of duplicate key insert.

Expected result

No duplicate key insert

Actual result

Duplicate key insert

Environment

Logs, links

Discussion: https://neondb.slack.com/archives/C04DGM6SMTM/p1708363190710839

It seems that the pageserver isn't applying/advancing the replication origin, so the compute's origin position when it restarts is whatever was in the last checkpoint.

@save-buffer save-buffer added the t/bug Issue Type: Bug label Feb 29, 2024
@kelvich
Contributor

kelvich commented Mar 12, 2024

To achieve exactly-once semantics for incoming logical replication, postgres stamps each commit record with a 10-byte tuple of (origin_id, origin_lsn), where origin_id is a two-byte unique identifier of the replication source and origin_lsn is the 8-byte LSN on the source side. Postgres maintains an in-memory mapping of origin_id => origin_lsn and writes it to disk on each checkpoint, including the shutdown checkpoint. On a normal start postgres just reads this on-disk state. During a start after a crash, postgres loads the last on-disk state and, on WAL replay, updates the in-memory values on each commit record, ending up in a consistent state.
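The checkpoint-plus-replay scheme described above can be modeled in a few lines. This is an illustrative sketch, not Postgres source: the function names and the simulated "snapshot" are hypothetical, standing in for the `pg_logical/replorigin_checkpoint` file and WAL redo.

```python
# Illustrative model of replication-origin progress tracking: an in-memory
# map origin_id -> origin_lsn, persisted at checkpoints and rebuilt on crash
# recovery by replaying the commit records written after the checkpoint.

def checkpoint(origin_map):
    """Persist the current origin progress (a stand-in for the on-disk
    replorigin checkpoint file)."""
    return dict(origin_map)  # simulated on-disk snapshot

def recover(snapshot, wal_commits_after_checkpoint):
    """Crash recovery: load the last snapshot, then replay commit records."""
    origin_map = dict(snapshot)
    for origin_id, origin_lsn in wal_commits_after_checkpoint:
        # Each stamped commit record advances the remote LSN for its origin.
        origin_map[origin_id] = max(origin_map.get(origin_id, 0), origin_lsn)
    return origin_map

# One origin (id=1) checkpointed at remote LSN 100; three more commits land
# in WAL before a crash. After recovery, replication resumes from LSN 150.
snapshot = checkpoint({1: 100})
recovered = recover(snapshot, [(1, 120), (1, 135), (1, 150)])
```

The key property is that the on-disk state may be stale; correctness comes from replaying commit records on top of it, which is exactly the step Neon's pageserver has to reproduce.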

That way postgres always knows the remote LSN of the last applied record, and in case of a replication reconnect it can ask to continue replication from the proper point. For context: postgres replication is somewhat of a pull model -- it is up to the replica to reconnect and ask for data starting from some LSN.

How can we support it in Neon:

  1. There are already WAL records for creating and dropping replication origins (xl_replorigin_set / xl_replorigin_drop). No need to generate aux-file logical messages for repl. origins.
  2. We store all repl. origins in one more directory-like structure. It could be a sorted array of [[origin_id, lsn], ...] (denser) or a hashmap as before.
  3. Commit records serve as update records for repl. origins. We maintain an in-memory commit counter, and every 1k commits we write out the new state of repl. origins in case there were changes.
  4. On basebackup:
    • read the disk state of repl. origins
    • find the last 1k of commit records (IIUC they are now attached as updates to CLOG and we somewhat simulate a range scan?) and replay them to get the latest origin_lsn values
    • include up-to-date files in the basebackup archive
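The periodic-flush idea in steps 2-4 can be sketched as follows. This is a hypothetical model of the proposal, not pageserver code: `OriginDirectory`, `FLUSH_INTERVAL`, and the method names are all made up for illustration.

```python
# Sketch of the proposal above: repl. origin state is written out every 1k
# commits; at basebackup time the last written state is combined with the
# commit records that arrived since the last flush.

FLUSH_INTERVAL = 1000

class OriginDirectory:
    def __init__(self):
        self.stored = {}      # last flushed state: origin_id -> origin_lsn
        self.pending = []     # commit records since the last flush
        self.commit_count = 0

    def ingest_commit(self, origin_id, origin_lsn):
        self.pending.append((origin_id, origin_lsn))
        self.commit_count += 1
        if self.commit_count % FLUSH_INTERVAL == 0:
            # Write out the new state of repl. origins.
            for oid, lsn in self.pending:
                self.stored[oid] = lsn
            self.pending.clear()

    def state_for_basebackup(self):
        # Stored state + replay of the (at most 1k) unflushed commits.
        state = dict(self.stored)
        for oid, lsn in self.pending:
            state[oid] = lsn
        return state
```

The design trade-off is write amplification (flushing every 1k commits) versus basebackup latency (replaying at most 1k records).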

@knizhnik
Contributor

To achieve exactly once semantics for incoming logical replication postgres stamps each commit record with a 10-byte tuple of (origin_id, origin_lsn), where origin_id is two byte unique identifier of the replication source and origin_lsn is 8-byte lsn on the source side.

Formally speaking, that is not precisely true.
origin_id is not stored in the commit record's payload. It is carried as a special kind of block_id in the record header (alongside XLR_BLOCK_ID_DATA_SHORT and XLR_BLOCK_ID_DATA_LONG there is XLR_BLOCK_ID_ORIGIN).

The commit record may additionally include origin_lsn and origin_timestamp (16 bytes in total) if the XACT_XINFO_HAS_ORIGIN bit is set in xinfo. But that is not the main point.
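The optional origin payload described here can be sketched as a tiny parser. This is a simplified illustration, not the real record layout: a commit record carries many more fields, and the offsets here are invented; only the flag constant (bit 5, `XACT_XINFO_HAS_ORIGIN` in Postgres's xact.h) and the 8+8-byte origin_lsn/origin_timestamp pair come from the discussion.

```python
import struct

XACT_XINFO_HAS_ORIGIN = 1 << 5  # flag bit from Postgres xact.h

def parse_commit_origin(xinfo, payload):
    """If the origin flag is set in xinfo, the (simplified) payload carries
    origin_lsn (8 bytes) followed by origin_timestamp (8 bytes).
    Returns (origin_lsn, origin_timestamp) or None if absent."""
    if not (xinfo & XACT_XINFO_HAS_ORIGIN):
        return None
    origin_lsn, origin_ts = struct.unpack_from("<QQ", payload, 0)
    return origin_lsn, origin_ts
```

This is the field the pageserver needs to extract during WAL ingest: when the flag is absent, the commit carries no origin progress and can be skipped for origin tracking.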

@knizhnik
Contributor

knizhnik commented Mar 12, 2024

There are already wal records for creating and dropping replication origins (xl_replorigin_set / xl_replorigin_drop). No need to generate aux file logical messages for repl. origins.

Well, the presence of a WAL record doesn't remove the need for a key with which that WAL record will be associated. Please note that our storage is a key-value storage; for example, commit records are currently associated with CLOG pages.
AUX_FILE is such a key with which xl_replorigin_set / xl_replorigin_drop can be associated (right now they are stored not as WAL records but as a single-file entry, but in principle it is the same).

We store all repl. origins in one more directory-like structure. Could be sorted array of [[origin_id, lsn],...] (denser) or hashmap as before.

No problem with it. I have already implemented this part.

Commit records serve as update records for repl. origins. We maintain in-memory commit counter and each 1k commits we write out new state of repl. origins in case there were some changes.

Do you mean that we should add a REPL_ORIGIN key and associate commit records with it? That means we would need to write each commit in two places: the CLOG and the repl. origin key.
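The "write the commit in two places" concern can be made concrete with a toy ingest step. This is a hypothetical sketch, not walingest code: the dict-as-KV-store, the key tuples, and `ingest_commit` are invented; the 32768-transactions-per-CLOG-page granularity is illustrative.

```python
# Hypothetical WAL-ingest step: on each commit record, besides updating the
# CLOG page for the xid, also record origin progress under a REPL_ORIGIN key.

XACTS_PER_CLOG_PAGE = 32768  # illustrative CLOG granularity

def ingest_commit(kv, lsn, xid, origin_id=None, origin_lsn=None):
    clog_page = xid // XACTS_PER_CLOG_PAGE
    kv[("CLOG", clog_page, lsn)] = ("committed", xid)
    if origin_id is not None:
        # Second write: origin progress keyed by REPL_ORIGIN(origin_id).
        kv[("REPL_ORIGIN", origin_id, lsn)] = origin_lsn
```

The upside of the double write is that basebackup only needs a point lookup on the REPL_ORIGIN key rather than a scan over recent commits.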

On basebackup:
• read the disk state of repl. origins
• find the last 1k of commit records (IIUC they are now attached as updates to CLOG and we somewhat simulate a range scan?) and replay them to get the latest origin_lsn values
• include up-to-date files in the basebackup archive

Concerning your proposal (1-4):

There is no efficient way to locate the N last commits. Right now the original commit records are not stored in the KV storage at all; walingest stores its own CLOG update records.

But I do not understand why we would need to load the N last commits at all. If we introduce a REPL_ORIGIN(id) key, then we can associate repl. origin updates with this page, and we need to load just one value (the one preceding the basebackup LSN).

The question is how to find all origins. In Postgres each slot has its own origin and we just iterate through all slots. The PS knows nothing about replication slots, but it has an AUX file with the slot state. In principle it can get these files, parse them, and so obtain the array of origins. Alternatively, we can use a range scan to find all available REPL_ORIGIN(id) keys. The last approach seems less efficient but is easier to implement and doesn't require the PS to know the format of the replication slot state file.
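The range-scan alternative mentioned above can be sketched as follows. This is an illustrative model only: the dict-as-KV-store and key tuples are hypothetical, and a real pageserver scan would be bounded to the REPL_ORIGIN key range rather than iterating everything.

```python
# Sketch of the range-scan approach: enumerate all REPL_ORIGIN(id) entries
# visible at a given LSN and keep, per origin, the most recent origin_lsn,
# without parsing the replication-slot state file.

def scan_origins(kv, at_lsn):
    latest = {}  # origin_id -> (entry_lsn, origin_lsn)
    for (kind, origin_id, lsn), origin_lsn in kv.items():
        if kind == "REPL_ORIGIN" and lsn <= at_lsn:
            prev = latest.get(origin_id)
            if prev is None or lsn > prev[0]:
                latest[origin_id] = (lsn, origin_lsn)
    return {oid: v[1] for oid, v in latest.items()}
```

Parsing the AUX slot-state file instead would turn this into a handful of point lookups, at the cost of teaching the PS the slot file format.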

knizhnik pushed a commit that referenced this issue Mar 12, 2024
Store logical replication origin in KV storage
knizhnik pushed a commit that referenced this issue May 8, 2024
Store logical replication origin in KV storage
knizhnik pushed a commit that referenced this issue May 9, 2024
Store logical replication origin in KV storage
knizhnik pushed a commit that referenced this issue Jun 1, 2024
Store logical replication origin in KV storage
knizhnik pushed a commit that referenced this issue Jun 3, 2024
Store logical replication origin in KV storage
knizhnik added a commit that referenced this issue Jun 3, 2024
Store logical replication origin in KV storage

## Problem

See  #6977

## Summary of changes

* Extract origin_lsn from the commit WAL record
* Add ReplOrigin key to KV storage and store origin_lsn
* In basebackup replace snapshot origin_lsn with last committed
origin_lsn at basebackup LSN


---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
@hlinnaka
Contributor

This was fixed by #7099

@stepashka stepashka added the c/compute Component: compute, excluding postgres itself label Jun 21, 2024