Consume fewer XIDs when restarting primary
The pageserver tracks the latest XID seen in the WAL, in the nextXid
field in the "checkpoint" key-value pair. To reduce the churn on that
single storage key, it's not tracked exactly. Rather, when we advance
it, we always advance it to the next multiple of 1024 XIDs. That way,
we only need to insert a new checkpoint value to the storage every
1024 transactions.

However, read-only replicas now scan the WAL at startup to find any
XIDs that haven't been explicitly aborted or committed, and treat
them as still in-progress (PR #7288). When we bump up the nextXid
counter by 1024, all those skipped XIDs look like in-progress XIDs to a
read replica. There's a limited amount of space for tracking
in-progress XIDs, so skipping XIDs is now more costly. We had a
case in production where a read replica did not start up, because the
primary had gone through many restart cycles without writing any
running-xacts or checkpoint WAL records, and each restart added almost
1024 "orphaned" XIDs that had to be tracked as in-progress in the
replica. As soon as the primary writes a running-xacts or checkpoint
record, the orphaned XIDs can be removed from the in-progress XIDs
list and the problem resolves, but if those records are not written,
the orphaned XIDs just accumulate.

We should work harder to make sure that a running-xacts or checkpoint
record is written at primary startup or shutdown. But at the same
time, we can just make XID_CHECKPOINT_INTERVAL smaller, to consume
fewer XIDs in such scenarios. That means that we will generate more
versions of the checkpoint key-value pair in the storage, but we
haven't seen any problems with that so it's probably fine to go from
1024 to 128.
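
The round-up behavior described above can be sketched as a standalone function. This is a hypothetical `round_up_next_xid` helper written for illustration, not the actual implementation: in the pageserver, the equivalent logic lives in `update_next_xid` on the checkpoint struct, which also handles persisting the new value.

```rust
// After this commit, the interval is 128 (was 1024).
const XID_CHECKPOINT_INTERVAL: u32 = 128;

/// Round `xid` up to the *next* XID_CHECKPOINT_INTERVAL boundary.
/// Note that a value already on a boundary is bumped to the following
/// boundary: the argument is the highest XID seen so far, and the
/// stored value must be the next XID to hand out.
fn round_up_next_xid(xid: u32) -> u32 {
    (xid / XID_CHECKPOINT_INTERVAL + 1) * XID_CHECKPOINT_INTERVAL
}

fn main() {
    assert_eq!(round_up_next_xid(100), 128);
    assert_eq!(round_up_next_xid(127), 128);
    // Boundary value bumps to the next boundary, matching the test change below.
    assert_eq!(round_up_next_xid(128), 256);
    println!("ok");
}
```

With the old interval of 1024, each primary restart that consumed even one XID without writing a running-xacts record could orphan up to ~1023 XIDs; at 128 the worst case per restart shrinks by a factor of eight.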
hlinnaka committed Jul 5, 2024
1 parent c9fd8d7 commit 62b1e07
Showing 2 changed files with 9 additions and 9 deletions.
2 changes: 1 addition & 1 deletion libs/postgres_ffi/src/xlog_utils.rs
@@ -55,7 +55,7 @@ pub const SIZE_OF_XLOG_RECORD_DATA_HEADER_SHORT: usize = 1 * 2;
 /// metadata checkpoint only once per XID_CHECKPOINT_INTERVAL transactions.
 /// XID_CHECKPOINT_INTERVAL should not be larger than BLCKSZ*CLOG_XACTS_PER_BYTE
 /// in order to let CLOG_TRUNCATE mechanism correctly extend CLOG.
-const XID_CHECKPOINT_INTERVAL: u32 = 1024;
+const XID_CHECKPOINT_INTERVAL: u32 = 128;
 
 pub fn XLogSegmentsPerXLogId(wal_segsz_bytes: usize) -> XLogSegNo {
     (0x100000000u64 / wal_segsz_bytes as u64) as XLogSegNo
16 changes: 8 additions & 8 deletions libs/postgres_ffi/wal_craft/src/xlog_utils_test.rs
@@ -187,19 +187,19 @@ pub fn test_update_next_xid() {
     // The input XID gets rounded up to the next XID_CHECKPOINT_INTERVAL
     // boundary
     checkpoint.update_next_xid(100);
-    assert_eq!(checkpoint.nextXid.value, 1024);
+    assert_eq!(checkpoint.nextXid.value, 128);
 
     // No change
-    checkpoint.update_next_xid(500);
-    assert_eq!(checkpoint.nextXid.value, 1024);
-    checkpoint.update_next_xid(1023);
-    assert_eq!(checkpoint.nextXid.value, 1024);
+    checkpoint.update_next_xid(100);
+    assert_eq!(checkpoint.nextXid.value, 128);
+    checkpoint.update_next_xid(127);
+    assert_eq!(checkpoint.nextXid.value, 128);
 
     // The function returns the *next* XID, given the highest XID seen so
-    // far. So when we pass 1024, the nextXid gets bumped up to the next
+    // far. So when we pass 128, the nextXid gets bumped up to the next
     // XID_CHECKPOINT_INTERVAL boundary.
-    checkpoint.update_next_xid(1024);
-    assert_eq!(checkpoint.nextXid.value, 2048);
+    checkpoint.update_next_xid(128);
+    assert_eq!(checkpoint.nextXid.value, 256);
 }
 
 #[test]
