Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On-demand SLRU download: invalid request with request LSN 0/5000000 and not_modified_since 0/5000028 #8030

Closed
hlinnaka opened this issue Jun 12, 2024 · 0 comments
Assignees
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug

Comments

@hlinnaka
Copy link
Contributor

Steps to reproduce

Run test_import_at_2biltest with lazy_slru_download enabled:

diff --git a/test_runner/regress/test_next_xid.py b/test_runner/regress/test_next_xid.py
index 98fb06a0d..2b7f29d2c 100644
--- a/test_runner/regress/test_next_xid.py
+++ b/test_runner/regress/test_next_xid.py
@@ -97,7 +97,12 @@ def test_import_at_2bil(
     tenant = TenantId.generate()
     timeline = TimelineId.generate()
 
-    env.pageserver.tenant_create(tenant)
+    tenant_conf = {
+        "lazy_slru_download": True,
+        # set PITR interval to be small, so we can do GC
+        "pitr_interval": "0 s",
+    }
+    env.pageserver.tenant_create(tenant_id, conf=tenant_conf)
 
     # Take basebackup
     basebackup_dir = os.path.join(test_output_dir, "basebackup")

Expected result

Test runs succesfully

Actual result

Test fails:

ERROR test_runner/regress/test_next_xid.py::test_import_at_2bil[debug-pg16] - AssertionError: First log error on pageserver_1: (134, '2024-06-12T12:54:40.509145Z ERROR page_service_conn_main{peer_addr=[::1]:59670}:process_query{ten...

Pageserver log:

2024-06-12T12:59:59.304913Z ERROR page_service_conn_main{peer_addr=[::1]:47526}:process_query{tenant_id=bb45c11cfdbe21ea437526d22a9caadc timeline_id=406b862cba8395ba746decf305587f34}:handle_pagerequests:handle_get_slru_segment_request{kind=0 segno=2031 req_lsn=0/5000000}: error reading relation or page version: Bad request: invalid request with request LSN 0/5000000 and not_modified_since 0/5000028

Logs & initial investigation

Compute log:

PG:2024-06-12 12:59:59.302 GMT [466371] LOG:  starting PostgreSQL 16.3 (b228f20372ebcabfd7946647cb7adbd38bacb14a) on x86_64-pc-linux-gnu, compiled by gcc (Deb
ian 12.2.0-14) 12.2.0, 64-bit
PG:2024-06-12 12:59:59.302 GMT [466371] LOG:  listening on IPv4 address "127.0.0.1", port 15010
PG:2024-06-12 12:59:59.302 GMT [466371] LOG:  listening on Unix socket "/tmp/.s.PGSQL.15010"
PG:2024-06-12 12:59:59.303 GMT [466376] LOG:  [ZENITH] found 'zenith.signal' file. setting prev LSN to 0/4000100
PG:2024-06-12 12:59:59.303 GMT [466376] LOG:  database system was shut down at 2024-06-12 12:59:57 GMT
PG:2024-06-12 12:59:59.303 GMT [466376] LOG:  starting with zenith basebackup at LSN 0/5000028, prev 0/4000100
PG:2024-06-12 12:59:59.303 GMT [466376] WARNING:  database with OID 1 must be vacuumed within 17564365 transactions
PG:2024-06-12 12:59:59.303 GMT [466376] HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.​   You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
PG:2024-06-12 12:59:59.303 GMT [466376] LOG:  Continue writing WAL at 0/5000028
PG:2024-06-12 12:59:59.304 GMT [466376] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:15005' with protocol version 2
PG:2024-06-12 12:59:59.305 GMT [466376] FATAL:  [NEON_SMGR] could not read SLRU 0 segment 2031 at lsn 0/05000000
PG:2024-06-12 12:59:59.305 GMT [466376] DETAIL:  page server returned error: Bad request: invalid request with request LSN 0/5000000 and not_modified_since 0/5000028
PG:2024-06-12 12:59:59.306 GMT [466371] LOG:  startup process (PID 466376) exited with exit code 1
PG:2024-06-12 12:59:59.306 GMT [466371] LOG:  aborting startup due to startup process failure
PG:2024-06-12 12:59:59.306 GMT [466371] LOG:  database system is shut down

(The WARNING is expected with this test)

WAL dump:

$ tar xvf ./test_output/test_import_at_2bil\[debug-pg16\]/basebackup/pg_wal.tar 
000000010000000000000004
$ ./pg_install/v16/bin/pg_waldump 00000001000000000000000
pg_waldump: error: could not locate WAL file "00000001000000000000000"
$ ./pg_install/v16/bin/pg_waldump 000000010000000000000004 
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/04000028, prev 0/03DFBB58, desc: RUNNING_XACTS nextXid 2129920004 latestCompletedXid 2129920003 oldestRunningXid 2129920004
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/04000060, prev 0/04000028, desc: CHECKPOINT_ONLINE redo 0/4000028; tli 1; prev tli 1; fpw true; xid 0:2129920004; oid 24576; multi 1; offset 0; oldest xid 722 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 2129920004; online
rmgr: XLOG        len (rec/tot):     34/    34, tx:          0, lsn: 0/040000D8, prev 0/04000060, desc: BACKUP_END 0/4000028
rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 0/04000100, prev 0/040000D8, desc: SWITCH 

backup_manifest:

$ tail test_output/test_import_at_2bil\[debug-pg16\]/basebackup/backup_manifest 
{ "Path": "pg_xact/07EF", "Size": 73728, "Last-Modified": "2024-06-12 12:59:57 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "7850d756" },
{ "Path": "pg_multixact/offsets/0000", "Size": 8192, "Last-Modified": "2024-06-12 12:59:57 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "23464490" },
{ "Path": "pg_multixact/members/0000", "Size": 8192, "Last-Modified": "2024-06-12 12:59:54 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "23464490" },
{ "Path": "PG_VERSION", "Size": 3, "Last-Modified": "2024-06-12 12:59:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "13dca016" },
{ "Path": "global/pg_control", "Size": 8192, "Last-Modified": "2024-06-12 12:59:57 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "43872087" }
],
"WAL-Ranges": [
{ "Timeline": 1, "Start-LSN": "0/4000028", "End-LSN": "0/4000100" }
],
"Manifest-Checksum": "a9764e2cadb9c18df9ca72078ec742a11bb2e2cde48388975f3f81078ec12b3c"}

The request LSN is 0/5000000 and not_modified_since is 0/5000028. That's bogus,so you get an error; request LSN should always be >= not_modified since. However, those two values are effectively the same. 0/5000000 points to the beginning of the WAL page, while 0/5000028 points to just after the page header.

When called at standby startup, before it has replayed any records, neon_read_slru_segment function calls GetRedoStartLsn() to get the not-modified-since LSN, and nm_adjust_lsn(GetRedoStartLsn()) to get the request LSN.

@hlinnaka hlinnaka added t/bug Issue Type: Bug c/cloud/compute labels Jun 12, 2024
@hlinnaka hlinnaka self-assigned this Jun 12, 2024
hlinnaka added a commit that referenced this issue Jun 12, 2024
If a standby is started right after switching to a new WAL segment,
the request in the SLRU download request would point to the beginning
of the segment (e.g. 0/5000000), while the not-modified-since LSN
would point to just after the page header (e.g. 0/5000028). It's
effectively the same position, as there cannot be any WAL records in
between, but the pageserver rightly errors out on any request where
the request LSN < not-modified since LSN.

To fix, round down the not-modified since LSN to the beginning of the
page like the request LSN.

Fixes issue #8030
hlinnaka added a commit that referenced this issue Jun 12, 2024
If a standby is started right after switching to a new WAL segment,
the request in the SLRU download request would point to the beginning
of the segment (e.g. 0/5000000), while the not-modified-since LSN
would point to just after the page header (e.g. 0/5000028). It's
effectively the same position, as there cannot be any WAL records in
between, but the pageserver rightly errors out on any request where
the request LSN < not-modified since LSN.

To fix, round down the not-modified since LSN to the beginning of the
page like the request LSN.

Fixes issue #8030
hlinnaka added a commit that referenced this issue Jun 12, 2024
If a standby is started right after switching to a new WAL segment,
the request in the SLRU download request would point to the beginning
of the segment (e.g. 0/5000000), while the not-modified-since LSN
would point to just after the page header (e.g. 0/5000028). It's
effectively the same position, as there cannot be any WAL records in
between, but the pageserver rightly errors out on any request where
the request LSN < not-modified since LSN.

To fix, round down the not-modified since LSN to the beginning of the
page like the request LSN.

Fixes issue #8030
hlinnaka added a commit that referenced this issue Jun 12, 2024
…ry (#8031)

If a standby is started right after switching to a new WAL segment, the
request in the SLRU download request would point to the beginning of the
segment (e.g. 0/5000000), while the not-modified-since LSN would point
to just after the page header (e.g. 0/5000028). It's effectively the
same position, as there cannot be any WAL records in between, but the
pageserver rightly errors out on any request where the request LSN <
not-modified since LSN.

To fix, round down the not-modified since LSN to the beginning of the
page like the request LSN.

Fixes issue #8030
save-buffer pushed a commit that referenced this issue Jun 17, 2024
…ry (#8031)

If a standby is started right after switching to a new WAL segment, the
request in the SLRU download request would point to the beginning of the
segment (e.g. 0/5000000), while the not-modified-since LSN would point
to just after the page header (e.g. 0/5000028). It's effectively the
same position, as there cannot be any WAL records in between, but the
pageserver rightly errors out on any request where the request LSN <
not-modified since LSN.

To fix, round down the not-modified since LSN to the beginning of the
page like the request LSN.

Fixes issue #8030
@stepashka stepashka added the c/compute Component: compute, excluding postgres itself label Jun 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

2 participants