Skip to content

Commit

Permalink
Always use Lsn::MAX as the request LSN in the primary (#7708)
Browse files Browse the repository at this point in the history
The new protocol version supports sending two LSNs to the pageserver:
request LSN and a "not_modified_since" hint. A primary always wants to
read the latest version of each page, so having two values was not
strictly necessary, and the old protocol worked fine with just the
"not_modified_since" LSN and a flag to request the latest page
version. Nevertheless, it seemed like a good idea to set the request
LSN to the current insert/flush LSN, because that's logically the page
version that the primary wants to read.

However, that made the test_gc_aggressive test case flaky. When the
primary requests a page with the last inserted or flushed LSN, it's
possible that by the time that the pageserver processes the request,
more WAL has been generated by other processes in the compute and
already digested by the pageserver. Furthermore, if the PITR horizon
in the pageserver is set to 0, and GC runs during that window, it's
possible that the GC horizon has advances past the request LSN, before
the pageserver processes the request. It is still correct to send the
latest page version in that case, because the compute either has the
page locked so the it cannot have been modified in the primary, or if
it's a prefetch request, and we will validate the LSNs when the
prefetch response is processed and discard it if the page has been
modified. But the pageserver doesn't know that and rightly complains.

To fix, modify the compute so that the primary always uses Lsn::MAX in
the requests. This reverts the primary's behavior to how the protocol
version 1 worked. In protocol version 1, there was only one LSN, the
"not_modified_since" hint, and a flag was set to read the latest page
version, whatever that might be. Requests from computes that are still
using protocol version 1 were already mapped to Lsn::MAX in the
pageserver, now we do the same with protocol version 2 for primary's
requests. (I'm a bit sad about losing the information in the
pageserver, what the last LSN was at the time that the request wa
made. We never had it with protocol version 1, but I wanted to make it
available for debugging purposes.)

Add another field, 'effective_request_lsn', to track what the flush
LSN was when the request was made. It's not sent to the pageserver,
Lsn::MAX is now used as the request LSN, but it's still needed
internally in the compute to track the validity of prefetch requests.

Fixes issue #7692
  • Loading branch information
hlinnaka committed May 14, 2024
1 parent ba20752 commit 22afaea
Show file tree
Hide file tree
Showing 3 changed files with 92 additions and 29 deletions.
12 changes: 12 additions & 0 deletions pgxn/neon/pagestore_client.h
Original file line number Diff line number Diff line change
Expand Up @@ -255,6 +255,18 @@ typedef struct
* the pageserver, as long as 'not_modified_since' has arrived.
*/
XLogRecPtr not_modified_since;

/*
* 'effective_request_lsn' is not included in the request that's sent to
* the pageserver, but is used to keep track of the latest LSN of when the
* request was made. In a standby server, this is always the same as the
* 'request_lsn', but in the primary we use UINT64_MAX as the
* 'request_lsn' to request the latest page version, so we need this
* separate field to remember that latest LSN was when the request was
* made. It's needed to manage prefetch request, to verify if the response
* to a prefetched request is still valid.
*/
XLogRecPtr effective_request_lsn;
} neon_request_lsns;

#if PG_MAJORVERSION_NUM < 16
Expand Down
95 changes: 66 additions & 29 deletions pgxn/neon/pagestore_smgr.c
Original file line number Diff line number Diff line change
Expand Up @@ -356,7 +356,7 @@ compact_prefetch_buffers(void)
source_slot->response = NULL;
source_slot->my_ring_index = 0;
source_slot->request_lsns = (neon_request_lsns) {
InvalidXLogRecPtr, InvalidXLogRecPtr
InvalidXLogRecPtr, InvalidXLogRecPtr, InvalidXLogRecPtr
};

/* update bookkeeping */
Expand Down Expand Up @@ -1529,6 +1529,7 @@ neon_get_request_lsns(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno)
/* Request the page at the last replayed LSN. */
result.request_lsn = GetXLogReplayRecPtr(NULL);
result.not_modified_since = last_written_lsn;
result.effective_request_lsn = result.request_lsn;
Assert(last_written_lsn <= result.request_lsn);

neon_log(DEBUG1, "neon_get_request_lsns request lsn %X/%X, not_modified_since %X/%X",
Expand Down Expand Up @@ -1570,15 +1571,30 @@ neon_get_request_lsns(NRelFileInfo rinfo, ForkNumber forknum, BlockNumber blkno)
}

/*
* Request the latest version of the page. The most up-to-date request
* LSN we could use would be the current insert LSN, but to avoid the
* overhead of looking it up, use 'flushlsn' instead. This relies on
* the assumption that if the page was modified since the last WAL
* flush, it should still be in the buffer cache, and we wouldn't be
* requesting it.
* Request the very latest version of the page. In principle we
* want to read the page at the current insert LSN, and we could
* use that value in the request. However, there's a corner case
* with pageserver's garbage collection. If the GC horizon is
* set to a very small value, it's possible that by the time
* that the pageserver processes our request, the GC horizon has
* already moved past the LSN we calculate here. Standby servers
* always have that problem as the can always lag behind the
* primary, but for the primary we can avoid it by always
* requesting the latest page, by setting request LSN to
* UINT64_MAX.
*
* Remember the current LSN, however, so that we can later
* correctly determine if the response to the request is still
* valid. The most up-to-date LSN we could use for that purpose
* would be the current insert LSN, but to avoid the overhead of
* looking it up, use 'flushlsn' instead. This relies on the
* assumption that if the page was modified since the last WAL
* flush, it should still be in the buffer cache, and we
* wouldn't be requesting it.
*/
result.request_lsn = flushlsn;
result.request_lsn = UINT64_MAX;
result.not_modified_since = last_written_lsn;
result.effective_request_lsn = flushlsn;
}

return result;
Expand All @@ -1596,7 +1612,11 @@ neon_prefetch_response_usable(neon_request_lsns request_lsns,
{
/* sanity check the LSN's on the old and the new request */
Assert(request_lsns.request_lsn >= request_lsns.not_modified_since);
Assert(request_lsns.effective_request_lsn >= request_lsns.not_modified_since);
Assert(request_lsns.effective_request_lsn <= request_lsns.request_lsn);
Assert(slot->request_lsns.request_lsn >= slot->request_lsns.not_modified_since);
Assert(slot->request_lsns.effective_request_lsn >= slot->request_lsns.not_modified_since);
Assert(slot->request_lsns.effective_request_lsn <= slot->request_lsns.request_lsn);
Assert(slot->status != PRFS_UNUSED);

/*
Expand All @@ -1614,27 +1634,40 @@ neon_prefetch_response_usable(neon_request_lsns request_lsns,
* calculate LSNs "out of order" with each other, but the prefetch queue
* is backend-private at the moment.)
*/
if (request_lsns.request_lsn < slot->request_lsns.request_lsn ||
if (request_lsns.effective_request_lsn < slot->request_lsns.effective_request_lsn ||
request_lsns.not_modified_since < slot->request_lsns.not_modified_since)
{
ereport(LOG,
(errcode(ERRCODE_IO_ERROR),
errmsg(NEON_TAG "request with unexpected LSN after prefetch"),
errdetail("Request %X/%X not_modified_since %X/%X, prefetch %X/%X not_modified_since %X/%X)",
LSN_FORMAT_ARGS(request_lsns.request_lsn), LSN_FORMAT_ARGS(request_lsns.not_modified_since),
LSN_FORMAT_ARGS(slot->request_lsns.request_lsn), LSN_FORMAT_ARGS(slot->request_lsns.not_modified_since))));
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn),
LSN_FORMAT_ARGS(request_lsns.not_modified_since),
LSN_FORMAT_ARGS(slot->request_lsns.effective_request_lsn),
LSN_FORMAT_ARGS(slot->request_lsns.not_modified_since))));
return false;
}

/*---
* Each request to the pageserver carries two LSN values:
* `not_modified_since` and `request_lsn`. The (not_modified_since,
* request_lsn] range of each request is effectively a claim that the page
* has not been modified between those LSNs. If the range of the old
* request in the queue overlaps with the new request, we know that the
* page hasn't been modified in the union of the ranges. We can use the
* response to old request to satisfy the new request in that case. For
* example:
* Each request to the pageserver has three LSN values associated with it:
* `not_modified_since`, `request_lsn`, and 'effective_request_lsn'.
* `not_modified_since` and `request_lsn` are sent to the pageserver, but
* in the primary node, we always use UINT64_MAX as the `request_lsn`, so
* we remember `effective_request_lsn` separately. In a primary,
* `effective_request_lsn` is the last flush WAL position when the request
* was sent to the pageserver. That's logically the LSN that we are
* requesting the page at, but we send UINT64_MAX to the pageserver so
* that if the GC horizon advances past that position, we still get a
* valid response instead of an error.
*
* To determine whether a response to a GetPage request issued earlier is
* still valid to satisfy a new page read, we look at the
* (not_modified_since, effective_request_lsn] range of the request. It is
* effectively a claim that the page has not been modified between those
* LSNs. If the range of the old request in the queue overlaps with the
* new request, we know that the page hasn't been modified in the union of
* the ranges. We can use the response to old request to satisfy the new
* request in that case. For example:
*
* 100 500
* Old request: +--------+
Expand Down Expand Up @@ -1663,9 +1696,9 @@ neon_prefetch_response_usable(neon_request_lsns request_lsns,
*/

/* this follows from the checks above */
Assert(request_lsns.request_lsn >= slot->request_lsns.not_modified_since);
Assert(request_lsns.effective_request_lsn >= slot->request_lsns.not_modified_since);

return request_lsns.not_modified_since <= slot->request_lsns.request_lsn;
return request_lsns.not_modified_since <= slot->request_lsns.effective_request_lsn;
}

/*
Expand Down Expand Up @@ -1757,7 +1790,7 @@ neon_exists(SMgrRelation reln, ForkNumber forkNum)
errmsg(NEON_TAG "could not read relation existence of rel %u/%u/%u.%u from page server at lsn %X/%08X",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forkNum,
LSN_FORMAT_ARGS(request_lsns.request_lsn)),
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn)),
errdetail("page server returned error: %s",
((NeonErrorResponse *) resp)->message)));
break;
Expand Down Expand Up @@ -2296,7 +2329,7 @@ neon_read_at_lsn(NRelFileInfo rinfo, ForkNumber forkNum, BlockNumber blkno,
slot->shard_no, blkno,
RelFileInfoFmt(rinfo),
forkNum,
LSN_FORMAT_ARGS(request_lsns.request_lsn)),
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn)),
errdetail("page server returned error: %s",
((NeonErrorResponse *) resp)->message)));
break;
Expand Down Expand Up @@ -2566,7 +2599,7 @@ neon_nblocks(SMgrRelation reln, ForkNumber forknum)
errmsg(NEON_TAG "could not read relation size of rel %u/%u/%u.%u from page server at lsn %X/%08X",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum,
LSN_FORMAT_ARGS(request_lsns.request_lsn)),
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn)),
errdetail("page server returned error: %s",
((NeonErrorResponse *) resp)->message)));
break;
Expand All @@ -2579,7 +2612,7 @@ neon_nblocks(SMgrRelation reln, ForkNumber forknum)
neon_log(SmgrTrace, "neon_nblocks: rel %u/%u/%u fork %u (request LSN %X/%08X): %u blocks",
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum,
LSN_FORMAT_ARGS(request_lsns.request_lsn),
LSN_FORMAT_ARGS(request_lsns.effective_request_lsn),
n_blocks);

pfree(resp);
Expand Down Expand Up @@ -2619,7 +2652,7 @@ neon_dbsize(Oid dbNode)
ereport(ERROR,
(errcode(ERRCODE_IO_ERROR),
errmsg(NEON_TAG "could not read db size of db %u from page server at lsn %X/%08X",
dbNode, LSN_FORMAT_ARGS(request_lsns.request_lsn)),
dbNode, LSN_FORMAT_ARGS(request_lsns.effective_request_lsn)),
errdetail("page server returned error: %s",
((NeonErrorResponse *) resp)->message)));
break;
Expand All @@ -2629,7 +2662,7 @@ neon_dbsize(Oid dbNode)
}

neon_log(SmgrTrace, "neon_dbsize: db %u (request LSN %X/%08X): %ld bytes",
dbNode, LSN_FORMAT_ARGS(request_lsns.request_lsn), db_size);
dbNode, LSN_FORMAT_ARGS(request_lsns.effective_request_lsn), db_size);

pfree(resp);
return db_size;
Expand Down Expand Up @@ -2874,6 +2907,10 @@ neon_read_slru_segment(SMgrRelation reln, const char* path, int segno, void* buf
XLogRecPtr request_lsn,
not_modified_since;

/*
* Compute a request LSN to use, similar to neon_get_request_lsns() but the
* logic is a bit simpler.
*/
if (RecoveryInProgress())
{
request_lsn = GetXLogReplayRecPtr(NULL);
Expand All @@ -2885,10 +2922,10 @@ neon_read_slru_segment(SMgrRelation reln, const char* path, int segno, void* buf
*/
request_lsn = GetRedoStartLsn();
}
request_lsn = nm_adjust_lsn(request_lsn);
}
else
request_lsn = GetXLogInsertRecPtr();
request_lsn = nm_adjust_lsn(request_lsn);
request_lsn = UINT64_MAX;

/*
* GetRedoStartLsn() returns LSN of basebackup. We know that the SLRU
Expand Down
14 changes: 14 additions & 0 deletions pgxn/neon_test_utils/neontest.c
Original file line number Diff line number Diff line change
Expand Up @@ -312,6 +312,13 @@ get_raw_page_at_lsn(PG_FUNCTION_ARGS)

request_lsns.request_lsn = PG_ARGISNULL(3) ? GetXLogInsertRecPtr() : PG_GETARG_LSN(3);
request_lsns.not_modified_since = PG_ARGISNULL(4) ? request_lsns.request_lsn : PG_GETARG_LSN(4);
/*
* For the time being, use the same LSN for request and
* effective request LSN. If any test needed to use UINT64_MAX
* as the request LSN, we'd need to add effective_request_lsn
* as a new argument.
*/
request_lsns.effective_request_lsn = request_lsns.request_lsn;

if (!superuser())
ereport(ERROR,
Expand Down Expand Up @@ -419,6 +426,13 @@ get_raw_page_at_lsn_ex(PG_FUNCTION_ARGS)

request_lsns.request_lsn = PG_ARGISNULL(5) ? GetXLogInsertRecPtr() : PG_GETARG_LSN(5);
request_lsns.not_modified_since = PG_ARGISNULL(6) ? request_lsns.request_lsn : PG_GETARG_LSN(6);
/*
* For the time being, use the same LSN for request
* and effective request LSN. If any test needed to
* use UINT64_MAX as the request LSN, we'd need to add
* effective_request_lsn as a new argument.
*/
request_lsns.effective_request_lsn = request_lsns.request_lsn;

SET_VARSIZE(raw_page, BLCKSZ + VARHDRSZ);
raw_page_data = VARDATA(raw_page);
Expand Down

1 comment on commit 22afaea

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3141 tests run: 2995 passed, 0 failed, 146 skipped (full report)


Code coverage* (full report)

  • functions: 31.4% (6339 of 20183 functions)
  • lines: 47.4% (47965 of 101226 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
22afaea at 2024-05-14T08:02:57.141Z :recycle:

Please sign in to comment.