
feat(compute_ctl): add periodic lease lsn requests for static computes #7994

Open · wants to merge 22 commits into main
Conversation

prepor
Contributor

@prepor prepor commented Jun 7, 2024

Part of #7497

Problem

Static computes pinned at a fixed LSN may initially be created within the PITR interval but eventually fall outside of it. To make sure that static computes are not affected by GC, we need to start using the LSN lease API (introduced in #8084) in compute_ctl.

Summary of changes

compute_ctl

  • Spawn a thread when a static compute starts that periodically pings the pageserver(s) with LSN lease requests.
  • Add test_readonly_node_gc to verify that a static compute can read all pages without error.
    • (test will fail on main without the code change here)
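A minimal sketch of what such a background lease thread could look like. All names here (`spawn_lsn_lease_task`, `request_lsn_lease`) are illustrative stand-ins, not the real compute_ctl API:

```rust
// Hypothetical sketch of the compute_ctl side; these names are
// illustrative, not the real compute_ctl API.
use std::thread;
use std::time::Duration;

fn spawn_lsn_lease_task(lsn: u64) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        match request_lsn_lease(lsn) {
            // Renew well before the lease expires (here: half the validity).
            Ok(valid_for) => thread::sleep(valid_for / 2),
            Err(e) => {
                eprintln!("lsn lease request failed: {e}");
                thread::sleep(Duration::from_secs(1));
            }
        }
    })
}

// Stand-in for the real pageserver round trip; the actual implementation
// talks to every shard of the tenant.
fn request_lsn_lease(_lsn: u64) -> Result<Duration, String> {
    Ok(Duration::from_secs(600))
}
```

Renewing at half the validity leaves headroom for a slow or restarting pageserver before the lease actually expires.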

page_service

  • wait_or_get_last_lsn will now allow request_lsn less than latest_gc_cutoff_lsn to proceed if there is a lease on request_lsn.
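The lease-aware cutoff check can be sketched roughly like this; the types and names are hypothetical, not the actual pageserver internals:

```rust
// Illustrative sketch only; types and names are hypothetical, not the
// actual pageserver internals.
use std::collections::BTreeMap;
use std::time::SystemTime;

type Lsn = u64;

/// A read below the GC cutoff is allowed only if a lease exists at
/// exactly that LSN: leases pin a single LSN, not a range.
fn may_read_at(
    request_lsn: Lsn,
    latest_gc_cutoff_lsn: Lsn,
    leases: &BTreeMap<Lsn, SystemTime>,
) -> bool {
    request_lsn >= latest_gc_cutoff_lsn || leases.contains_key(&request_lsn)
}
```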

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@prepor prepor requested review from a team as code owners June 7, 2024 11:52

github-actions bot commented Jun 7, 2024

2120 tests run: 2051 passed, 0 failed, 69 skipped (full report)


Flaky tests (1)

Postgres 16

Code coverage* (full report)

  • functions: 32.4% (7164 of 22144 functions)
  • lines: 50.4% (57920 of 115011 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
3c988d6 at 2024-08-12T15:30:12.809Z :recycle:

@skyzh skyzh requested a review from yliang412 June 7, 2024 18:33
@prepor prepor requested a review from a team as a code owner June 13, 2024 08:52
@prepor prepor requested a review from petuhovskiy June 13, 2024 08:52
compute_tools/src/compute.rs (two outdated review threads, resolved)
@jcsp
Contributor

jcsp commented Jun 14, 2024

Testing:

  • test_ondemand_download_replica failed with a message from this new code -- but I don't see that test setting an explicit LSN? Pretty weird, seems like it might be activating incorrectly somehow?
  • Do we already have a test that sets up a static endpoint, which would exercise the code in this PR?

compute_tools/src/compute.rs (three outdated review threads, resolved)
@prepor
Contributor Author

prepor commented Jun 17, 2024

@jcsp about the test_ondemand_download_replica test: I think it's a bug in the test; it actually creates a static endpoint instead of a replica. https://github.com/neondatabase/neon/blob/main/control_plane/src/bin/neon_local.rs#L816

For hot_standby=false (the default) it creates a Static endpoint, and it doesn't make sense to provide an LSN if you expect a Replica.

here is the test

# Start standby at this point in time

@prepor
Contributor Author

prepor commented Jun 19, 2024

@hlinnaka could you please check my assumption that the test doesn't do what is expected? #7994 (comment)

@hlinnaka
Contributor

It's indeed testing on-demand SLRU download in a static endpoint, rather than a replica that would follow the primary. That's intentional, but I agree the naming is misleading.

Is a "static endpoint" a "replica"? Or is it a replica only if it follows the primary? Our terminology is not very well defined.

And it would be good to also test on-demand SLRU download in a hot standby replica that follows the primary. We don't have a test for that currently.

@yliang412
Contributor

More investigation on failed tests:

  • test_readonly_node tried to create an endpoint before the GC cutoff. Besides the basebackup failure, the lsn lease request would fail as well. We should add it to allowed errors:
    env.pageserver.allowed_errors.extend(
        [
            ".*basebackup .* failed: invalid basebackup lsn.*",
            ".*page_service.*error obtaining lsn lease.*.*tried to request a page version that was garbage collected",
        ]
    )
  • test_ondemand_download_replica failed for the shard=4 case with an error saying NotFound("Tenant <tenant_id> not found"). Is there anything specific to sharding that could cause this problem?

yliang412 added a commit that referenced this pull request Jul 8, 2024
…8254)

## Problem

LSN leases, introduced in #8084, are a new API that is shard-aware from
day 1. To support ephemeral endpoints in #7994 without linking the
Postgres C API against `compute_ctl`, part of the sharding logic needs
to reside in `utils`.

## Summary of changes

- Create a new `shard` module in the utils crate.
- Move the more interface-related parts of the tenant sharding API to
utils and re-export them in pageserver_api.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
yliang412 and others added 2 commits July 8, 2024 14:48
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
skyzh pushed a commit that referenced this pull request Jul 15, 2024
@yliang412 yliang412 marked this pull request as draft July 31, 2024 13:18
yliang412 and others added 2 commits July 31, 2024 09:22
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
yliang412 and others added 2 commits August 1, 2024 17:11
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
@yliang412 yliang412 marked this pull request as ready for review August 2, 2024 12:56
@ololobus ololobus self-requested a review August 2, 2024 13:18
        });
    }
    tracing::info!(
        "requesting a leased lsn {} below gc cutoff {}",
Member
so every read request will print a line of info?

Member
...and having a read lock here seems costly, not sure if it is a good idea to change latest_gc_cutoff_lsn Rcu into latest_gc_lease_cutoff_lsn?

Contributor
> ...and having a read lock here seems costly, not sure if it is a good idea to change latest_gc_cutoff_lsn Rcu into latest_gc_lease_cutoff_lsn?

Are you talking about the gc_info read lock? We could do latest_gc_lease_cutoff_lsn, but we don't know what its value should be. It cannot be the lowest leased LSN because we don't know if everything between latest_gc_cutoff_lsn and latest_gc_lease_cutoff_lsn is valid.

Contributor
info log removal: 3c988d6

Member
Like with the logging, we will only inspect the read lock when we are actually below the gc cutoff, which is already rare, so I don't think taking the read lock will be significant.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
thread::spawn(move || {
    if let Err(e) = lsn_lease_bg_task(compute, lsn) {
        // TODO: might need stronger error feedback than logging a warning.
        warn!("lsn_lease_bg_task failed: {e}");
Member

@koivunej koivunej Aug 12, 2024
So initially like this, but surely we should kill the compute eventually? Perhaps something to consider: until we do kill it, it might be good to keep retrying the lease.

Member
Posted some similar thoughts below

Member

@ololobus ololobus left a comment

LGTM overall, but left some comments to consider. Let me know what you think


let spec = state.pspec.as_ref().expect("spec must be set");

let configs = postgres_configs_from_state(&state);
Member
The list of pageservers is dynamic and can be reconfigured, so it's right that we refresh it. Yet we do it before acquire_lsn_lease_with_retry, which I think makes it almost useless.

Imagine this case:

  1. We fetch the list of shards.
  2. It gets changed.
  3. In acquire_lsn_lease_with_retry we hit the offline pageserver, so retries won't help. We will exhaust MAX_ATTEMPTS and exit the thread with an error.

I think it'd be better to refresh the list of shards/configs before each attempt rather than once per lease iteration.

What do you think?
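The restructuring suggested above could look roughly like this. All names (`fetch_shard_configs`, `request_lease`, the backoff constants) are illustrative stand-ins, not the actual compute_ctl code:

```rust
// Sketch of the suggested restructuring: refresh the shard configs on
// every attempt so a reconfiguration between attempts gets picked up.
// All names are illustrative stand-ins.
use std::time::Duration;

const MAX_ATTEMPTS: u32 = 10;

fn acquire_lsn_lease_with_retry(lsn: u64) -> Result<(), String> {
    for attempt in 0..MAX_ATTEMPTS {
        // Refreshed inside the loop, not once up front.
        let configs = fetch_shard_configs();
        if configs.iter().all(|c| request_lease(c, lsn).is_ok()) {
            return Ok(());
        }
        // Exponential backoff between attempts, capped at 100 * 2^6 ms.
        std::thread::sleep(Duration::from_millis(100 * 2u64.pow(attempt.min(6))));
    }
    Err("exhausted lease attempts".into())
}

// Stand-ins for the real shard discovery and lease request.
fn fetch_shard_configs() -> Vec<String> {
    vec!["shard-0".into()]
}

fn request_lease(_config: &str, _lsn: u64) -> Result<(), String> {
    Ok(())
}
```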

Member
And a NIT on code structure: the logic from postgres_configs_from_state can just be inlined into acquire_lsn_lease_with_retry

        .max(valid_duration / 2);

    info!(
        "lsn_lease_request succeeded, sleeping for {} seconds",
Member
NIT: lsn_lease_request here and in the message above looks like a func/method name, but it isn't

    lsn: Lsn,
) -> Result<SystemTime> {
    let mut client = config.connect(NoTls)?;
    let cmd = format!("lease lsn {} {} {} ", tenant_shard_id, timeline_id, lsn);
Member
Is there anything in the pageserver lease lsn API that can signal the caller that it's impossible to obtain the lease?

This is actually a broader question about the retry mechanism. If I calculate it right, 10 attempts with the current backoff take slightly under 1 min, and we have seen some pageservers being unresponsive for more than 1 min due to restarts in the past, iirc.

I'd be much more confident in simpler yet persistent logic: just catch and retry any error indefinitely, BUT only if it's not a permanent error, i.e. we already raced with GC, so acquiring the lease is impossible.
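The "retry indefinitely unless permanent" policy described here could be sketched as follows; the error classification and names are illustrative, not the real compute_ctl types:

```rust
// Sketch of a "retry indefinitely unless permanent" policy; the error
// classification and names are illustrative.
use std::time::Duration;

enum LeaseError {
    Transient(String), // pageserver restarting, network blip, ...
    Permanent(String), // lease refused: the LSN was already garbage-collected
}

fn lease_loop(mut try_lease: impl FnMut() -> Result<(), LeaseError>) -> Result<(), String> {
    loop {
        match try_lease() {
            Ok(()) => return Ok(()),
            // We raced with GC; no amount of retrying can succeed.
            Err(LeaseError::Permanent(e)) => return Err(e),
            // Everything else is retried indefinitely after a pause.
            Err(LeaseError::Transient(_)) => std::thread::sleep(Duration::from_millis(10)),
        }
    }
}
```

The key design point is that only the pageserver can tell the two cases apart, so the lease API would need to surface "impossible" distinctly from "try again later".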

Comment on lines +26 to +27
    ComputeMode::Static(lsn) => lsn,
    _ => return,
Member
I started writing this to propose adding a compute feature flag and using it here:
https://github.com/neondatabase/neon/blob/6949b45e1795816507f5025a474e15d718e07456/libs/compute_api/src/spec.rs#L34C23-L34C37

But then I realized that currently the control plane doesn't start static endpoints at all, and there is a separate feature flag, static_ephemeral_endpoints, to start using them:
https://github.com/neondatabase/cloud/blob/cfdadee070fa3503048ef7242da50f26bed1c4b0/goapp/internal/dto/account_settings.go#L146

This means that when we enable it, your code will kick in, but when it's disabled, it shouldn't affect any other workload.

So I'm leaving this just for context; maybe someone didn't know about these flags :)

8 participants