
feat(compute_ctl): add periodic lease lsn requests for static computes #7994

Open · wants to merge 22 commits into main
Conversation

prepor
Contributor

@prepor prepor commented Jun 7, 2024

Part of #7497

Problem

Static computes pinned at a fixed LSN may initially be created within the PITR interval but eventually fall outside of it. To make sure that static computes are not affected by GC, we need to start using the LSN lease API (introduced in #8084) in compute_ctl.

Summary of changes

compute_ctl

  • Spawn a thread when a static compute starts that periodically pings the pageserver(s) with LSN lease requests.
  • Add test_readonly_node_gc to verify that a static compute can read all pages without error.
    • (test will fail on main without the code change here)
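A minimal sketch of what such a background lease thread could look like. All names here (`spawn_lsn_lease_task`, `request_lsn_lease`) are illustrative stand-ins, not the real compute_ctl API:

```rust
// Hypothetical sketch of the compute_ctl side; these names are
// illustrative, not the real compute_ctl API.
use std::thread;
use std::time::Duration;

fn spawn_lsn_lease_task(lsn: u64) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        match request_lsn_lease(lsn) {
            // Renew well before the lease expires (here: half the validity).
            Ok(valid_for) => thread::sleep(valid_for / 2),
            Err(e) => {
                eprintln!("lsn lease request failed: {e}");
                thread::sleep(Duration::from_secs(1));
            }
        }
    })
}

// Stand-in for the real pageserver round trip; the actual implementation
// talks to every shard of the tenant.
fn request_lsn_lease(_lsn: u64) -> Result<Duration, String> {
    Ok(Duration::from_secs(600))
}
```

Renewing at half the validity leaves headroom for a slow or restarting pageserver before the lease actually expires.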

page_service

  • wait_or_get_last_lsn will now allow request_lsn less than latest_gc_cutoff_lsn to proceed if there is a lease on request_lsn.
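The lease-aware cutoff check can be sketched roughly like this; the types and names are hypothetical, not the actual pageserver internals:

```rust
// Illustrative sketch only; types and names are hypothetical, not the
// actual pageserver internals.
use std::collections::BTreeMap;
use std::time::SystemTime;

type Lsn = u64;

/// A read below the GC cutoff is allowed only if a lease exists at
/// exactly that LSN: leases pin a single LSN, not a range.
fn may_read_at(
    request_lsn: Lsn,
    latest_gc_cutoff_lsn: Lsn,
    leases: &BTreeMap<Lsn, SystemTime>,
) -> bool {
    request_lsn >= latest_gc_cutoff_lsn || leases.contains_key(&request_lsn)
}
```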

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@prepor prepor requested review from a team as code owners June 7, 2024 11:52

github-actions bot commented Jun 7, 2024

2120 tests run: 2051 passed, 0 failed, 69 skipped (full report)


Flaky tests (1)

Postgres 16

Code coverage* (full report)

  • functions: 32.4% (7164 of 22144 functions)
  • lines: 50.4% (57920 of 115011 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
3c988d6 at 2024-08-12T15:30:12.809Z :recycle:

@skyzh skyzh requested a review from yliang412 June 7, 2024 18:33
@prepor prepor requested a review from a team as a code owner June 13, 2024 08:52
@prepor prepor requested a review from petuhovskiy June 13, 2024 08:52
compute_tools/src/compute.rs (two outdated review threads, resolved)
@jcsp
Contributor

jcsp commented Jun 14, 2024

Testing:

  • test_ondemand_download_replica failed with a message from this new code -- but I don't see that test setting an explicit LSN? Pretty weird, seems like it might be activating incorrectly somehow?
  • Do we already have a test that sets up a static endpoint, which would exercise the code in this PR?

compute_tools/src/compute.rs (three outdated review threads, resolved)
@prepor
Contributor Author

prepor commented Jun 17, 2024

@jcsp about the test_ondemand_download_replica test: I think it's a bug in the test; it actually creates a static endpoint instead of a replica. https://github.com/neondatabase/neon/blob/main/control_plane/src/bin/neon_local.rs#L816

For hot_standby=false (the default) it creates a Static endpoint, and it doesn't make sense to provide an LSN if you expect a Replica.

here is the test

# Start standby at this point in time

@prepor
Contributor Author

prepor commented Jun 19, 2024

@hlinnaka could you please check my assumption that the test doesn't do what is expected? #7994 (comment)

@hlinnaka
Contributor

It's indeed testing on-demand SLRU download in a static endpoint, rather than a replica that would follow the primary. That's intentional, but I agree the naming is misleading.

Is a "static endpoint" a "replica"? Or is it a replica only if it follows the primary? Our terminology is not very well defined.

And it would be good to also test on-demand SLRU download in a hot standby replica that follows the primary. We don't have a test for that currently.

@yliang412
Contributor

More investigation on failed tests:

  • test_readonly_node tried to create an endpoint before the GC cutoff. Besides the basebackup failure, the lsn lease request would fail as well. We should add it to allowed errors:
    env.pageserver.allowed_errors.extend(
        [
            ".*basebackup .* failed: invalid basebackup lsn.*",
            ".*page_service.*error obtaining lsn lease.*.*tried to request a page version that was garbage collected",
        ]
    )
  • test_ondemand_download_replica failed for the shard=4 case with an error saying NotFound("Tenant <tenant_id> not found"). Is there anything specific to sharding that could cause this problem?

yliang412 added a commit that referenced this pull request Jul 8, 2024
…8254)

## Problem

LSN leases, introduced in #8084, are a new API that is shard-aware from
day 1. To support ephemeral endpoints in #7994 without linking the
Postgres C API against `compute_ctl`, part of the sharding logic needs
to reside in `utils`.

## Summary of changes

- Create a new `shard` module in the utils crate.
- Move the more interface-related parts of the tenant sharding API to
utils and re-export them in pageserver_api.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
yliang412 and others added 2 commits July 8, 2024 14:48
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
skyzh pushed a commit that referenced this pull request Jul 15, 2024
@yliang412 yliang412 marked this pull request as draft July 31, 2024 13:18
yliang412 and others added 2 commits July 31, 2024 09:22
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
yliang412 and others added 2 commits August 1, 2024 17:11
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
@yliang412 yliang412 marked this pull request as ready for review August 2, 2024 12:56
@ololobus ololobus self-requested a review August 2, 2024 13:18
        });
    }
    tracing::info!(
        "requesting a leased lsn {} below gc cutoff {}",
Member
so every read request will print a line of info?

Member
...and having a read lock here seems costly, not sure if it is a good idea to change latest_gc_cutoff_lsn Rcu into latest_gc_lease_cutoff_lsn?

Contributor
> ...and having a read lock here seems costly, not sure if it is a good idea to change latest_gc_cutoff_lsn Rcu into latest_gc_lease_cutoff_lsn?

Are you talking about the gc_info read lock? We could do latest_gc_lease_cutoff_lsn, but we don't know what its value should be. It cannot be the lowest leased LSN because we don't know if everything between latest_gc_cutoff_lsn and latest_gc_lease_cutoff_lsn is valid.

Contributor
info log removal: 3c988d6

Member
Like with the logging, we will only inspect the read lock when we are actually below the gc cutoff, which is already rare, so I don't think taking the read lock will be significant.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
thread::spawn(move || {
    if let Err(e) = lsn_lease_bg_task(compute, lsn) {
        // TODO: might need stronger error feedback than logging a warning.
        warn!("lsn_lease_bg_task failed: {e}");
Member

@koivunej koivunej Aug 12, 2024
So initially like this, but surely we should kill the compute eventually? Perhaps something to consider: until we do kill it, it might be good to keep retrying the lease.

Member
Posted some similar thoughts below

Member

@ololobus ololobus left a comment

LGTM overall, but left some comments to consider. Let me know what you think


let spec = state.pspec.as_ref().expect("spec must be set");

let configs = postgres_configs_from_state(&state);
Member
The list of pageservers is dynamic and can be reconfigured, so it's right that we refresh it. Yet we do it before acquire_lsn_lease_with_retry, which I think makes it almost useless.

Imagine this case:

  1. We fetch the list of shards.
  2. It gets changed.
  3. In acquire_lsn_lease_with_retry we hit the offline pageserver, so retries won't help. We will exhaust MAX_ATTEMPTS and exit the thread with an error.

I think it'd be better to refresh the list of shards/configs before each attempt rather than once per lease iteration.

What do you think?
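The restructuring suggested above could look roughly like this. All names (`fetch_shard_configs`, `request_lease`, the backoff constants) are illustrative stand-ins, not the actual compute_ctl code:

```rust
// Sketch of the suggested restructuring: refresh the shard configs on
// every attempt so a reconfiguration between attempts gets picked up.
// All names are illustrative stand-ins.
use std::time::Duration;

const MAX_ATTEMPTS: u32 = 10;

fn acquire_lsn_lease_with_retry(lsn: u64) -> Result<(), String> {
    for attempt in 0..MAX_ATTEMPTS {
        // Refreshed inside the loop, not once up front.
        let configs = fetch_shard_configs();
        if configs.iter().all(|c| request_lease(c, lsn).is_ok()) {
            return Ok(());
        }
        // Exponential backoff between attempts, capped at 100 * 2^6 ms.
        std::thread::sleep(Duration::from_millis(100 * 2u64.pow(attempt.min(6))));
    }
    Err("exhausted lease attempts".into())
}

// Stand-ins for the real shard discovery and lease request.
fn fetch_shard_configs() -> Vec<String> {
    vec!["shard-0".into()]
}

fn request_lease(_config: &str, _lsn: u64) -> Result<(), String> {
    Ok(())
}
```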

Member
And a NIT on code structure: the logic from postgres_configs_from_state can just be inlined into acquire_lsn_lease_with_retry

        .max(valid_duration / 2);

    info!(
        "lsn_lease_request succeeded, sleeping for {} seconds",
Member
NIT: lsn_lease_request here and in the message above looks like a func/method name, but it isn't

    lsn: Lsn,
) -> Result<SystemTime> {
    let mut client = config.connect(NoTls)?;
    let cmd = format!("lease lsn {} {} {} ", tenant_shard_id, timeline_id, lsn);
Member
Is there anything in the pageserver lease lsn API that can signal the caller that it's impossible to obtain the lease?

This is actually a broader question about the retry mechanism. If I calculate it right, 10 attempts with the current backoff take slightly under 1 min, and we have seen some pageservers being unresponsive for more than 1 min due to restarts in the past, iirc.

I'd be much more confident in simpler yet persistent logic: just catch and retry any error indefinitely, BUT only if it's not a permanent error, i.e. we already raced with GC, so acquiring the lease is impossible.
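The "retry indefinitely unless permanent" policy described here could be sketched as follows; the error classification and names are illustrative, not the real compute_ctl types:

```rust
// Sketch of a "retry indefinitely unless permanent" policy; the error
// classification and names are illustrative.
use std::time::Duration;

enum LeaseError {
    Transient(String), // pageserver restarting, network blip, ...
    Permanent(String), // lease refused: the LSN was already garbage-collected
}

fn lease_loop(mut try_lease: impl FnMut() -> Result<(), LeaseError>) -> Result<(), String> {
    loop {
        match try_lease() {
            Ok(()) => return Ok(()),
            // We raced with GC; no amount of retrying can succeed.
            Err(LeaseError::Permanent(e)) => return Err(e),
            // Everything else is retried indefinitely after a pause.
            Err(LeaseError::Transient(_)) => std::thread::sleep(Duration::from_millis(10)),
        }
    }
}
```

The key design point is that only the pageserver can tell the two cases apart, so the lease API would need to surface "impossible" distinctly from "try again later".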

Comment on lines +26 to +27
    ComputeMode::Static(lsn) => lsn,
    _ => return,
Member
I started writing this to propose adding a compute feature flag and using it here:
https://github.com/neondatabase/neon/blob/6949b45e1795816507f5025a474e15d718e07456/libs/compute_api/src/spec.rs#L34C23-L34C37

But then I realized that currently the control plane doesn't start static endpoints at all, and there is a separate feature flag, static_ephemeral_endpoints, to start using them:
https://github.com/neondatabase/cloud/blob/cfdadee070fa3503048ef7242da50f26bed1c4b0/goapp/internal/dto/account_settings.go#L146

This means that when we enable it, your code will kick in, but when it's disabled, it shouldn't affect any other workload.

So I'm leaving this just for context; maybe someone didn't know about these flags :)

8 participants