Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

attach-time tenant config #4255

Merged
merged 21 commits into from
May 24, 2023
Merged

Conversation

problame
Copy link
Contributor

@problame problame commented May 16, 2023

This PR adds support for supplying the tenant config upon /attach.

Before this change, when relocating a tenant using /detach and /attach, the tenant config after /attach would be the default config from pageserver.toml.
That is undesirable for settings such as the PITR-interval: if the tenant's config on the source was 30 days and the default config on the attach-side is 7 days, then the first GC run would eradicate 23 days worth of PITR capability.

The API change is backwards-compatible: if the body is empty, we continue to use the default config.
We'll remove that capability as soon as the cloud.git code is updated to use attach-time tenant config (#4282 keeps track of this).

unblocks https://github.com/neondatabase/cloud/issues/5092
fixes #1555
part of #886 (Tenant Relocation)

Implementation

The preliminary PRs for this work were (most-recent to least-recent)

…cessing

With this patch, the attach handler now follows the same pattern as
tenant create with regards to instantiation of the new tenant:

1. Prepare on-disk state using `create_tenant_files`.
2. Use the same code path as pageserver startup to load it
   into memory and start background loops
   (`schedule_local_tenant_processing`).

It's a bit sad we can't use the
`PageServerConfig::tenant_attaching_mark_file_path`
method inside `create_tenant_files` because it operates
in a temporary directory. However, it's a small price to pay for the
gained simplicity.

During implementation, I noticed that we don't handle
failures post `create_tenant_files` well. I left TODO
comments in the code linking to the issue that I created
for this [^1].

Also, I'll dedupe the spawn_load and spawn_attach code
in a future commit.

[^1]: #4233
And make error message wording fit both create & attach use case.
This patch DRYs up the pageserver OpenAPI YAML's representation of
tenant config.

All the fields of tenant config are now located in a model schema
called TenantConfig.

The tenant create & config-change endpoints have separate schemas,
TenantCreateInfo and TenantConfigureArg, respectively.
These schemas inherit from TenantConfig, using allOf 1.

The tenant config-GET handler's response was previously named TenantConfig.
It's now named TenantConfigResponse.

None of these changes affect how the request looks on the wire.

The generated Go code will change for Console because the OpenAPI code
generator maps `allOf` to a Go struct embedding.
Luckily, usage of tenant config in Console is still very lightweigt,
but that will change in the near future.
So, this is a good chance to set things straight.

The console changes are tracked in
 neondatabase/cloud#5046
The only difference is that the TenantConfigRequest impl does

```
        tenant_conf.max_lsn_wal_lag = request_data.max_lsn_wal_lag;
        tenant_conf.trace_read_requests = request_data.trace_read_requests;
```

and the TenantCreateRequest impl does

```
        if let Some(max_lsn_wal_lag) = request_data.max_lsn_wal_lag {
            tenant_conf.max_lsn_wal_lag = Some(max_lsn_wal_lag);
        }
        if let Some(trace_read_requests) = request_data.trace_read_requests {
            tenant_conf.trace_read_requests = Some(trace_read_requests);
        }
```

As far as I can tell, these are identical.
@github-actions
Copy link

github-actions bot commented May 16, 2023

1016 tests run: 975 passed, 0 failed, 41 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_close_on_connections_exit[endpoint]: ✅ debug
The comment gets automatically updated with the latest test results
a1d9759 at 2023-05-22T14:44:41.381Z :recycle:

@problame
Copy link
Contributor Author

The code is not ready yet, but, @NanoBjorn and @ololobus , please review the proposed API spec change: https://github.com/neondatabase/neon/pull/4255/files#diff-9e123da50d2f96564abb1afd1c860db064259c5ade92953ce72a295db7843c13

Base automatically changed from problame/api-shared-tenant-config-model to main May 17, 2023 10:31
problame added a commit that referenced this pull request May 17, 2023
…PI (#4252)

This is prep for #4255

[1/X] OpenAPI: share a single definition of TenantConfig

DRYs up the pageserver OpenAPI YAML's representation of
tenant config.

All the fields of tenant config are now located in a model schema
called TenantConfig.

The tenant create & config-change endpoints have separate schemas,
TenantCreateInfo and TenantConfigureArg, respectively.
These schemas inherit from TenantConfig, using allOf 1.

The tenant config-GET handler's response was previously named
TenantConfig.
It's now named TenantConfigResponse.

None of these changes affect how the request looks on the wire.

The generated Go code will change for Console because the OpenAPI code
generator maps `allOf` to a Go struct embedding.
Luckily, usage of tenant config in Console is still very lightweigt,
but that will change in the near future.
So, this is a good chance to set things straight.

The console changes are tracked in
 neondatabase/cloud#5046

[2/x]: extract the tenant config parts of create & config requests

[3/x]: code movement: move TenantConfigRequestConfig next to
TenantCreateRequestConfig

[4/x] type-alias TenantConfigRequestConfig = TenantCreateRequestConfig;
They are exactly the same.

[5/x] switch to qualified use for tenant create/config request api
models

[6/x] rename models::TenantConfig{RequestConfig,} and remove the alias

[7/x] OpenAPI: sync tenant create & configure body names from Rust code

[8/x]: dedupe the two TryFrom<...> for TenantConfOpt impls
The only difference is that the TenantConfigRequest impl does

```
        tenant_conf.max_lsn_wal_lag = request_data.max_lsn_wal_lag;
        tenant_conf.trace_read_requests = request_data.trace_read_requests;
```

and the TenantCreateRequest impl does

```
        if let Some(max_lsn_wal_lag) = request_data.max_lsn_wal_lag {
            tenant_conf.max_lsn_wal_lag = Some(max_lsn_wal_lag);
        }
        if let Some(trace_read_requests) = request_data.trace_read_requests {
            tenant_conf.trace_read_requests = Some(trace_read_requests);
        }
```

As far as I can tell, these are identical.
skyzh added a commit that referenced this pull request May 19, 2023
This PR enforces that the tenant create / update-config APIs reject
requests with unknown fields.

This is a desirable property because some tenant config settings control
the lifetime of user data (e.g., GC horizon or PITR interval).

Suppose we inadvertently rename the `pitr_interval` field in the Rust
code.
Then, right now, a client that still uses the old name will send a
tenant config request to configure a new PITR interval.
Before this PR, we would accept such a request, ignore the old name
field, and use the pageserver.toml default value for what the new PITR
interval is.
With this PR, we will instead reject such a request.

One might argue that the client could simply check whether the config it
sent has been applied, using the `/v1/tenant/.../config` endpoint.
That is correct for tenant create and update-config.

But, attach will soon [^1] grow the ability to have attach-time config
as well.
If we ignore unknown fields and fall back to global defaults in that
case, we risk data loss.
Example:
1. Default PITR in pageservers is 7 days.
2. Create a tenant and set its PITR to 30 days.
3. For 30 days, fill the tenant continuously with data.
4. Detach the tenant.
5. Attach tenant.

Attach must use the 30-day PITR setting in this scenario.
If it were to fall back to the 7-day default value, we would lose 23
days of PITR capability for the tenant.

So, the PR that adds attach-time tenant config will build on the
(clunky) infrastructure added in this PR

[^1]: #4255

Implementation Notes
====================

This could have been a simple `#[serde(deny_unknown_fields)]` but sadly,
that is documented- but silent-at-compile-time-incompatible with
`#[serde(flatten)]`. But we are still using this by adding on outer struct and use unit tests to ensure it is correct.

`neon_local tenant config` now uses the `.remove()` pattern + bail if
there are leftover config args. That's in line with what
`neon_local tenant create` does. We should dedupe that logic in a future
PR.

---------

Signed-off-by: Alex Chi <iskyzh@gmail.com>
Co-authored-by: Alex Chi <iskyzh@gmail.com>
fails with

E       psycopg2.errors.InternalError_: Tenant a1fa9cc383e32ddafb73ff920de5f2e6 will not become active. Current state: Broken due to: Failed to parse config from file '/home/cs/src/neon/test_output/test_tenant_config[debug-pg14]/repo/tenants/a1fa9cc383e32ddafb73ff920de5f2e6/config' as pageserver config: configure option max_lsn_wal_lag is not a string. Backtrace:

So, not even the assertions added are necessary.
But let's have them anyway.
@problame problame changed the base branch from main to problame/fix-tenant-conf-max-wal-lag May 19, 2023 13:34
@problame problame self-assigned this May 19, 2023
@problame problame changed the title WIP: attach-time tenant config attach-time tenant config May 19, 2023
@problame problame marked this pull request as ready for review May 19, 2023 14:04
@problame problame requested review from a team as code owners May 19, 2023 14:04
@problame problame requested review from lubennikovaav and removed request for a team and lubennikovaav May 19, 2023 14:04
Copy link
Contributor

@LizardWizzard LizardWizzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some minor nitpicky commets

libs/pageserver_api/src/models.rs Show resolved Hide resolved
libs/utils/src/http/json.rs Show resolved Hide resolved
@problame
Copy link
Contributor Author

I'm waiting for e2e build failures to be resolved, see https://neondb.slack.com/archives/C03438W3FLZ/p1684747576440749 to

Base automatically changed from problame/fix-tenant-conf-max-wal-lag to main May 23, 2023 13:30
problame added a commit that referenced this pull request May 23, 2023
This patch fixes parsing of the `max_lsn_wal_lag` tenant config item.
We were incorrectly expecting a string before, but the type is a
NonZeroU64.

So, when setting it in the config, the (updated) test case would fail
with

```
 E       psycopg2.errors.InternalError_: Tenant a1fa9cc383e32ddafb73ff920de5f2e6 will not become active. Current state: Broken due to: Failed to parse config from file '.../repo/tenants/a1fa9cc383e32ddafb73ff920de5f2e6/config' as pageserver config: configure option max_lsn_wal_lag is not a string. Backtrace:
```

So, not even the assertions added are necessary.

The test coverage for tenant config is rather thin in general.
For example, the `test_tenant_conf.py` test doesn't cover all the
options.

I'll add a new regression test as part of attach-time-tenant-conf PR
#4255
@problame problame enabled auto-merge (squash) May 23, 2023 18:47
@problame problame merged commit df52587 into main May 24, 2023
@problame problame deleted the problame/attach-time-tenant-config-phase2 branch May 24, 2023 14:46
problame added a commit that referenced this pull request Sep 28, 2023
As of #4255
all newly attached tenants have a config file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix tenant config & tenant relocation integration
4 participants