Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release 2024-01-22 #6420

Merged
merged 37 commits into from
Jan 22, 2024
Merged

Release 2024-01-22 #6420

merged 37 commits into from
Jan 22, 2024

Conversation

vipvap
Copy link

@vipvap vipvap commented Jan 22, 2024

Release 2024-01-22

Please merge this PR using 'Create a merge commit'!

problame and others added 30 commits January 15, 2024 08:54
After PR #6325, when running without --runtime, we wouldn't wait for
start_work_barrier, causing the benchmark to not start at all.
## Problem


neondatabase/serverless#51 (comment)

## Summary of changes

1. When we have a db_error, use db_error.message() as the message.
2. include error position.
3. line should be a string (weird?)
4. `datatype` -> `dataType`
We've removed this input from the deploy-prod workflow.
Fixes #6343.

## Problem

PAGE_CACHE_ACQUIRE_PINNED_SLOT_TIME is used on hot path and it adds
noticeable latency to GetPage@LSN.

## Refs

https://discordapp.com/channels/1176467419317940276/1195022264115151001/1196370689268125716
## Problem

Using the same domain name () for serverless driver can help with
connection caching.
#6290

## Summary of changes

Relax SNI check.
The theme of the changes in this PR is that they're enablers for #6251
which are superficial struct/api changes.

This is a spinoff from #6251:
- Various APIs + clients thereof take TenantShardId rather than TenantId
- The creation API gets a ShardParameters member, which may be used to
configure shard count and stripe size. This enables the attachment
service to present a "virtual pageserver" creation endpoint that creates
multiple shards.
- The attachment service will use tenant size information to drive shard
splitting. Make a version of `TenantHistorySize` that is usable for
decoding these API responses.
- ComputeSpec includes a shard stripe size.
## Problem

In #5980 the page service connection handler gets a simple piece of
logic for finding the right Timeline: at connection time, it picks an
arbitrary Timeline, and then when handling individual page requests it
checks if the original timeline is the correct shard, and if not looks
one up.

This is pretty slow in the case where we have to go look up the other
timeline, because we take the big tenants manager lock.

## Summary of changes

- Add a `shard_timelines` map of ShardIndex to Timeline on the page
service connection handler
- When looking up a Timeline for a particular ShardIndex, consult
`shard_timelines` to avoid hitting the TenantsManager unless we really
need to.
- Re-work the CancellationToken handling, because the handler now holds
gateguards on multiple timelines, and so must respect cancellation of
_any_ timeline it has in its cache, not just the timeline related to the
request it is currently servicing.

---------

Co-authored-by: Vlad Lazar <vlad@neon.tech>
Follows #6123 

Closes: #5342

The approach here is to avoid using `Layer` from secondary tenants, and
instead make the eviction types (e.g. `EvictionCandidate`) have a
variant that carries a Layer for attached tenants, and a different
variant for secondary tenants.

Other changes:
- EvictionCandidate no longer carries a `Timeline`: this was only used
for providing a witness reference to remote timeline client.
- The types for returning eviction candidates are all in
disk_usage_eviction_task.rs now, whereas some of them were in
timeline.rs before.
- The EvictionCandidate type replaces LocalLayerInfoForDiskUsageEviction
type, which was basically the same thing.
This implements the `copy` operation for azure blobs, added to S3 by
#6091, and adds tests both to s3 and azure ensuring that the copy
operation works.
The remote_storage crate contains two copies of each test, one for azure
and one for S3. The repetition is not necessary and makes the tests more
prone to drift, so we remove it by moving the tests into a shared
module.

The module has a different name depending on where it is included, so
that each test still has "s3" or "azure" in its full path, allowing you
to just test the S3 test or just the azure tests.

Earlier PR that removed some duplication already: #6176

Fixes #6146.
Previously, if we:

1. created a new timeline B from a different timeline's A initdb
2. deleted timeline A

the initdb for timeline B would be gone, at least in a world where we
are deleting initdbs upon timeline deletion. This world is imminent
(#6226).

Therefore, if the pageserver is instructed to load the initdb from a
different timeline ID, copy it to the newly created timeline's directory
in S3. This ensures that we can disaster recover the new timeline as
well, regardless of whether the original timeline was deleted or not.

Part of #5282.
## Problem

The `/v1/tenant` listing API only applies to attached tenants.

For an external service to implement a global reconciliation of its list
of shards vs. what's on the pageserver, we need a full view of what's in
TenantManager, including secondary tenant locations, and InProgress
locations.

Dependency of #6251

## Summary of changes

- Add methods to Tenant and SecondaryTenant to reconstruct the
LocationConf used to create them.
- Add `GET /v1/location_config` API
## Problem

To test sharding, we need something to control it. We could write python
code for doing this from the test runner, but this wouldn't be usable
with neon_local run directly, and when we want to write tests with large
number of shards/tenants, Rust is a better fit efficiently handling all
the required state.

This service enables automated tests to easily get a system with
sharding/HA without the test itself having to set this all up by hand:
existing tests can be run against sharded tenants just by setting a
shard count when creating the tenant.

## Summary of changes

Attachment service was previously a map of TenantId->TenantState, where
the principal state stored for each tenant was the generation and the
last attached pageserver. This enabled it to serve the re-attach and
validate requests that the pageserver requires.

In this PR, the scope of the service is extended substantially to do
overall management of tenants in the pageserver, including
tenant/timeline creation, live migration, evacuation of offline
pageservers etc. This is done using synchronous code to make declarative
changes to the tenant's intended state (`TenantState.policy` and
`TenantState.intent`), which are then translated into calls into the
pageserver by the `Reconciler`.

Top level summary of modules within
`control_plane/attachment_service/src`:
- `tenant_state`: structure that represents one tenant shard.
- `service`: implements the main high level such as tenant/timeline
creation, marking a node offline, etc.
- `scheduler`: for operations that need to pick a pageserver for a
tenant, construct a scheduler and call into it.
- `compute_hook`: receive notifications when a tenant shard is attached
somewhere new. Once we have locations for all the shards in a tenant,
emit an update to postgres configuration via the neon_local `LocalEnv`.
- `http`: HTTP stubs. These mostly map to methods on `Service`, but are
separated for readability and so that it'll be easier to adapt if/when
we switch to another RPC layer.
- `node`: structure that describes a pageserver node. The most important
attribute of a node is its availability: marking a node offline causes
tenant shards to reschedule away from it.

This PR is a precursor to implementing the full sharding service for
prod (#6342). What's the difference between this and a production-ready
controller for pageservers?
- JSON file persistence to be replaced with a database
- Limited observability.
- No concurrency limits. Marking a pageserver offline will try and
migrate every tenant to a new pageserver concurrently, even if there are
thousands.
- Very simple scheduler that only knows to pick the pageserver with
fewest tenants, and place secondary locations on a different pageserver
than attached locations: it does not try to place shards for the same
tenant on different pageservers. This matters little in tests, because
picking the least-used pageserver usually results in round-robin
placement.
- Scheduler state is rebuilt exhaustively for each operation that
requires a scheduler.
- Relies on neon_local mechanisms for updating postgres: in production
this would be something that flows through the real control plane.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem


Currently relation hash size is limited by "neon.relsize_hash_size" GUC
with default value 64k.
64k relations is not so small number... but it is enough to create 376
databases to exhaust it.

## Summary of changes

Use LRU replacement algorithm to prevent hash overflow

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
I just failed to see this earlier on #6136. layer counts are used as an
abstraction, and each of the two tenants lose proportionally about the
same amount of layers. sadly there is no difference in between
`relative_spare` and `relative_equal` as both of these end up evicting
the exact same amount of layers, but I'll try to add later another test
for those.

Cc: #5304
## Problem

tenant_id/timeline_id is no longer a full identifier for metrics from a
`Tenant` or `Timeline` object.

Closes: #5953

## Summary of changes

Include `shard_id` label everywhere we have `tenant_id`/`timeline_id`
label.
- Start pgbouncer in VM from postgres user, to allow connection to
pgbouncer admin console.
- Remove unused compute_ctl options --pgbouncer-connstr
and --pgbouncer-ini-path.
- Fix and cleanup code of connection to pgbouncer, add retries
because pgbouncer may not be instantly ready when compute_ctl starts.
## Problem

Use [NEON_SMGR] for all log messages produced by neon extension.

## Summary of changes

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
…bsolute (#6384)

With testing the new eviction order there is a problem of all of the
(currently rare) disk usage based evictions being rare and unique; this
PR adds a human readable summary of what absolute order would had done
and what the relative order does. Assumption is that these loggings will
make the few evictions runs in staging more useful.

Cc: #5304 for allowing testing in the staging
Before this patch, pagebench get-page-latest-lsn would sometimes cause
noisy errors in pageserver log about `CopyFail` protocol message.

refs #6390
In
7f82889
we changed the logic for persisting control_files. Previously it was
updated if `peer_horizon_lsn` jumped more than one segment, which made
`peer_horizon_lsn` initialized on disk as soon as safekeeper has
received a first `AppendRequest`.

This caused an issue with `truncateLsn`, which now can be zero
sometimes. This PR fixes it, and now `truncateLsn/peer_horizon_lsn` can
never be zero once we know `timeline_start_lsn`.

Closes #6248
## Problem

For PRs with `run-benchmarks` label, we don't upload results to the db,
making it harder to debug such tests. The only way to see some
numbers is by examining GitHub Action output which is really
inconvenient.
This PR adds zenbenchmark metrics to Allure reports.

## Summary of changes
- Create a json file with zenbenchmark results and attach it to allure
report
In the most straightforward way; safekeeper performs it in DELETE endpoint
implementation, with no coordination between sks.

delete_force endpoint in the code is renamed to delete as there is only one way
to delete.
…#6388)

From #6037 on, until this patch, if the client opens the connection but
doesn't send a `PagestreamFeMessage` within the first 10ms, we'd close
the connection because `self.timeline_cancelled()` returns.
It returns because `self.shard_timelines` is still empty at that point:
it gets filled lazily within the handlers for the incoming messages.

Changes
-------

The question is: if we can't check for timeline cancellation, what else
do we need to be cancellable for? `tenant.cancel` is also a bad choice
because the `tenant` (shard) we pick at the top of handle_pagerequests
might indeed go away over the course of the connection lifetime, but
other shards may still be there.

The correct solution, I think, is to be responsive to task_mgr
cancellation, because the connection handler runs in a task_mgr task and
it is already the current canonical way how we shut down a tenant's /
timelin's page_service connections (see `Tenant::shutdown` /
`Timeline::shutdown`).

So, rename the function and make it sensitive to task_mgr cancellation.
## Problem

Some fields were missed in the initial spec.

## Summary of changes

Adds a success boolean (defaults to false unless specifically marked as
successful).
Adds a duration_us integer that tracks how many microseconds were taken
from session start through to request completion.
## Problem

Name in notifications is not compatible with console name.

## Summary of changes

Rename fields to make it compatible.
configures nextest to kill tests after 1 minute. slow period is set to
20s which is how long our tests currently take in total, there will be 2
warnings and then the test will be killed and it's output logged.

Cc: #6361
Cc: #6368 -- likely this will be enough for longer time, but it will be
counter productive when we want to attach and debug; the added line
would have to be commented out.
update_next_xid() doesn't have any special treatment for the invalid or
other special XIDs, so it will treat InvalidTransactionId (0) as a
regular XID. If old nextXid is smaller than 2^31, 0 will look like a
very old XID, and nothing happens. But if nextXid is greater than 2^31 0
will look like a very new XID, and update_next_xid() will incorrectly
bump up nextXID.
khanova and others added 3 commits January 20, 2024 16:14
## Problem

Currently we store in cache even if the project is undefined. That makes
invalidation impossible.

## Summary of changes

Do not store if project id is empty.
## Problem

In #6283 I did a couple changes
that weren't directly related to the goal of extracting the state
machine, so I'm putting them here

## Summary of changes

- move postgres vs console provider into another enum
- reduce error cases for link auth
- slightly refactor link flow
## Problem

See https://neondb.slack.com/archives/C06F5UJH601/p1705731304237889

Adding 1 to xid in `update_next_xid` can cause overflow in debug mode.
0xffffffff is valid transaction ID.

## Summary of changes

Use `wrapping_add` 

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
@vipvap vipvap requested review from a team as code owners January 22, 2024 06:01
@vipvap vipvap requested review from arssher, khanova, conradludgate, jcsp and mtyazici and removed request for a team January 22, 2024 06:01
Copy link

github-actions bot commented Jan 22, 2024

2268 tests run: 2178 passed, 0 failed, 90 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_statvfs_pressure_usage: debug

Postgres 15

  • test_secondary_mode_eviction: release

Code coverage (full report)

  • functions: 54.6% (10598 of 19424 functions)
  • lines: 81.5% (60659 of 74431 lines)

The comment gets automatically updated with the latest test results
90e689a at 2024-01-22T17:04:08.146Z :recycle:

## Problem

https://rustsec.org/advisories/RUSTSEC-2024-0006

## Summary of changes

`cargo update -p shlex`

(cherry picked from commit 5559b16)
@jcsp
Copy link
Collaborator

jcsp commented Jan 22, 2024

The test failure is #6385, retrying

khanova and others added 3 commits January 22, 2024 14:39
## Problem

Gc currently doesn't work properly.

## Summary of changes

Change statement on running gc.
Before this patch, the select! still retured immediately if `futs` was
empty. Must have tested a stale build in my manual testing of #6388.

(cherry picked from commit 15c0df4)
## Problem

When a tenant is in Attaching state, and waiting for the
`concurrent_tenant_warmup` semaphore, it also listens for the tenant
cancellation token. When that token fires, Tenant::attach drops out.
Meanwhile, Tenant::set_stopping waits forever for the tenant to exit
Attaching state.

Fixes: #6423

## Summary of changes

- In the absence of a valid state for the tenant, it is set to Broken in
this path. A more elegant solution will require more refactoring, beyond
this minimal fix.

(cherry picked from commit 93572a3)
@jcsp jcsp merged commit a1a74ee into release Jan 22, 2024
46 checks passed
@jcsp jcsp deleted the releases/2024-01-22 branch January 22, 2024 17:24
@danieltprice
Copy link
Contributor

Reviewed for Friday changelog. The only item for the changelog appears to be the serverless driver error messaging improvement. Please let me know if I missed something that should be mentioned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.