Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proxy release 2024-07-04 #8258

Closed
wants to merge 49 commits into from
Closed

Proxy release 2024-07-04 #8258

wants to merge 49 commits into from

Conversation

vipvap
Copy link

@vipvap vipvap commented Jul 4, 2024

Proxy release 2024-07-04

Please merge this Pull Request using 'Create a merge commit' button

VladLazar and others added 30 commits June 27, 2024 11:56
## Problem
The fill requirement was not taken into account when looking through the
shards of a given node to fill from.

## Summary of Changes
Ensure that we do not fill a node past the recommendation from
`Scheduler::compute_fill_requirement`.
- Add --safekeepers option to neon_local reconfigure
- Add it to python Endpoint reconfigure
- Implement config reload in walproposer by restarting the whole bgw when
  safekeeper list changes.

ref #6341
## Problem
The cache keys that we use on CI are the same for X64 and ARM64
(`runner.arch`)

## Summary of changes
- Include `runner.arch` along with `runner.os` into cache keys
## Problem
Background node operations take a long time for loaded nodes.

## Summary of changes
Increase number of concurrent reconciles an operation is allowed to
spawn.
This should make drain and fill operations faster and the new value is
still well below the total limit of concurrent reconciles.
While investigating Pageserver logs from the cases where systemd hangs
during shutdown (neondatabase/cloud#11387), I
noticed that even if Pageserver shuts down cleanly[^1], there are
lingering walredo processes.

[^1]: Meaning, pageserver finishes its shutdown procedure and calls
`exit(0)` on its own terms, instead of hitting the systemd unit's
`TimeoutSec=` limit and getting SIGKILLed.

While systemd should never lock up like it does, maybe we can avoid
hitting that bug by cleaning up properly.

Changes
-------

This PR adds a shutdown method to `WalRedoManager` and hooks it up to
tenant shutdown.

We keep track of intent to shutdown through the new `enum
ProcessOnceCell` stored inside the pre-existing `redo_process` field.
A gate is added to keep track of running processes, using the new type
`struct Process`.

Future Work
-----------

Requests that don't need the redo process will not observe the shutdown
(see doc comment).
Doing so would be nice for completeness sake, but doesn't provide much
benefit because `Tenant` and `Timeline` already shut down all walredo
users.

Testing
-------


I did manual testing to confirm that the problem exists before this PR
and that it's gone after.
Setup:
* `neon_local` with a single tenant, create some data using `pgbench`
* ensure walredo process is running, not pid
* watch `strace -e kill,wait4 -f -p "$(pgrep pageserver)"`
* `neon_local pageserver stop`

With this PR, we always observe

```
$ strace -e kill,wait4 -f -p "$(pgrep pageserver)"
...
[pid 591120] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=591215, si_uid=1000} ---
[pid 591134] kill(591174, SIGKILL)      = 0
[pid 591134] wait4(591174,  <unfinished ...>
[pid 591142] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=591174, si_uid=1000, si_status=SIGKILL, si_utime=0, si_stime=0} ---
[pid 591134] <... wait4 resumed>[{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, NULL) = 591174
...
+++ exited with 0 +++
```

Before this PR, we'd usually observe just

```
...
[pid 596239] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=596455, si_uid=1000} ---
...
+++ exited with 0 +++
```

Refs
----

refs neondatabase/cloud#11387
part of #8002

## Summary of changes

Add delta layer iterator and tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
After #8022 was deployed to
staging, I noticed many cases of timeouts. After inspecting the logs, I
realized that some operations are taking ~20 seconds and they're doing
while holding shared state lock. Usually it happens right after
redeploy, because compute reconnections put high load on disks. This
commit tries to improve observability around slow operations.

Non-observability changes:
- `TimelineState::finish_change` now skips update if nothing has changed
- `wal_residence_guard()` timeout is set to 30s
This was missed in my previous attempt to mark every connection string
with an application name. See 0c3e3a8.
## Problem

For some time, we have created tenants with calls to location_conf. The
legacy "POST /v1/tenant" path was only used in some tests.

## Summary of changes

- Remove the API
- Relocate TenantCreateRequest to the controller API file (this used to
be used in both pageserver and controller APIs)
- Rewrite tenant_create test helper to use location_config API, as
control plane and storage controller do
- Update docker-compose test script to create tenants with
location_config API (this small commit is also present in
#7947)
…at_n`, fix UB for engine `std-fs` (#8186)

part of #7418

I reviewed how the VirtualFile API's `read` methods look like and came
to the conclusion that we've been using `IoBufMut` / `BoundedBufMut` /
`Slice` wrong.

This patch rectifies the situation.

# Change 1: take `tokio_epoll_uring::Slice` in the read APIs

Before, we took an `IoBufMut`, which is too low of a primitive and while
it _seems_ convenient to be able to pass in a `Vec<u8>` without any
fuzz, it's actually very unclear at the callsite that we're going to
fill up that `Vec` up to its `capacity()`, because that's what
`IoBuf::bytes_total()` returns and that's what
`VirtualFile::read_exact_at` fills.

By passing a `Slice` instead, a caller that "just wants to read into a
`Vec`" is forced to be explicit about it, adding either `slice_full()`
or `slice(x..y)`, and these methods panic if the read is outside of the
bounds of the `Vec::capacity()`.

Last, passing slices is more similar to what the `std::io` APIs look
like.

# Change 2: fix UB in `virtual_file_io_engine=std-fs`

While reviewing call sites, I noticed that the
`io_engine::IoEngine::read_at` method for `StdFs` mode has been
constructing an `&mut[u8]` from raw parts that were uninitialized.

We then used `std::fs::File::read_exact` to initialize that memory, but,
IIUC we must not even be constructing an `&mut[u8]` where some of the
memory isn't initialized.

So, stop doing that and add a helper ext trait on `Slice` to do the
zero-initialization.

# Change 3: eliminate  `read_exact_at_n`

The `read_exact_at_n` doesn't make sense because the caller can just

1. `slice = buf.slice()` the exact memory it wants to fill 
2. `slice = read_exact_at(slice)`
3. `buf = slice.into_inner()`

Again, the `std::io` APIs specify the length of the read via the Rust
slice length.
We should do the same for the owned buffers IO APIs, i.e., via
`Slice::bytes_total()`.

# Change 4: simplify filling of `PageWriteGuard`

The `PageWriteGuardBuf::init_up_to` was never necessary.
Remove it. See changes to doc comment for more details.

---

Reviewers should probably look at the added test case first, it
illustrates my case a bit.
In #8188 I forgot to specify buckets for new operations metrics. This
commit fixes that.
We hit that bug in a new test being added in PR #6528. We'd get the fix
from upstream with the next minor release anyway, but cherry-pick it now
to unblock PR #6528.

Upstream commit b1ffe3ff0b.

See
#6528 (comment)
…ed (#8198)

## Problem

Very long running downloads can be wasteful, because the heatmap they're
using is outdated after a few minutes.

Closes: #8182

## Summary of changes

- Impose a deadline on timeline downloads, using the same period as we
use for scheduling, and returning an UpdateError::Restart when it is
reached. This restart will involve waiting for a scheduling interval,
but that's a good thing: it helps let other tenants proceed.
- Refactor download_timeline so that the part where we update the state
for local layers is done even if we fall out of the layer download loop
with an error: this is important, especially for big tenants, because
only layers in the SecondaryDetail state will be considered for
eviction.
…8197)

## Problem

At the start of do_tenant_shard_split, we drop any secondary location
for the parent shards. The reconciler uses presence of secondary
locations as a condition for enabling heatmaps.

On the pageserver, child shards inherit their configuration from
parents, but the storage controller assumes the child's ObservedState is
the same as the parent's config from the prepare phase. The result is
that some child shards end up with inaccurate ObservedState, and until
something next migrates or restarts, those tenant shards aren't
uploading heatmaps, so their secondary locations are downloading
everything that was resident at the moment of the split (including
ancestor layers which are often cleaned up shortly after the split).

Closes: #8189

## Summary of changes

- Use PlacementPolicy to control enablement of heatmap upload, rather
than the literal presence of secondaries in IntentState: this way we
avoid switching them off during shard split
- test: during tenant split test, assert that the child shards have
heatmap uploads enabled.
rename the tenant test harness name

Signed-off-by: Alex Chi Z <chi@neon.tech>
…6528)

Whenever we see an XLOG_MULTIXACT_CREATE_ID WAL record, we need to
update the nextMulti and NextMultiOffset fields in the pageserver's
copy of the CheckPoint struct, to cover the new multi-XID. In
PostgreSQL, this is done by updating an in-memory struct during WAL
replay, but because in Neon you can start a compute node at any LSN,
we need to have an up-to-date value pre-calculated in the pageserver
at all times. We do the same for nextXid.

However, we had a bug in WAL ingestion code that does that: the
multi-XIDs will wrap around at 2^32, just like XIDs, so we need to do
the comparisons in a wraparound-aware fashion.

Fix that, and add tests.

Fixes issue #6520

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
The 'running' boolean was replaced with a semaphore in commit
f0e2bb7, but this initialization was missed. Remove it so that if a
test tries to access it, you get an error rather than always claiming
that the endpoint is not running.

Spotted by Arseny at
#7288 (comment)
This makes it much more convenient to use in the common case that you
want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the
argument doesn't work for the same reasons as explained in the
comments: we need to be back off to the beginning of a page if the
previous record ended at page boundary.)

I plan to use this to fix the issue that Arseny Sher called out at
#7288 (comment)
We have one pretty serious MVCC visibility bug with hot standby
replicas. We incorrectly treat any transactions that are in progress
in the primary, when the standby is started, as aborted. That can
break MVCC for queries running concurrently in the standby. It can
also lead to hint bits being set incorrectly, and that damage can last
until the replica is restarted.

The fundamental bug was that we treated any replica start as starting
from a shut down server. The fix for that is straightforward: we need
to set 'wasShutdown = false' in InitWalRecovery() (see changes in the
postgres repo).

However, that introduces a new problem: with wasShutdown = false, the
standby will not open up for queries until it receives a running-xacts
WAL record from the primary. That's correct, and that's how Postgres
hot standby always works. But it's a problem for Neon, because:

* It changes the historical behavior for existing users. Currently,
  the standby immediately opens up for queries, so if they now need to
  wait, we can breka existing use cases that were working fine
  (assuming you don't hit the MVCC issues).

* The problem is much worse for Neon than it is for standalone
  PostgreSQL, because in Neon, we can start a replica from an
  arbitrary LSN. In standalone PostgreSQL, the replica always starts
  WAL replay from a checkpoint record, and the primary arranges things
  so that there is always a running-xacts record soon after each
  checkpoint record. You can still hit this issue with PostgreSQL if
  you have a transaction with lots of subtransactions running in the
  primary, but it's pretty rare in practice.

To mitigate that, we introduce another way to collect the
running-xacts information at startup, without waiting for the
running-xacts WAL record: We can the CLOG for XIDs that haven't been
marked as committed or aborted. It has limitations with
subtransactions too, but should mitigate the problem for most users.

See #7236.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
…8204)

## Problem

We lack visibility of how much local disk space is used by secondary
tenant locations

Close: #8181

## Summary of changes

- Add `pageserver_secondary_resident_physical_size`, tagged by tenant
- Register & de-register label sets from SecondaryTenant
- Add+use wrappers in SecondaryDetail that update metrics when
adding+removing layers/timelines
## Problem
We use `build-tools` image as a base image to build other images, and it
has a pretty old `libpq-dev` installed (v13; it wasn't that old until I
removed system Postgres 14 from `build-tools` image in
#6540)

## Summary of changes
- Remove `libpq-dev` from `build-tools` image
- Set `LD_LIBRARY_PATH` for tests (for different Postgres binaries that
we use, like psql and pgbench)
- Set `PQ_LIB_DIR` to build Storage Controller
- Set `LD_LIBRARY_PATH`/`DYLD_LIBRARY_PATH` in the Storage Controller
where it calls Postgres binaries
Extracted from #6560, currently
we include multiple copies of aux files in the basebackup.

## Summary of changes

Fix the loop.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
I've missed setting `PQ_LIB_DIR` in
#8206 in
`gather-rust-build-stats` job and it fails now:
```
  = note: /usr/bin/ld: cannot find -lpq
          collect2: error: ld returned 1 exit status
          

error: could not compile `storage_controller` (bin "storage_controller") due to 1 previous error
```

https://github.com/neondatabase/neon/actions/runs/9743960062/job/26888597735

## Summary of changes
- Set `PQ_LIB_DIR` for `gather-rust-build-stats` job
…8215)

This makes it much more convenient to use in the common case that you
want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the
argument doesn't work for the same reasons as explained in the comments:
we need to be back off to the beginning of a page if the previous record
ended at page boundary.)

I plan to use this to fix the issue that Arseny Sher called out at
#7288 (comment)
RFC for "Graceful Restarts of Storage Controller Managed Clusters". 
Related #7387
## Problem

See neondatabase/cloud#14289

## Summary of changes

Check connection status after calling PQconnectStartParams

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
…llers (#8193)

Before this PR, `RemoteStorageConfig::from_toml` would support
deserializing an
empty `{}` TOML inline table to a `None`, otherwise try `Some()`.

We can instead let
* in proxy: let clap derive handle the Option
* in PS & SK: assume that if the field is specified, it must be a valid
  RemtoeStorageConfig

(This PR started with a much simpler goal of factoring out the
`deserialize_item` function because I need that in another PR).
## Problem

Tenant attachment has error paths for failures to write local
configuration, but these types of local storage I/O errors should be
considered fatal for the process. Related thread on an earlier PR that
touched this code:
#7947 (comment)

## Summary of changes

- Make errors writing tenant config fatal (abort process)
- When reading tenant config, make all I/O errors except ENOENT fatal
- Replace use of bare anyhow errors with `LoadConfigError`
Before this PR, during timeline shutdown, we'd occasionally see
log lines like this one:

```
2024-06-26T18:28:11.063402Z  INFO initial_size_calculation{tenant_id=$TENANT,shard_id=0000 timeline_id=$TIMELINE}:logical_size_calculation_task:get_or_maybe_download{layer=000000000000000000000000000000000000-000000067F0001A3950001C1630100000000__0000000D88265898}: layer file download failed, and caller has been cancelled: Cancelled, shutting down
Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1964:27
      pageserver::tenant::remote_timeline_client::RemoteTimelineClient::download_layer_file::{{closure}}
             at /home/nonroot/pageserver/src/tenant/remote_timeline_client.rs:531:13
      pageserver::tenant::storage_layer::layer::LayerInner::download_and_init::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1136:14
      pageserver::tenant::storage_layer::layer::LayerInner::download_init_and_wait::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1082:74
```

We can eliminate the anyhow backtrace with no loss of information
because the conversion to anyhow::Error happens in exactly one place.

refs #7427
jcsp and others added 15 commits July 2, 2024 17:17
I forgot a commit when merging
#8177
Signed-off-by: Alex Chi Z <chi@neon.tech>
…ait on socket event without a socket' error (#8231)

## Problem

See neondatabase/cloud#14289
and PR #8210 

## Summary of changes

Add test for problems fixed in #8210

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem

We will run this tool in the k8s cluster. To make it accessible from
k8s, we need to package it into the docker image.

part of neondatabase/cloud#14024
All the code to ensure the WAL record lands at a page boundary was
unnecessary for reproducing the original problem. In fact, it's a pretty
basic test that checks that outbound replication (= neon as publisher)
still works after restarting the endpoint. It just used to be very
broken before commit 5ceccdc, which also added this test.

To verify that:

1. Check out commit f3af5f4 (because the next commit, 7dd58e1,
fixed the same bug in a different way, making it infeasible to revert
the bug fix in an easy way)
2. Revert the bug fix from commit 5ceccdc with this:

```
diff --git a/pgxn/neon/walproposer_pg.c b/pgxn/neon/walproposer_pg.c
index 7debb6325..9f03bbd99 100644
--- a/pgxn/neon/walproposer_pg.c
+++ b/pgxn/neon/walproposer_pg.c
@@ -1437,8 +1437,10 @@ XLogWalPropWrite(WalProposer *wp, char *buf, Size nbytes, XLogRecPtr recptr)
 	 *
 	 * #5749
 	 */
+#if 0
 	if (!wp->config->syncSafekeepers)
 		XLogUpdateWalBuffers(buf, recptr, nbytes);
+#endif

 	while (nbytes > 0)
 	{
```

3. Run the test_wal_page_boundary_start regression test. It fails, as
expected

4. Apply this commit to the test, and run it again. It still fails, with
the same error mentioned in issue #5749:

```
PG:2024-06-30 20:49:08.805 GMT [1248196] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.567 GMT [1467972] LOG:  starting logical decoding for slot "sub1"
PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL:  Streaming transactions committing after 0/1532330, reading WAL from 0/1531C78.
PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.567 GMT [1467972] LOG:  logical decoding found consistent point at 0/1531C78
PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL:  There are no running transactions.
PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.568 GMT [1467972] ERROR:  could not find record while sending logically-decoded data: invalid contrecord length 312 (expected 6) at 0/1533FD8
```
## Problem

GitHub Actions complain that we use actions that depend on deprecated
Node 16:

```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: docker/setup-buildx-action@v2
```

But also, the latest `docker/setup-buildx-action` fails with the following
error:
```
/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175
            throw new Error(`Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.`);
^
Error: Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.
    at Object.rejected (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175:1)
    at Generator.next (<anonymous>)
    at fulfilled (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:29:1)
```

We can work this around by setting `cache-binary: false` for `uses:
docker/setup-buildx-action@v3`

## Summary of changes
- Update `docker/setup-buildx-action` from `v2` to `v3`, set
`cache-binary: false`
- Update `docker/login-action` from `v2` to `v3`
- Update `docker/build-push-action` from `v4`/`v5` to `v6`
## Problem

test_location_conf_churn fails on log errors when it tries to shutdown a
pageserver immediately after starting a tenant attach, like this:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8224/9761000525/index.html#/testresult/15fb6beca5c7327c

```
shutdown:shutdown{tenant_id=35f5c55eb34e7e5e12288c5d8ab8b909 shard_id=0000}:timeline_shutdown{timeline_id=30936747043353a98661735ad09cbbfe shutdown_mode=FreezeAndFlush}: failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited\n')
```

This is happening because Tenant::shutdown fires its cancellation token
early if the tenant is not fully attached by the time shutdown is
called, so the flush loop is shutdown by the time we try and flush.

## Summary of changes

- In the early-cancellation case, also set the shutdown mode to Hard to
skip trying to do a flush that will fail.
…8238)

PR #8106 was created with the assumption that no blob is larger than
`256 MiB`. Due to #7852 we have checking for *writes* of blobs larger
than that limit, but we didn't have checking for *reads* of such large
blobs: in theory, we could be reading these blobs every day but we just
don't happen to write the blobs for some reason.

Therefore, we now add a warning for *reads* of such large blobs as well.

To make deploying compression less dangerous, we therefore only assume a
blob is compressed if the compression setting is present in the config.
This also means that we can't back out of compression once we enabled
it.

Part of #5431
…#8233)

we want to run some specific pagebench test cases on dedicated hardware
to get reproducible results

run1: 1 client per tenant => characterize throughput with n tenants.
-  500 tenants
- scale 13 (200 MB database)
- 1 hour duration
- ca 380 GB layer snapshot files

run2.singleclient: 1 client per tenant => characterize latencies
run2.manyclient: N clients per tenant => characterize throughput
scalability within one tenant.
- 1 tenant with 1 client for latencies
- 1 tenant with 64 clients because typically for a high number of
connections we recommend the connection pooler
which by default uses 64 connections (for scalability)
- scale 136 (2048 MB database)
- 20 minutes each
## Problem

We record detailed histograms for all page_service op types, which
mostly aren't very interesting, but make our prometheus scrapes huge.

Closes: #8223 

## Summary of changes

- Only track GetPageAtLsn histograms on a per-timeline granularity. For
all other operation types, rely on existing node-wide histograms.
Commit 97b48c2 changes the log wait timeout from 1 second to 100
milliseconds but forgets to update the comment.
I'd like to add some constraints to the layer map we generate in tests.

(1) is the layer map that the current compaction algorithm will produce.
There is a property that for all delta layer, all delta layer overlaps
with it on the LSN axis will have the same LSN range.
(2) is the layer map that cannot be produced with the legacy compaction
algorithm.
(3) is the layer map that will be produced by the future
tiered-compaction algorithm. The current validator does not allow that
but we can modify the algorithm to allow it in the future.

## Summary of changes

Add a validator to check if the layer map is valid and refactor the test
cases to include delta layer start/end LSN.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
## Problem
Scale test doesn't exercise drain & fill.

## Summary of changes
Make scale test exercise drain & fill
@vipvap vipvap requested review from a team as code owners July 4, 2024 06:02
@vipvap vipvap requested review from skyzh, petuhovskiy, khanova and nikitakalyanov and removed request for a team July 4, 2024 06:02
Copy link

github-actions bot commented Jul 4, 2024

3000 tests run: 2885 passed, 0 failed, 115 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_statvfs_pressure_usage: debug

Code coverage* (full report)

  • functions: 32.7% (6936 of 21208 functions)
  • lines: 50.1% (54476 of 108664 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
bbb2fa7 at 2024-07-04T06:49:14.869Z :recycle:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet