Release 2024-02-28 #6956

vipvap · 2024-02-28T16:43:47Z

Release 2024-02-28

Please merge this Pull Request using 'Create a merge commit' button

It's been dead-code-at-runtime for 9 months, let's remove it. We can always re-introduce it at a later point. Came across this while working on #6861, which will touch `time_for_new_image_layer`. This is an opportunity to make that function simpler.

Over-compensating for the addition of a new per-timeline metric in #6834.

part of #6737

…e backwards (#6862)

This PR enforces aspects of `Timeline::repartition` that were already true at runtime:

- it's not called concurrently, so, bail out if it is anyway (see comment why it's not called concurrently)
- the `lsn` should never be moving backwards over the lifetime of a Timeline object, because last_record_lsn() can only move forwards over the lifetime of a Timeline object

The switch to tokio::sync::Mutex blows up the size of the `partitioning` field from 40 bytes to 72 bytes on Linux x86_64. That would be concerning if it was a hot field, but, `partitioning` is only accessed every 20s by one task, so, there won't be excessive cache pain on it. (It still sucks that it's now >1 cache line, but I need the Send-able MutexGuard in the next PR)

part of #6861
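The two invariants described for `Timeline::repartition` can be illustrated with a small Python sketch. This is not the pageserver's Rust code: the class `Partitioning`, its fields, and the error messages are hypothetical names chosen for the example, and a non-blocking `threading.Lock` stands in for the `tokio::sync::Mutex` mentioned in the description.

```python
import threading


class Partitioning:
    """Hypothetical stand-in for the state guarded by `Timeline::repartition`."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_lsn = 0  # highest LSN seen so far

    def repartition(self, lsn: int) -> int:
        # Invariant 1: never called concurrently. Bail out instead of waiting.
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("repartition() called concurrently")
        try:
            # Invariant 2: the LSN must never move backwards.
            if lsn < self._last_lsn:
                raise ValueError(f"lsn moved backwards: {lsn} < {self._last_lsn}")
            self._last_lsn = lsn
            return self._last_lsn
        finally:
            self._lock.release()
```

Failing fast on a violated invariant, rather than silently waiting or accepting the value, is what turns an assumption that "was already true at runtime" into something that is actually checked.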

## Problem

Previously we always wrote out both legacy and modern tenant config files. The legacy write enabled rollbacks, but we are long past the point where that is needed.

We still need the legacy format for situations where someone is running tenants without generations (that will be yanked as well eventually), but we can avoid writing it out at all if we do have a generation number set. We implicitly also avoid writing the legacy config if our mode is Secondary (secondary mode is newer than generations).

## Summary of changes

- Make writing legacy tenant config conditional on there being no generation number set.

* decreases checkpointing and compaction targets for even more layer files
* write 10 thousand rows 2 times instead of writing 20 thousand rows 1 time so that there is more to GC. Before it was noisily jumping between 1 and 0 layer files, now it's jumping between 19 and 20 layer files. The 0 caused an assertion error that gave the test most of its flakiness.
* larger timeout for the churn while failpoints are active thread: this is mostly so that the test is more robust on systems with more load

Fixes #3051

## Problem

We want to report how much cache was used and what the limit was.

## Summary of changes

Added one more query to sql_exporter to expose `neon.file_cache_size_limit`.

## Problem

Customers should be able to determine the size of their workload's working set to right-size their compute. Since Neon uses the Local file cache (LFC) instead of shared buffers on bigger compute nodes to cache pages, we need to externalize a means to determine the LFC hit ratio in addition to the shared buffer hit ratio.

Currently the following end user documentation https://github.com/neondatabase/website/blob/fb7cd3af0e90b74bad8c2ef1166e7798bfdefe20/content/docs/manage/endpoints.md?plain=1#L137 is wrong because it describes how to right-size a compute node based on shared buffer hit ratio.

Note that the existing functionality in extension "neon" is NOT available to end users but only to superuser / cloud_admin.

## Summary of changes

- externalize functions and views in neon extension to end users
- introduce a new view `NEON_STAT_FILE_CACHE` with the following DDL

```sql
CREATE OR REPLACE VIEW NEON_STAT_FILE_CACHE AS
WITH lfc_stats AS (
    SELECT stat_name, count
    FROM neon_get_lfc_stats() AS t(stat_name text, count bigint)
),
lfc_values AS (
    SELECT
        MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE NULL END) AS file_cache_misses,
        MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE NULL END) AS file_cache_hits,
        MAX(CASE WHEN stat_name = 'file_cache_used' THEN count ELSE NULL END) AS file_cache_used,
        MAX(CASE WHEN stat_name = 'file_cache_writes' THEN count ELSE NULL END) AS file_cache_writes,
        -- Calculate the file_cache_hit_ratio within the same CTE for simplicity
        CASE
            WHEN MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE 0 END)
               + MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END) = 0 THEN NULL
            ELSE ROUND(
                (MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END)::DECIMAL
                 / (MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END)
                    + MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE 0 END))) * 100, 2)
        END AS file_cache_hit_ratio
    FROM lfc_stats
)
SELECT file_cache_misses, file_cache_hits, file_cache_used, file_cache_writes, file_cache_hit_ratio
FROM lfc_values;
```

This view can be used by an end user as follows:

```sql
CREATE EXTENSION NEON;
SELECT * FROM neon.NEON_STAT_FILE_CACHE;
```

The output looks like the following:

```
select * from NEON_STAT_FILE_CACHE;
 file_cache_misses | file_cache_hits | file_cache_used | file_cache_writes | file_cache_hit_ratio
-------------------+-----------------+-----------------+-------------------+----------------------
           2133643 |       108999742 |             607 |          10767410 |                98.08
(1 row)
```

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [x] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [x] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist
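As a cross-check of the hit ratio arithmetic in the view, here is a minimal Python sketch of the same formula. The function name `file_cache_hit_ratio` is made up for this example; it only mirrors the `CASE ... ROUND(..., 2)` expression and is not part of the neon extension.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def file_cache_hit_ratio(hits: int, misses: int) -> Optional[Decimal]:
    """Percentage of lookups served by the LFC, rounded to two decimals.

    Returns None when there were no lookups at all, like the NULL
    branch of the CASE expression in the view.
    """
    total = hits + misses
    if total == 0:
        return None
    ratio = Decimal(hits) / Decimal(total) * 100
    return ratio.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Counters taken from the sample output in the PR description.
print(file_cache_hit_ratio(hits=108999742, misses=2133643))  # 98.08
```

Returning `None` for an idle cache avoids a division by zero and keeps "no data yet" distinguishable from a 0% hit ratio.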

…* metrics (#6909)

Use the remote_timeline_client metrics instead; they work for layer file uploads and are reasonably close to what the `pageserver_created_persistent_*` metrics were.

Should we wait for empty upload queue before calling `report_size()`?

part of #6737

## Problem

> Approaching total cache storage limit (9.25 GB of 10 GB Used)
> Least recently used caches will be automatically evicted to limit the total cache storage to 10 GB. [Learn more about cache usage.](https://docs.github.com/actions/using-workflows/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy)

From https://github.com/neondatabase/neon/actions/caches

Some of these caches are from closed/merged PRs.

## Summary of changes

- Add a workflow that deletes caches for closed branches

## Problem

README.md is missing cleanup instructions

## Summary of changes

- Add cleanup instructions
- Add instructions on how to handle errors during initialization

---------

Co-authored-by: Andreas Scherbaum <andreas@neon.tech>

## Problem

Attachment service does not do auth based on JWT scopes.

## Summary of changes

Do JWT based permission checking for requests coming into the attachment service.

Requests into the attachment service must use different tokens based on the endpoint:

* `/control` and `/debug` require `admin` scope
* `/upcall` requires `generations_api` scope
* `/v1/...` requires `pageserverapi` scope

Requests into the pageserver from the attachment service must use `pageserverapi` scope.
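The endpoint-to-scope mapping above can be sketched as a small lookup table. This is an illustrative Python sketch, not the attachment service's implementation: the helper names `required_scope` and `is_authorized`, the example request paths, and the choice to reject unlisted paths are all assumptions made for the example. Only the prefixes and scope names come from the description.

```python
from typing import Optional

# Path prefix -> required scope, as listed in the PR summary.
REQUIRED_SCOPES = (
    ("/control", "admin"),
    ("/debug", "admin"),
    ("/upcall", "generations_api"),
    ("/v1", "pageserverapi"),
)


def required_scope(path: str) -> Optional[str]:
    """Return the scope needed for `path`, or None if no rule matches."""
    for prefix, scope in REQUIRED_SCOPES:
        # Match the prefix itself or anything below it, but not "/controlx".
        if path == prefix or path.startswith(prefix + "/"):
            return scope
    return None


def is_authorized(path: str, token_scope: str) -> bool:
    """Assumption for this sketch: paths without a rule are rejected."""
    needed = required_scope(path)
    return needed is not None and token_scope == needed


# Hypothetical request paths, for illustration only.
print(is_authorized("/upcall/v1/validate", "generations_api"))  # True
print(is_authorized("/upcall/v1/validate", "admin"))            # False
```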

## Problem

Following up #6885, only alter default privileges when the public schema exists.

Signed-off-by: Alex Chi Z <chi@neon.tech>

fixes #6889

# Problem

The failure in the last 3 flaky runs on `main` is

```
test_runner/regress/test_remote_storage.py:460: in test_remote_timeline_client_calls_started_metric
    churn("a", "b")
test_runner/regress/test_remote_storage.py:457: in churn
    assert gc_result["layers_removed"] > 0
E   assert 0 > 0
```

That's this code https://github.com/neondatabase/neon/blob/cd449d66ea29ad2d7269458e90623c3ae40e1816/test_runner/regress/test_remote_storage.py#L448-L460

So, the test expects GC to remove some layers but the GC doesn't.

# Fix

My impression is that the VACUUM isn't re-using pages aggressively enough, but I can't really prove that. Tried to analyze the layer map dump but it's too complex.

So, this PR:

- Creates more churn by doing the overwrite twice.
- Forces image layer creation.

It also drive-by removes the redundant call to timeline_compact, because timeline_checkpoint already does that internally.

## Problem

LFC has a high impact on Neon application performance, but there is no way for a user to check how efficiently it is used.

## Summary of changes

Show LFC statistics in EXPLAIN ANALYZE

## Description

**Local file cache (LFC)**

A layer of caching that stores frequently accessed data from the storage layer in the local memory of the Neon compute instance. This cache helps to reduce latency and improve query performance by minimizing the need to fetch data from the storage layer repeatedly.

**Externalization of LFC in explain output**

The EXPLAIN ANALYZE output is extended to display important counts for local file cache (LFC) hits and misses. This works for both EXPLAIN text and json output.

**File cache: hits**

Whenever the Postgres backend retrieves a page/block from SMGR, and it is not found in shared buffers but is already present in the LFC, this counter is incremented.

**File cache: misses**

Whenever the Postgres backend retrieves a page/block from SMGR, and it is found neither in shared buffers nor in the LFC but is retrieved from Neon storage (page server), this counter is incremented.

Example (for explain text output)

```sql
explain (analyze,buffers,prefetch,filecache) select count(*) from pgbench_accounts;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=214486.94..214486.95 rows=1 width=8) (actual time=5195.378..5196.034 rows=1 loops=1)
   Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346
   Prefetch: hits=0 misses=1865 expired=0 duplicates=0
   File cache: hits=141826 misses=1865
   ->  Gather  (cost=214486.73..214486.94 rows=2 width=8) (actual time=5195.366..5196.025 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346
         Prefetch: hits=0 misses=1865 expired=0 duplicates=0
         File cache: hits=141826 misses=1865
         ->  Partial Aggregate  (cost=213486.73..213486.74 rows=1 width=8) (actual time=5187.670..5187.670 rows=1 loops=3)
               Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346
               Prefetch: hits=0 misses=1865 expired=0 duplicates=0
               File cache: hits=141826 misses=1865
               ->  Parallel Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.43..203003.02 rows=4193481 width=0) (actual time=0.574..4928.995 rows=3333333 loops=3)
                     Heap Fetches: 3675286
                     Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346
                     Prefetch: hits=0 misses=1865 expired=0 duplicates=0
                     File cache: hits=141826 misses=1865
```

The json output uses the following keys and provides integer values for those keys:

```
...
"File Cache Hits": 141826,
"File Cache Misses": 1865
...
```

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

## Problem

We want to show connection counts to console users.

## Summary of changes

Start exporting connection counts grouped by database name and connection state.

Noticed that we are failing to handle `Result::Err` when entering a gate for logical size calculation. Audited the rest of the gate enters, which seem fine, and unified two instances.

Noticed that the gate guard allows removing a failpoint, then noticed that the adjacent failpoint was blocking the executor thread instead of using `pausable_failpoint!`; fixed both.

eviction_task.rs now maintains a gate guard as well.

Cc: #4733

## Problem

Data team cannot distinguish between cold start and not cold start.

## Summary of changes

Report `is_cold_start` to analytics.

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>

## Problem

## Summary of changes

Updates the neon.tech link to point to a /github page in order to correctly attribute visits originating from the repo.

Rebased version of #5234, part of #6768

This consists of three parts:

1. A refactoring and new contract for implementing and testing compaction.

   The logic is now in a separate crate, with no dependency on the 'pageserver' crate. It defines an interface that the real pageserver must implement, in order to call the compaction algorithm. The interface models things like delta and image layers, but just the parts that the compaction algorithm needs to make decisions. That makes it easier to unit test the algorithm and experiment with different implementations.

   I did not convert the current code to the new abstraction, however. When the compaction algorithm is set to "Legacy", we just use the old code. It might be worthwhile to convert the old code to the new abstraction, so that we can compare the behavior of the new algorithm against the old one, using the same simulated cases. If we do that, we have to be careful that the converted code really is equivalent to the old.

   This includes only trivial changes to the main pageserver code. All the new code is behind a tenant config option. So this should be pretty safe to merge, even if the new implementation is buggy, as long as we don't enable it.

2. A new compaction algorithm, implemented using the new abstraction.

   The new algorithm is tiered compaction. It is inspired by the PoC at PR #4539, although I did not use that code directly, as I needed the new implementation to fit the new abstraction. The algorithm here is less advanced; I did not implement partial image layers, for example. I wanted to keep it simple on purpose, so that as we add bells and whistles, we can see the effects using the included simulator.

   One difference to #4539 and your typical LSM tree implementations is how we keep track of the LSM tree levels. This PR doesn't have a permanent concept of a level, tier or sorted run at all. There are just delta and image layers. However, when compaction starts, we look at the layers that exist, and arrange them into levels, depending on their shapes. That is ephemeral: when the compaction finishes, we forget that information. This allows the new algorithm to work without any extra bookkeeping. That makes it easier to transition from the old algorithm to new, and back again.

   There is just a new tenant config option to choose the compaction algorithm. The default is "Legacy", meaning the current algorithm in 'main'. If you set it to "Tiered", the new algorithm is used.

3. A simulator, which implements the new abstraction.

   The simulator can be used to analyze write and storage amplification, without running a test with the full pageserver. It can also draw an SVG animation of the simulation, to visualize how layers are created and deleted.

   To run the simulator: `cargo run --bin compaction-simulator run-suite`

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

)

## Problem

This is a precursor to adding a convenience CLI for the storage controller.

## Summary of changes

- move controller api structs into pageserver_api::controller_api to make them visible to other crates
- rename pageserver_api::control_api to pageserver_api::upcall_api to match the /upcall/v1/ naming in the storage controller.

Why here rather than a totally separate crate? It's convenient to have all the pageserver-related stuff in one place, and if we ever wanted to move it to a different crate it's super easy to do that later.

## Problem

After commit [840abe3] (store AUX files as deltas) we avoid quadratic growth of storage size when storing LR snapshots but get a quadratic slowdown of reconstruct time. As a result, storing 70k snapshots at my local Neon instance took more than 3 hours, and starting the node (creation of basebackup) took ~10 minutes. In prod, 70k AUX files cause an increase of startup time to 40 minutes: https://neondb.slack.com/archives/C03F5SM1N02/p1708513010480179

## Summary of changes

Enforce storing the full AUX directory (some analog of FPI) every 1024 files. Time of creation of 70k snapshots is reduced to 6 minutes and startup time to 1.5 minutes (100 seconds).

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
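The effect of periodically storing a full AUX directory can be illustrated with a toy model. This Python sketch is not the pageserver code: the `AuxLog` class and its record layout are hypothetical, and it assumes "every 1024 files" means that a full image is written once 1024 deltas have accumulated since the previous one. The point is that reconstruction only replays the deltas after the most recent full image, so its cost stays bounded by the interval instead of growing with the total number of files.

```python
FULL_IMAGE_INTERVAL = 1024  # interval named in the PR summary


class AuxLog:
    """Toy append-only log of AUX file updates (hypothetical model)."""

    def __init__(self, interval=FULL_IMAGE_INTERVAL):
        self.interval = interval
        self.records = []          # ("full", dict) or ("delta", (path, content))
        self._current = {}         # directory state used to build full images
        self._deltas_since_full = 0

    def put(self, path, content):
        self._current[path] = content
        if not self.records or self._deltas_since_full >= self.interval:
            # Store the whole directory, similar in spirit to a full-page image.
            self.records.append(("full", dict(self._current)))
            self._deltas_since_full = 0
        else:
            self.records.append(("delta", (path, content)))
            self._deltas_since_full += 1

    def reconstruct(self):
        """Return (directory, number_of_deltas_replayed)."""
        start = len(self.records) - 1
        while start >= 0 and self.records[start][0] != "full":
            start -= 1
        if start < 0:
            return {}, 0
        directory = dict(self.records[start][1])
        replayed = 0
        # Everything after the last full image is a delta by construction.
        for _kind, (path, content) in self.records[start + 1:]:
            directory[path] = content
            replayed += 1
        return directory, replayed


log = AuxLog(interval=4)
for i in range(10):
    log.put(f"snap-{i}", b"data")
directory, replayed = log.reconstruct()
print(len(directory), replayed)  # 10 4
```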

Allowing eviction of wanted deleted layers is something I forgot to implement in #5645. This PR makes it possible to evict such layers, which should reduce the number of hanging evictions.

Fixes: #6928

Co-authored-by: Christian Schwarz <christian@neon.tech>

## Problem

Starting up the pageserver before the storage controller is ready can lead to a round of reconciliation, which leads to the previous tenant being shut down. This disturbs some tests.

## Summary of changes

Wait for the storage controller to become ready on neon env start-up.

Closes #6724

normal runtime seems to be 3min, add 20min timeout.

…6842 (#6938)

This reverts commits 587cb70 (PR #6661) and fcbe9fb (PR #6842).

Conflicts:

    pageserver/src/tenant.rs
    pageserver/src/tenant/timeline.rs

The conflicts were with

* pageserver: adjust checkpoint distance for sharded tenants (#6852)
* pageserver: add vectored get implementation (#6576)

Also we had to keep the `allowed_errors` to make `test_forward_compatibility` happy, see the PR thread on GitHub for details.

## Problem

The vectored read path proposed in #6576 seems to be functionally correct, but in my testing (see below) it is about 10-20% slower than the naive sequential vectored implementation.

## Summary of changes

There are three parts to this PR:

1. Supporting vectored blob reads. This is actually trickier than it sounds because on-disk blobs are prefixed with a variable length size header. Since the blobs are not necessarily fixed size, we need to juggle the offsets such that the callers can retrieve the blobs from the resulting buffer.

2. Merge disk read requests issued by the vectored read path up to a maximum size. Again, the merging is complicated by the fact that blobs are not fixed size. We keep track of the begin and end offset of each blob and pass them into the vectored blob reader. In turn, the reader will return a buffer and the offsets at which the blobs begin and end.

3. A benchmark for basebackup requests against tenant with large SLRU block counts is added. This required a small change to pagebench and a new config variable for the pageserver which toggles the vectored get validation.

We can probably optimise things further by adding a little bit of concurrency for our IO. In principle, it's as simple as spawning a task which deals with issuing IO and doing the serialisation and handling on the parent task which receives input via a channel.
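The read-merging idea in part 2 can be sketched as follows. This is a simplified Python illustration, not the pageserver's vectored blob reader: the function `merge_reads` is a hypothetical name, blobs are modelled as plain `(start, end)` byte ranges without the variable length size header, and only exactly contiguous blobs are merged (an assumption for this example).

```python
from typing import List, Tuple

Read = Tuple[int, int, List[Tuple[int, int]]]


def merge_reads(blobs: List[Tuple[int, int]], max_size: int) -> List[Read]:
    """Coalesce contiguous blob ranges into larger disk reads.

    Each result is (read_start, read_end, blob_offsets), where blob_offsets
    holds the (begin, end) position of every blob inside the read buffer,
    so callers can slice their blob back out of the merged read.
    """
    reads: List[Read] = []
    for start, end in sorted(blobs):
        if reads:
            r_start, r_end, offsets = reads[-1]
            # Extend the previous read only if the blob is adjacent and the
            # merged read stays within the size limit.
            if start == r_end and end - r_start <= max_size:
                offsets.append((start - r_start, end - r_start))
                reads[-1] = (r_start, end, offsets)
                continue
        reads.append((start, end, [(0, end - start)]))
    return reads


for read in merge_reads([(0, 10), (10, 25), (25, 40), (100, 120)], max_size=30):
    print(read)
# (0, 25, [(0, 10), (10, 25)])
# (25, 40, [(0, 15)])
# (100, 120, [(0, 20)])
```

Tracking per-blob offsets relative to the start of each merged read is what lets variable-sized blobs share one buffer.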

## Problem

Currently, after updating `Dockerfile.build-tools` in a PR, it requires a manual action to make it `pinned`, i.e., the default for everyone. It also makes all opened PRs use such images (even created in the PR and without such changes). This PR overhauls the way we build and use `build-tools` image (and uses the image from Docker Hub).

## Summary of changes

- The `neondatabase/build-tools` image gets tagged with the latest commit sha for the `Dockerfile.build-tools` file
- Each PR calculates the tag for `neondatabase/build-tools`, tries to pull it, and rebuilds the image with such tag if it doesn't exist.
- Use `neondatabase/build-tools` as a default image
- When running on the `main` branch, create a `pinned` tag and push it to ECR
- Use `concurrency` to ensure we don't build `build-tools` image for the same commit in parallel from different PRs

## Problem

Hard to find error reasons by endpoint for HTTP flow.

## Summary of changes

I want all root spans to have session id and endpoint id. I want all root spans to be consistent.

## Problem

The "z" and "y" letters are switched on the English keyboard, and I'm used to a German keyboard. Very embarrassing.

## Summary of changes

Fix syntax error in README

Co-authored-by: Andreas Scherbaum <andreas@neon.tech>

## Problem

`pin-build-tools-image` job doesn't have access to secrets and thus fails. Missed in the original PR[0]

- [0] #6795

## Summary of changes

- pass secrets to `pin-build-tools-image` job

…quests (#6944)

## Problem

Callers of the timeline creation API may issue timeline GETs ahead of creation to e.g. check if their intended timeline already exists, or to learn the LSN of a parent timeline.

Although the timeline creation API already triggers activation of a timeline if it's currently waiting to activate, the GET endpoint doesn't, so such callers will encounter 503 responses for several minutes after a pageserver restarts, while tenants are lazily warming up.

The original scope of which APIs will activate a timeline was quite small, but really it makes sense to do it for any API that needs a particular timeline to be active.

## Summary of changes

- In the timeline detail GET handler, use wait_to_become_active, which triggers immediate activation of a tenant if it was currently waiting for the warmup semaphore, then waits up to 5 seconds for the activation to complete. If it doesn't complete promptly, we return a 503 as before.
- Modify active_timeline_for_active_tenant to also use wait_to_become_active, which indirectly makes several other timeline-scope request handlers fast-activate a tenant when called. This is important because a timeline creation flow could also use e.g. get_lsn_for_timestamp as a precursor to creating a timeline.
- There is some risk to this change: an excessive number of timeline GET requests could cause too many tenant activations to happen at the same time, leading to excessive queue depth to the S3 client. However, this was already the case for e.g. many concurrent timeline creations.
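The "request activation, wait briefly, otherwise return 503" behaviour can be sketched with `asyncio`. This is a toy Python model, not the pageserver's Rust handler: the `Tenant` class, its two events, and `timeline_detail_handler` are hypothetical stand-ins. Only the 5 second wait and the 503 fallback come from the description.

```python
import asyncio

ACTIVATION_TIMEOUT_SECS = 5.0  # wait mentioned in the PR description


class Tenant:
    """Hypothetical tenant that activates lazily unless asked to hurry."""

    def __init__(self):
        self.active = asyncio.Event()        # set once activation completes
        self.activate_now = asyncio.Event()  # set to skip the warmup queue

    async def wait_to_become_active(self, timeout: float) -> bool:
        # Ask for immediate activation, then wait a bounded time for it.
        self.activate_now.set()
        try:
            await asyncio.wait_for(self.active.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def timeline_detail_handler(tenant: Tenant,
                                  timeout: float = ACTIVATION_TIMEOUT_SECS) -> int:
    """Return an HTTP-like status code: 200 if active in time, else 503."""
    if await tenant.wait_to_become_active(timeout):
        return 200
    return 503


async def demo() -> None:
    tenant = Tenant()

    async def activate_when_requested():
        await tenant.activate_now.wait()
        tenant.active.set()

    activator = asyncio.create_task(activate_when_requested())
    print(await timeline_detail_handler(tenant, timeout=1.0))  # 200
    await activator


asyncio.run(demo())
```

Bounding the wait keeps the handler responsive: a caller gets either a prompt answer or a 503 it can retry, rather than a request that hangs for the whole warmup period.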

## Problem

We build compute-tools binary twice, in `compute-node` and in `compute-tools` jobs, and we build them slightly differently:

- `cargo build --locked --profile release-line-debug-size-lto` (previously in `compute-node`)
- `mold -run cargo build -p compute_tools --locked --release` (previously in `compute-tools`)

Before:

- compute-node: **6m 34s**
- compute-tools (as a separate job): **7m 47s**

After:

- compute-node: **7m 34s**
- compute-tools (as a separate step, within compute-node job): **5s**

## Summary of changes

- Move compute-tools image creation to `Dockerfile.compute-node`
- Delete `Dockerfile.compute-tools`

github-actions · 2024-02-28T17:23:13Z

2424 tests run: 2303 passed, 0 failed, 121 skipped (full report)

Flaky tests (1)

Postgres 15

test_neon_superuser: debug

Code coverage* (full report)

functions: 28.6% (6887 of 24065 functions)
lines: 47.1% (42126 of 89460 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
54586d6 at 2024-02-28T17:23:13.113Z

danieltprice · 2024-02-29T19:31:56Z

Reviewed for 03-01-2024 changelog

problame and others added 30 commits February 26, 2024 10:05

remove gc_feedback mechanism (#6863)

dedf66b

It's been dead-code-at-runtime for 9 months, let's remove it. We can always re-introduce it at a later point. Came across this while working on #6861, which will touch `time_for_new_image_layer`. This is an opportunity to make that function simpler.

pageserver: remove two obsolete/unused per-timeline metrics (#6893)

5273c94

Over-compensating for the addition of a new per-timeline metric in #6834.

part of #6737

Expose LFC cache size limit from sql_exporter (#6912)

459c2af

## Problem

We want to report how much cache was used and what the limit was.

## Summary of changes

Added one more query to sql_exporter to expose `neon.file_cache_size_limit`.

Update README, include cleanup details (#6816)

0881d4f

## Problem

README.md is missing cleanup instructions

## Summary of changes

- Add cleanup instructions
- Add instructions on how to handle errors during initialization

---------

Co-authored-by: Andreas Scherbaum <andreas@neon.tech>

fix: only alter default privileges when public schema exists (#6914)

b2bbc20

## Problem

Following up #6885, only alter default privileges when the public schema exists.

Signed-off-by: Alex Chi Z <chi@neon.tech>

Export connection counts from sql_exporter (#6926)

2991d01

## Problem

We want to show connection counts to console users.

## Summary of changes

Start exporting connection counts grouped by database name and connection state.

proxy: introduce is cold start for analytics (#6902)

896d513

## Problem

Data team cannot distinguish between cold start and not cold start.

## Summary of changes

Report `is_cold_start` to analytics.

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>

readme: Update Neon link URL (#6918)

c8ac4c0

## Problem

## Summary of changes

Updates the neon.tech link to point to a /github page in order to correctly attribute visits originating from the repo.

build: add a timeout for test-images (#6942)

fcb77f3

normal runtime seems to be 3min, add 20min timeout.

proxy: refactor span usage (#6946)

48957e2

## Problem

Hard to find error reasons by endpoint for HTTP flow.

## Summary of changes

I want all root spans to have session id and endpoint id. I want all root spans to be consistent.

English keyboard has "z" and "y" switched (#6947)

edd8097

## Problem

The "z" and "y" letters are switched on the English keyboard, and I'm used to a German keyboard. Very embarrassing.

## Summary of changes

Fix syntax error in README

Co-authored-by: Andreas Scherbaum <andreas@neon.tech>

CI(pin-build-tools-image): pass secrets to the job (#6949)

60a2324

## Problem

`pin-build-tools-image` job doesn't have access to secrets and thus fails. Missed in the original PR[0]

- [0] #6795

## Summary of changes

- pass secrets to `pin-build-tools-image` job

jcsp and others added 2 commits February 28, 2024 14:53

vipvap requested review from a team as code owners February 28, 2024 16:43

vipvap requested review from knizhnik, arssher, khanova, jcsp and mtyazici and removed request for a team February 28, 2024 16:43

petuhovskiy approved these changes Feb 29, 2024

petuhovskiy merged commit 1ebd389 into release Feb 29, 2024
171 of 172 checks passed

petuhovskiy deleted the rc/2024-02-28 branch February 29, 2024 16:39

problame mentioned this pull request Mar 4, 2024

Release 2024-02-27 #6937

Closed

10 tasks
