
Release 2024-01-15 #6354

Merged
merged 78 commits into from
Jan 15, 2024

Conversation


@vipvap vipvap commented Jan 15, 2024

Release 2024-01-15

Please merge this PR using 'Create a merge commit'!

arssher and others added 30 commits January 1, 2024 20:44
Otherwise they are left orphaned when compute_ctl is terminated with a
signal. This was invisible most of the time because normally neon_local or k8s
kills postgres directly and compute_ctl then finishes gracefully. However, in
some tests compute_ctl gets stuck waiting for sync-safekeepers, which
intentionally never ends because the safekeepers are offline, and we want to
stop compute_ctl without leaving orphans behind.

This is a rather rough approach which doesn't wait for the children to
terminate. A better way would be to convert compute_ctl to async, which would
make waiting easy.
It was giving WAL only up to commit_lsn instead of flush_lsn, so recovery of
uncommitted WAL since cdb08f0 hung. Add a test for this.
To enable them in the safekeeper as well.
Just a copy-paste from the pageserver.
To exercise MAX_SEND_SIZE sending from the safekeeper; we've had a bug with WAL
records torn across several XLogData messages. Add a failpoint to the safekeeper
to slow down sending. Also check the standby log for corrupted-WAL complaints.

In passing, make the test a bit simpler, e.g. we don't need explicit commits
since autocommit is enabled by default.

https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719
neondatabase/cloud#9057
…epers.

As the protocol demands. Not following it makes the standby complain about
corrupted WAL in various ways.

https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719
closes neondatabase/cloud#9057
…6219)

The tool still needs a lot of work. These are the easiest fixes and
features:
- use a similar adaptive config with s3 as remote_storage, use retries
- process only particular tenants

Tenants need to be from the correct region and they are not deduplicated,
but the feature is useful for re-checking a small number of tenants after
a large run.
- rename walpop_log to wp_log
- also create wpg_log, which is used in postgres-specific code
- in passing, format messages to start with lower case
Already checked in GetLogRepRestartLSN, a rebase artifact.
## Problem

The version of pytest we were using emits a number of
DeprecationWarnings on latest python: these are fixed in latest release.

boto3 and python-dateutil also have deprecation warnings, but
unfortunately these aren't fixed upstream yet.



## Summary of changes

- Update pytest
- Update boto3 (this doesn't fix deprecation warnings, but by the time I
figured that out I had already done the update, and it's good hygiene
anyway)
## Problem
For context, this problem was observed in a research project where we
try to make neon run in multiple regions and I was asked by @hlinnaka to
make this PR.

In our project, we use the pageserver in an unconventional way, sending a
larger number of requests to it than normal (imagine postgres without the
buffer pool). I measured the time from the moment a WAL record left the
safekeeper to when it reached the pageserver
([code](https://github.com/umd-dslam/sunstorm-neon/blob/e593db1f5ab2505eb176c9faaf2e9b9ba36cb2c4/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs#L282-L287))
and observed that when the number of get_page_at_lsn requests was high,
the WAL receiving time increased significantly (see the left side of the
graphs below).

Upon further investigation, I found that the delay was caused by this
line


https://github.com/neondatabase/neon/blob/d2ca4109191e92a9da340184e5bc71768853fe8e/pageserver/src/tenant/timeline.rs#L2348

The `get_layer_for_write` method is called for every value during WAL
ingestion, and it tries to acquire the layers' write lock every time; this
results in high contention when the read lock is also acquired frequently.


![Untitled](https://github.com/neondatabase/neon/assets/6244849/85460f4d-ead1-4532-bc64-736d0bfd7f16)

![Untitled2](https://github.com/neondatabase/neon/assets/6244849/84199ab7-5f0e-413b-a42b-f728f2225218)

## Summary of changes

It is unnecessary to call `get_layer_for_write` repeatedly for all
values in a WAL message, since they end up in the same in-memory layer
anyway, so I created batched versions of `InMemoryLayer::put_value`,
`InMemoryLayer::put_tombstone`, `Timeline::put_value`, and
`Timeline::put_tombstone` that acquire the locks once for a batch of
values.

Additionally, `DatadirModification` is changed to store multiple
versions of uncommitted values, and `WalIngest::ingest_record()` can now
ingest records without immediately committing them.

With these new APIs, the ingestion loop can commit once every
`ingest_batch_size` records. The `ingest_batch_size` variable is
exposed as a config option. Setting it to 1 gives the same behavior as
before this change. I found that a value of 100 seems to work best, and
you can see its effect on the right side of the above graphs.
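The lock-batching idea above can be sketched as follows. This is an illustrative toy, not the actual neon types: `InMemoryLayer`, `Timeline`, and the `(lsn, value)` representation here are simplified stand-ins, and the point is only the difference between one lock acquisition per value and one per batch.

```rust
use std::sync::Mutex;

// Simplified stand-in for an in-memory layer guarded by a lock.
struct InMemoryLayer {
    values: Vec<(u64, String)>, // (lsn, value) pairs
}

struct Timeline {
    layer: Mutex<InMemoryLayer>,
}

impl Timeline {
    // Per-value path: one lock acquisition per value (high contention
    // when readers also take the lock frequently).
    fn put_value(&self, lsn: u64, value: String) {
        let mut layer = self.layer.lock().unwrap();
        layer.values.push((lsn, value));
    }

    // Batched path: one lock acquisition for the whole batch.
    fn put_values(&self, batch: Vec<(u64, String)>) {
        let mut layer = self.layer.lock().unwrap();
        layer.values.extend(batch);
    }
}

fn main() {
    let tl = Timeline {
        layer: Mutex::new(InMemoryLayer { values: Vec::new() }),
    };
    tl.put_value(1, "a".into());
    tl.put_values(vec![(2, "b".into()), (3, "c".into())]);
    assert_eq!(tl.layer.lock().unwrap().values.len(), 3);
    println!("ok");
}
```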

---------

Co-authored-by: John Spray <john@neon.tech>
In tests that evict layers, explicit eviction can race with automatic
eviction of the same layer and result in a 304.
## Problem

We need to add one more patch to pgbouncer (for
#5801). I've decided to
cherry-pick all required patches to a pgbouncer fork
(`neondatabase/pgbouncer`) and use it instead.

See
https://github.com/neondatabase/pgbouncer/releases/tag/pgbouncer_1_21_0-neon-1

## Summary of changes
- Revert the previous patch (for deallocate/discard all) — the fork
already contains it.
- Remove `libssl-dev` dependency — we build pgbouncer without `openssl`
support.
- Clone git tag and build pgbouncer from source code.
It has caveats, such as creating a half-empty segment which can't be
offloaded. Instead, we'll pursue the pull_timeline approach, seeding the new
state from some peer.
## Problem


https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6254/7388706419/index.html#suites/5a4b8734277a9878cb429b80c314f470/e54c4f6f6ed22672

## Summary of changes

Permit the log message: because the test helper's detach function
increments the generation number, a detach/attach cycle can cause the
error if the test runner node is slow enough for the opportunistic
deletion queue flush on detach not to complete by the time we call
attach.
If there is some secondary shard for a tenant on the same
node as an attached shard, the secondary shard could trip up
this code and cause page_service to incorrectly
get an error instead of finding the attached shard.
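The shard-resolution fix described above can be sketched like this. All names are illustrative stand-ins, not the actual pageserver types: the point is that lookup must skip secondary shards rather than let their presence produce an error.

```rust
// Simplified shard modes; the real pageserver tracks much more state.
#[derive(Debug, PartialEq)]
enum ShardMode {
    Attached,
    Secondary,
}

struct Shard {
    id: u32,
    mode: ShardMode,
}

// Resolve the shard serving page requests: consider only attached shards,
// ignoring any secondary shard for the same tenant on this node.
fn resolve_attached(shards: &[Shard]) -> Option<u32> {
    shards
        .iter()
        .find(|s| s.mode == ShardMode::Attached)
        .map(|s| s.id)
}

fn main() {
    let shards = vec![
        Shard { id: 0, mode: ShardMode::Secondary },
        Shard { id: 1, mode: ShardMode::Attached },
    ];
    // The secondary shard does not trip up the lookup.
    assert_eq!(resolve_attached(&shards), Some(1));
    println!("ok");
}
```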
Previously, we would wait for the LSN to be visible on whichever
timeline we happened to load at the start of the connection, then
proceed to look up the correct timeline for the key and do the read.

If the timeline holding the key was behind the timeline we used
for the LSN wait, then we might serve an apparently-successful read result
that actually contains data from behind the requested lsn.
This does not have a functional impact, but enables all
the logging in this code to include the shard_id
label.
Otherwise an assertion in CONCURRENT_BACKGROUND_TASKS will
trip if you try to run the pageserver on a single core.
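A minimal sketch of the single-core fix: clamp the derived concurrency limit to at least 1 so the non-zero assertion can't trip. The divisor below is a made-up example for illustration, not neon's actual formula.

```rust
// Derive a background-task concurrency limit from the core count,
// clamped so a single-core host still gets a limit of 1 (hypothetical
// formula for illustration).
fn concurrent_background_tasks(num_cores: usize) -> usize {
    std::cmp::max(1, num_cores / 2)
}

fn main() {
    // Single core: limit is 1, so an assert!(limit > 0) cannot fire.
    assert_eq!(concurrent_background_tasks(1), 1);
    assert_eq!(concurrent_background_tasks(8), 4);
    println!("ok");
}
```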
… not found (#6262)

## Problem

- When a client requests a key that isn't found in any shard on the node
(edge case that only happens if a compute's config is out of date), we
should prompt them to reconnect (as this includes a backoff), since they
will not be able to complete the request until they eventually get a
correct pageserver connection string.
- QueryError::Other is used excessively: this contains a type-ambiguous
anyhow::Error and is logged very verbosely (including backtrace).

## Summary of changes

- Introduce PageStreamError to replace use of anyhow::Error in request
handlers for getpage, etc.
- Introduce Reconnect and NotFound variants to QueryError
- Map the "shard routing error" case to PageStreamError::Reconnect ->
QueryError::Reconnect
- Update type conversions for LSN timeouts and tenant/timeline not found
errors to use PageStreamError::NotFound->QueryError::NotFound
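The error mapping above can be sketched as a `From` conversion. The variant names follow the PR text, but the payloads and the enums themselves are simplified here for illustration, not the real definitions.

```rust
// Simplified request-handler error, replacing type-ambiguous anyhow::Error.
#[derive(Debug)]
enum PageStreamError {
    Reconnect(String), // e.g. shard routing error: client should reconnect
    NotFound(String),  // tenant/timeline not found, LSN timeout, etc.
    Other(String),
}

// Simplified protocol-level error.
#[derive(Debug, PartialEq)]
enum QueryError {
    Reconnect,
    NotFound,
    Other,
}

// Map handler errors onto protocol errors, dropping the detail strings.
impl From<PageStreamError> for QueryError {
    fn from(e: PageStreamError) -> Self {
        match e {
            PageStreamError::Reconnect(_) => QueryError::Reconnect,
            PageStreamError::NotFound(_) => QueryError::NotFound,
            PageStreamError::Other(_) => QueryError::Other,
        }
    }
}

fn main() {
    let q: QueryError = PageStreamError::Reconnect("shard routing error".into()).into();
    assert_eq!(q, QueryError::Reconnect);
    println!("ok");
}
```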
Implement API for cloning a single timeline inside a safekeeper. Also
add API for calculating a sha256 hash of WAL, which is used in tests.

The `/copy` API works by copying objects inside S3 for all but the last
segments; the last segments are copied on disk. A special temporary
directory is created for the timeline, because the copy can take a long
time, especially for large timelines. After all file segments have been
prepared, this directory is mounted into the main tree and the timeline is
loaded into memory.

Some caveats:
- large timelines can take a long time to copy, because we need to
copy many S3 segments
- the caller should wait for the HTTP call to finish indefinitely and not
close the HTTP connection, because doing so stops the process, which is
not continued in the background
- `until_lsn` must be a valid LSN, otherwise bad things can happen
- the API will return 200 if the specified `timeline_id` already exists, even if
it's not a copy
- each safekeeper will try to copy S3 segments, so it's better not to
call this API in parallel on different safekeepers
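The segment split described above (server-side S3 copy for all but the last segment, on-disk copy for the final one) can be sketched as follows. The segment names and the split helper are illustrative only.

```rust
// Split a timeline's WAL segments into the part copied server-side in S3
// and the final (possibly partial) segment copied on disk.
fn split_for_copy<'a>(segments: &'a [&'a str]) -> (&'a [&'a str], Option<&'a str>) {
    match segments.split_last() {
        Some((last, rest)) => (rest, Some(*last)),
        None => (segments, None),
    }
}

fn main() {
    // Hypothetical segment file names, in order.
    let segs = [
        "000000010000000000000001",
        "000000010000000000000002",
        "000000010000000000000003",
    ];
    let (s3_copy, disk_copy) = split_for_copy(&segs);
    assert_eq!(s3_copy.len(), 2);
    assert_eq!(disk_copy, Some("000000010000000000000003"));
    println!("ok");
}
```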
arpad-m and others added 2 commits January 13, 2024 09:15
Fixes the race condition between timeline creation and tenant deletion
outlined in #6255.

Related: #5914, which is a similar race condition about the uninit
marker file.

Fixes #6255
## Problem

See
https://github.com/orgs/neondatabase/projects/49/views/13?pane=issue&itemId=48282912

## Summary of changes


Do not suspend compute if there are active autovacuum workers.


---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
@vipvap vipvap requested review from a team as code owners January 15, 2024 06:01
@vipvap vipvap requested review from knizhnik, petuhovskiy, shanyp and piercypixel and removed request for a team January 15, 2024 06:01
vipvap and others added 4 commits January 15, 2024 09:28
…#6286)

Release PR #6286 got
accidentally merged-by-squash instead of merged-by-merge-commit.

This commit shows how things would have looked if 6286 had been
merged by merge commit.

```
git reset --hard 9f13277
git merge --no-ff 5c0264b
```

Co-authored-by: Christian Schwarz <christian@neon.tech>
@problame
Contributor

Release #6286 was accidentally merged by squash instead of by merge commit.

Cleaned this up together with @arpad-m as follows:

Created an alternative-reality branch that looks like what would have happened if we had merged 6286 by merge commit: origin/releases/2024-01-08--not-squashed

Then pushed the following commits into this branch:

commit 483b66d383b0ebbdd85e2b2a3df12a88cf58ff38
Merge: 9f1327772 5c0264b59
Author: vipvap <91739071+vipvap@users.noreply.github.com>
Date:   Mon Jan 8 09:26:27 2024 +0000

    Merge branch 'release' into releases/2024-01-08 (not-squashed merge of #6286)
    
    Release PR https://github.com/neondatabase/neon/pull/6286 got
    accidentally merged-by-squash instead of merge-by-merge-commit.
    
    This commit shows how things would look like if 6286 had been
    merged-by-squash.
    
    ```
    git reset --hard 9f1327772
    git merge --no-ff 5c0264b59119c67a121d2fc81bf7da3abe8afd90
    ```
    
    Co-authored-by: Christian Schwarz <christian@neon.tech>

commit 21315e80bc74b7c75342a574403f9aa05d469d93
Merge: 31a4eb40b 483b66d38
Author: Christian Schwarz <christian@neon.tech>
Date:   Mon Jan 15 09:31:07 2024 +0000

    Merge branch 'releases/2024-01-08--not-squashed' into releases/2024-01-15

commit d424f2b7c81b7448729e06917e6d328d91910992
Author: Christian Schwarz <christian@neon.tech>
Date:   Mon Jan 15 09:36:22 2024 +0000

    empty commit so we can produce a merge commit

commit 2f0f9edf339343c50c888e451ed5033b30da3df2
Merge: d424f2b7c aa72a2266
Author: Christian Schwarz <christian@neon.tech>
Date:   Mon Jan 15 09:36:42 2024 +0000

    Merge remote-tracking branch 'origin/release' into releases/2024-01-15


github-actions bot commented Jan 15, 2024

2250 tests run: 2165 passed, 0 failed, 85 skipped (full report)


Flaky tests (5)

Postgres 15

  • test_crafted_wal_end[wal_record_crossing_segment_followed_by_small_one]: release
  • test_crafted_wal_end[last_wal_record_crossing_segment]: release, debug
  • test_tenant_delete_races_timeline_creation: debug

Postgres 14

  • test_tenant_delete_races_timeline_creation: debug

Code coverage (full report)

  • functions: 54.6% (10207 of 18705 functions)
  • lines: 81.4% (58656 of 72018 lines)

The comment gets automatically updated with the latest test results
2f0f9ed at 2024-01-15T11:18:47.775Z :recycle:

@problame problame merged commit 93450f1 into release Jan 15, 2024
45 checks passed
@problame problame deleted the releases/2024-01-15 branch January 15, 2024 13:30
@danieltprice
Contributor

danieltprice commented Jan 16, 2024

I see two items here for the Friday changelog regarding compute suspend when there is an active logical replication connection & when autovacuum is running. Please let me know if I missed anything that is user-visible.
