Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: fix API-driven secondary downloads possibly colliding with background downloads #7848

Merged
merged 5 commits into from
May 23, 2024

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented May 22, 2024

Problem

We've seen some strange behaviors when doing lots of migrations involving secondary locations. One of these was where a tenant was apparently stuck in the Scheduler::running list, but didn't appear to be making any progress. Another was a shutdown hang (https://github.com/neondatabase/cloud/issues/13576).

Summary of changes

  • Fix one issue (probably not the only one) where a tenant in the pending list could proceed to spawn even if the same tenant already had a running task via handle_command (this could have resulted in a weird value of SecondaryProgress)
  • Add various extra logging:
    • log before as well as after layer downloads so that it would be obvious if we were stuck in remote storage code (we shouldn't be, it has built in timeouts)
    • log the number of running + pending jobs from the scheduler every time it wakes up to do a scheduling iteration (~10s) -- this is quite chatty, but not compared with the volume of logs on a busy pageserver. It should give us confidence that the scheduler loop is still alive, and visibility of how many tasks the scheduler thinks are running.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels May 22, 2024
@jcsp jcsp requested a review from a team as a code owner May 22, 2024 20:14
@jcsp jcsp requested a review from arpad-m May 22, 2024 20:14
Copy link

3108 tests run: 2982 passed, 0 failed, 126 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_timeline_deletion_with_files_stuck_in_upload_queue: debug

Postgres 15

  • test_download_remote_layers_api: debug

Code coverage* (full report)

  • functions: 31.4% (6450 of 20515 functions)
  • lines: 48.3% (49849 of 103247 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
3aaef36 at 2024-05-22T21:02:05.315Z :recycle:

@jcsp jcsp merged commit a43a1ad into main May 23, 2024
67 checks passed
@jcsp jcsp deleted the jcsp/secondary-command-hangs branch May 23, 2024 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants