chore(pageserver): warn on delete non-existing file #7847

Merged
skyzh merged 1 commit into main from skyzh/warn-instead-of-err-aux-remove
May 30, 2024

Conversation

skyzh
Member

@skyzh skyzh commented May 22, 2024

Problem

Consider the following migration sequence:

1. user starts compute
2. force migrate to v2
3. user continues to write data

At step (3), the compute node is not aware that the pageserver no longer contains replication state, and it might continue to ingest neon-file records into the safekeeper. This leaves the pageserver storing a partial replication state and causes errors. For example, the compute could issue a deletion of an aux file that existed in v1 but does not exist in v2. Therefore, we should ignore these errors (logging a warning instead) until everyone is migrated to v2.

Also note that if we see this warning in prod, it is likely because we did not fully suspend users' compute when flipping the v1/v2 flag.
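
A minimal sketch of the behavior described above, not the actual pageserver code: `AuxFiles`, `RemoveOutcome`, and the in-memory map are hypothetical stand-ins used only for illustration. The idea is that removing a file that is not present records a warning and continues, rather than returning an error that aborts ingestion.

```rust
use std::collections::HashMap;

/// Result of a removal request (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
enum RemoveOutcome {
    /// The file existed and was removed.
    Removed,
    /// The file was absent; a warning was recorded and the request ignored.
    MissingIgnored,
}

/// Hypothetical in-memory stand-in for the aux file directory.
struct AuxFiles {
    files: HashMap<String, Vec<u8>>,
    /// Collected warning messages, kept so the behavior can be inspected.
    warnings: Vec<String>,
}

impl AuxFiles {
    fn new() -> Self {
        AuxFiles {
            files: HashMap::new(),
            warnings: Vec::new(),
        }
    }

    /// Insert or overwrite a file.
    fn put(&mut self, path: &str, content: &[u8]) {
        self.files.insert(path.to_string(), content.to_vec());
    }

    /// Remove a file. A missing file is not an error: it is logged as a
    /// warning, because the compute may still reference files from v1.
    fn remove(&mut self, path: &str) -> RemoveOutcome {
        match self.files.remove(path) {
            Some(_) => RemoveOutcome::Removed,
            None => {
                let msg = format!("removing non-existing aux file: {path}");
                eprintln!("WARN {msg}");
                self.warnings.push(msg);
                RemoveOutcome::MissingIgnored
            }
        }
    }
}
```

In this sketch, calling `remove` twice on the same path returns `Removed` the first time and `MissingIgnored` the second, with one entry appended to `warnings`. Returning a distinct outcome instead of silently succeeding keeps the anomaly visible, which matters here because the warning is a signal that a compute was not fully suspended during the flag flip.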

Summary of changes

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@skyzh skyzh requested a review from a team as a code owner May 22, 2024 19:00
@skyzh skyzh requested a review from koivunej May 22, 2024 19:00
github-actions bot commented May 22, 2024

3150 tests run: 3017 passed, 0 failed, 133 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 31.4% (6492 of 20672 functions)
  • lines: 48.4% (50207 of 103759 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
913c1d4 at 2024-05-30T14:54:51.223Z :recycle:

@skyzh skyzh requested a review from arpad-m May 23, 2024 14:46
@skyzh skyzh force-pushed the skyzh/warn-instead-of-err-aux-remove branch from a23788d to 6c2da64 Compare May 23, 2024 16:40
Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh skyzh force-pushed the skyzh/warn-instead-of-err-aux-remove branch from 6c2da64 to 913c1d4 Compare May 30, 2024 14:02
@skyzh skyzh enabled auto-merge (squash) May 30, 2024 14:05
@skyzh skyzh merged commit f20a9e7 into main May 30, 2024
58 checks passed
@skyzh skyzh deleted the skyzh/warn-instead-of-err-aux-remove branch May 30, 2024 14:45
a-masterov pushed a commit that referenced this pull request Jun 3, 2024