-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(scrubber): more robust metadata consistency checks #8344
Conversation
3126 tests run: 3005 passed, 0 failed, 121 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
a6cfed9 at 2024-07-22T13:48:58.375Z :recycle: |
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
53c814c
to
455c5d6
Compare
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
7c988e0
to
eaf3a3b
Compare
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did a quick review, have you tried running this with staging pageserver? I assume this process is read-only and it should be safe to run it.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Tested on staging, logged warnings and errors as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, couple of minor comments.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
The histograms in the metadata summary were added experimentally. We will remove them since we can get most of the insights from prometheus stats. Also reduce noise when maximum value go out of the range. --------- Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Part of #8128.
Problem
Scrubber uses
scan_metadata
command to flag metadata inconsistencies. To trust it at scale, we need to make sure the errors we emit is a reflection of real scenario. One check performed in the scrubber is to see whether layers listed in the latestindex_part.json
is present in object listing. Currently, the scrubber does not robustly handle the case where objects are uploaded/deleted during the scan. Below is some traces on possible race conditions that produces false positives.Case 1: Layer insertion
In the above example, B is actually present on S3. To resolve this, we can issue a
HeadObject
to the missing layers and see if they are actually present.Case 2: Layer deletion (impossible)This case is made impossible by always read the tenant objects listing first before downloading the index.
Case 3: Fast L0 replacement
This is very unlikely but theoretically possible. Instead of rescanning the index, the implementation will emit the error with a caveat indicating the missing layer is L0.
Summary of changes
Condition for success: An object in the index is (1) in the object listing we acquire from S3 or (2) found in a HeadObject request (new object).
HeadObject
requests for the layers missing from the object listing.deleted_at
tombstone is marked inindex_part.json
.Misc
Caveat
Checklist before requesting a review
Checklist before merging