-
Notifications
You must be signed in to change notification settings - Fork 417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: cleanup of layers from the future can race with their re-creation #5890
Conversation
2400 tests run: 2272 passed, 0 failed, 128 skipped (full report)Flaky tests (8)Postgres 16
Postgres 15
Postgres 14
Code coverage (full report)
The comment gets automatically updated with the latest test results
f8dd44b at 2023-11-23T12:34:07.482Z :recycle: |
ec516f1
to
d8c9e3c
Compare
d8c9e3c
to
38650b1
Compare
* make clearer why pagectl command instead of reading index part directly * technically this PR doesn't need it but I think it's good to have it in place * no mtime instead of inode number to detect overwrites * use log scraping to detect that we arrived at failpoint * adjust comments to be less reproducer-y * additional check for overwrite at end of test to ensure the PUT won Tested that without the fix commit, it still reproduces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oki now I found the st_mtime_ns
changes as well, probably just did not see them. I'm fine with them or the original inodes.
Please do reconsider turning the if file_mismatch: log + break
into assert as I suggested in the unresolved thread, I at least cannot see an upside from tolerating this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, this is a really great PR
CI failure is legit. Black wants this change: - time.sleep(1.1) # so that we can use change in pre_stat.st_mtime to detect overwrites
+ time.sleep(1.1) # so that we can use change in pre_stat.st_mtime to detect overwrites |
fixes #5878
obsoletes #5879
Before this PR, it could happen that
load_layer_map
schedules removal of the futureimage layer. Then a later compaction run could re-create the same image layer, scheduling a PUT.
Due to lack of an upload queue barrier, the PUT and DELETE could be re-ordered.
The result was IndexPart referencing a non-existent object.
Summary of changes
pagectl
/ Python tests to decodeIndexPart
pagectl
SubcommandIndexPart::{from,to}_s3_bytes()
methods to internalize knowledge about encoding ofIndexPart
NeonCli
subclassUploadOp::Barrier
after scheduling the deletions.