Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: cleanup of layers from the future can race with their re-creation #5890

Merged
merged 9 commits into from
Nov 23, 2023

Conversation

problame
Copy link
Contributor

@problame problame commented Nov 21, 2023

fixes #5878
obsoletes #5879

Before this PR, it could happen that load_layer_map schedules removal of the future
image layer. Then a later compaction run could re-create the same image layer, scheduling a PUT.
Due to lack of an upload queue barrier, the PUT and DELETE could be re-ordered.
The result was IndexPart referencing a non-existent object.

Summary of changes

  • Add support to pagectl / Python tests to decode IndexPart
    • Rust
      • new pagectl Subcommand
      • IndexPart::{from,to}_s3_bytes() methods to internalize knowledge about encoding of IndexPart
    • Python
      • new NeonCli subclass
  • Add regression test
    • Rust
      • Ability to force repartitioning; required to ensure image layer creation at last_record_lsn
    • Python
      • The regression test.
  • Fix the issue
    • Insert an UploadOp::Barrier after scheduling the deletions.

Copy link

github-actions bot commented Nov 21, 2023

2400 tests run: 2272 passed, 0 failed, 128 skipped (full report)


Flaky tests (8)

Postgres 16

  • test_import_from_pageserver_small: debug
  • test_ondemand_download_timetravel[mock_s3]: debug
  • test_emergency_mode: release

Postgres 15

  • test_gc_cutoff: debug
  • test_timeline_deletion_with_files_stuck_in_upload_queue: release
  • test_remote_storage_backup_and_restore[False-real_s3]: debug
  • test_restarts_frequent_checkpoints: release

Postgres 14

  • test_timeline_delete_works_for_remote_smoke[real_s3]: debug

Code coverage (full report)

  • functions: 54.2% (8966 of 16549 functions)
  • lines: 81.5% (52289 of 64160 lines)

The comment gets automatically updated with the latest test results
f8dd44b at 2023-11-23T12:34:07.482Z :recycle:

@problame problame changed the title Problame/fix-deletion-of-future-image-layers fix: future layer deletion can race with re-creation Nov 21, 2023
@problame problame changed the title fix: future layer deletion can race with re-creation fix: cleanup of layers from the future can race with their re-creation Nov 21, 2023
@problame problame force-pushed the problame/fix-deletion-of-future-image-layers branch from ec516f1 to d8c9e3c Compare November 21, 2023 17:18
@problame problame force-pushed the problame/fix-deletion-of-future-image-layers branch from d8c9e3c to 38650b1 Compare November 21, 2023 18:24
@problame problame marked this pull request as ready for review November 21, 2023 18:46
@problame problame requested a review from a team as a code owner November 21, 2023 18:46
@problame problame requested review from hlinnaka, koivunej and arpad-m and removed request for a team and hlinnaka November 21, 2023 18:46
* make clearer why pagectl command instead of reading index part directly
    * technically this PR doesn't need it but I think it's good to have
      it in place
* no mtime instead of inode number to detect overwrites
* use log scraping to detect that we arrived at failpoint
* adjust comments to be less reproducer-y
* additional check for overwrite at end of test to ensure the PUT won

Tested that without the fix commit, it still reproduces.
Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oki now I found the st_mtime_ns changes as well, probably just did not see them. I'm fine with them or the original inodes.

Please do reconsider turning the if file_mismatch: log + break into assert as I suggested in the unresolved thread, I at least cannot see an upside from tolerating this.

pageserver/src/http/routes.rs Outdated Show resolved Hide resolved
pageserver/src/http/routes.rs Outdated Show resolved Hide resolved
pageserver/src/tenant/timeline.rs Outdated Show resolved Hide resolved
Copy link
Member

@arpad-m arpad-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, this is a really great PR

@problame problame enabled auto-merge (squash) November 22, 2023 16:30
@arpad-m
Copy link
Member

arpad-m commented Nov 22, 2023

CI failure is legit. Black wants this change:

-    time.sleep(1.1) # so that we can use change in pre_stat.st_mtime to detect overwrites
+    time.sleep(1.1)  # so that we can use change in pre_stat.st_mtime to detect overwrites

@problame problame merged commit a0e6114 into main Nov 23, 2023
42 checks passed
@problame problame deleted the problame/fix-deletion-of-future-image-layers branch November 23, 2023 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

future layer deletion can race with re-creation
3 participants