Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use fsync+rename for atomic downloads from remote storage #1571

Merged
merged 2 commits into from
Apr 29, 2022

Conversation

LizardWizzard
Copy link
Contributor

@LizardWizzard LizardWizzard commented Apr 25, 2022

The problem I have is that we need to build pageserver with the failpoint feature enabled. Probably we do not need that for production builds, do we? I plan to enable it for debug builds only so it can be used in tests properly

I guess existing failpoint invocation in test_backpressure is inactive

Copy link
Contributor

@SomeoneToIgnore SomeoneToIgnore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks simpler than I've anticipated, great.
Not sure how to help with failpoints detection, I think it's ok for now.

pageserver/src/page_service.rs Outdated Show resolved Hide resolved
pageserver/src/remote_storage/storage_sync.rs Outdated Show resolved Hide resolved
pageserver/src/remote_storage/storage_sync/download.rs Outdated Show resolved Hide resolved
pageserver/src/remote_storage/storage_sync/download.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@hlinnaka hlinnaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem I have is that we need to build pageserver with the failpoint feature enabled. Probably we do not need that for production builds, do we? I plan to enable it for debug builds only so it can be used in tests properly

I'm cool with always compiling with failpoint support, even for production. They're basically free when not enabled.

I guess existing failpoint invocation in test_backpressure is inactive

Oh... Can you check what's going on and open a separate issue or something, please?

pageserver/src/remote_storage/storage_sync/download.rs Outdated Show resolved Hide resolved
@LizardWizzard
Copy link
Contributor Author

@hlinnaka

They're basically free when not enabled.

Hmm, I do not think so. If failpoint support is enabled for binary it performs lookup in a global hashmap to check whether it needs to trigger a particular failpoint or not.
See https://github.com/tikv/fail-rs/blob/master/src/lib.rs#L637 and https://github.com/tikv/fail-rs/blob/master/src/lib.rs#L472

It shouldn't affect things too much until we do not place failpoints in hot code paths. I think enabling only for debug builds is a safe bet. What do you think?

Can you check what's going on and open a separate issue or something, please?

If I try to run this test on this branch with default make on this branch I got:
Cannot manage failpoints because pageserver was compiled without failpoints support

There are several options:

  • The easiest way is to just enable failpoints everywhere.
  • Otherwise usage of failpoints in tests will need some coordination with binary's build type. E g do not run failpoint tests for release builds which is sad.
  • Or we need to build several binaries with/without failpoints enable
  • We can add a new config variable and wrap each failpoint in a conditional around it so even if binary is compiled with failpoint support it will check for config first and then dive into failpoint crate internals.

Personally, I'm for the simplest option :) We do not use failpoint in hot places, and we'll see it in flame graphs if it'll become noticeable. Thoughts?

@hlinnaka
Copy link
Contributor

hlinnaka commented Apr 25, 2022

  • We can add a new config variable and wrap each failpoint in a conditional around it so even if binary is compiled with failpoint support it will check for config first and then dive into failpoint crate internals.

Yeah. I'm surprised 'fail-rs' doesn't do that itself.

Personally, I'm for the simplest option :) We do not use failpoint in hot places, and we'll see it in flame graphs if it'll become noticeable. Thoughts?

+1

Copy link
Contributor

@SomeoneToIgnore SomeoneToIgnore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spotted a few unnecessary nits, but looks nice now.

} else {
timeline_files.insert(entry_path);
}
}
}

// FIXME (rodionov) if attach call succeeded, and then pageserver is restarted before download is completed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to ensure that no empty dirs present (where empty == dirs without metadata file and at least one layer) and remove such on startup, to not to conflict with the tenant init code (that will error on such case otherwise).

But not needed to be implemented now.

pageserver/src/remote_storage/storage_sync.rs Show resolved Hide resolved
pageserver/Cargo.toml Show resolved Hide resolved
pageserver/src/remote_storage/storage_sync/download.rs Outdated Show resolved Hide resolved
pageserver/src/remote_storage/storage_sync/download.rs Outdated Show resolved Hide resolved
pageserver/src/remote_storage/storage_sync/download.rs Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use fsyncs/durable rename to protect downloaded files/metadata. Think about edge cases
3 participants