Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(timeline_detach_ancestor): success idempotency #8354

Merged
merged 26 commits into from
Jul 15, 2024

Conversation

koivunej
Copy link
Member

@koivunej koivunej commented Jul 11, 2024

Right now timeline detach ancestor reports an error (409, "no ancestor") on a new attempt after successful completion. This makes it troublesome for storage controller retries. Fix it to respond with 200 OK as if the operation had just completed quickly.

Additionally, the returned timeline identifiers in the 200 OK response are now ordered so that responses between different nodes for error comparison are done by the storage controller added in #8353.

Design-wise, this PR introduces a new strategy for accessing the latest uploaded IndexPart: RemoteTimelineClient::initialized_upload_queue(&self) -> Result<UploadQueueAccessor<'_>, NotInitialized>. It should be a more scalable way to query the latest uploaded IndexPart than to add a query method for each question directly on RemoteTimelineClient.

GC blocking will need to be introduced to make the operation fully idempotent. However, it is idempotent for the cases demonstrated by tests.

Cc: #6994

Copy link

github-actions bot commented Jul 11, 2024

3139 tests run: 3018 passed, 0 failed, 121 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_pg_regress[4]: debug

Postgres 14

Code coverage* (full report)

  • functions: 32.7% (6980 of 21377 functions)
  • lines: 50.0% (54992 of 109897 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
a3533f4 at 2024-07-15T17:55:16.606Z :recycle:

@koivunej
Copy link
Member Author

  • test_timeline_ancestor_detach_errors[True]: release

The flakyness was added first in the previous PR but only fixed on this one, so hopefully I can merge these quite soon to each other.

pageserver/src/http/routes.rs Show resolved Hide resolved
pageserver/src/http/routes.rs Show resolved Hide resolved
pageserver/src/tenant/remote_timeline_client/index.rs Outdated Show resolved Hide resolved
pageserver/src/tenant/timeline/detach_ancestor.rs Outdated Show resolved Hide resolved
@koivunej koivunej marked this pull request as ready for review July 12, 2024 16:47
@koivunej koivunej requested a review from a team as a code owner July 12, 2024 16:47
@koivunej koivunej requested review from jcsp and removed request for a team July 12, 2024 16:47
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just nits at this point - feel free to punt where it makes sense

@koivunej koivunej self-assigned this Jul 15, 2024
Base automatically changed from joonas/sharded_timeline_detach to main July 15, 2024 15:08
basically just a newtype around the mutexguard for accessing the
latest uploaded index part.
now historically detached timelines will on retry return 200 OK, with
stable ordered reparented timelines. these timelines will be subject to
future changes (like timeline archival, or having deleted some
timelines).
unsure why we report those in text; is it difficult to show the status code in
python?
unsure if we can really cancel timelinewriter with std-fs...
@koivunej koivunej enabled auto-merge (squash) July 15, 2024 15:14
@koivunej koivunej merged commit 730db85 into main Jul 15, 2024
65 checks passed
@koivunej koivunej deleted the joonas/idempotent_tad branch July 15, 2024 17:47
problame pushed a commit that referenced this pull request Jul 22, 2024
Right now timeline detach ancestor reports an error (409, "no ancestor")
on a new attempt after successful completion. This makes it troublesome
for storage controller retries. Fix it to respond with `200 OK` as if
the operation had just completed quickly.

Additionally, the returned timeline identifiers in the 200 OK response
are now ordered so that responses between different nodes for error
comparison are done by the storage controller added in #8353.

Design-wise, this PR introduces a new strategy for accessing the latest
uploaded IndexPart:
`RemoteTimelineClient::initialized_upload_queue(&self) ->
Result<UploadQueueAccessor<'_>, NotInitialized>`. It should be a more
scalable way to query the latest uploaded `IndexPart` than to add a
query method for each question directly on `RemoteTimelineClient`.

GC blocking will need to be introduced to make the operation fully
idempotent. However, it is idempotent for the cases demonstrated by
tests.

Cc: #6994
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants