Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: EphmeralFiles can outlive their Timeline, new gate version #8332

Closed
wants to merge 7 commits into from

Conversation

koivunej
Copy link
Member

Ephemeral files cleanup on drop but did not delay shutdown, leading to problems with restarting the tenant. The solution is:

  • make ephemeral files carry the gate guard to delay Timeline::shutdown
  • flush in-memory layers and strong references to those on Timeline::shutdown
  • add a different gate LayerManager::ephmeral_files which is closed after all Timeline operations are closed

Fixes: #7830

@koivunej koivunej requested review from jcsp and problame July 10, 2024 08:26
@koivunej
Copy link
Member Author

The two-gate idea comes from @jcsp following the larger restructuring on #8229. Compared to the original solution where we track whether LayerManager is open or closed (shutdown), this version:

  • has a lot less changed lines (though some are "extra" on 8229)
  • now instead of type system we get the same safety from the runtime state of the two gates

@koivunej
Copy link
Member Author

Will push the doc fix after regress tests.

Copy link

github-actions bot commented Jul 10, 2024

3042 tests run: 2926 passed, 1 failed, 115 skipped (full report)


Failures on Postgres 14

  • test_pg_regress[4]: debug
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_pg_regress[debug-pg14-4]"
Flaky tests (1)

Postgres 16

  • test_isolation[None]: debug

Test coverage report is not available

The comment gets automatically updated with the latest test results
607c044 at 2024-07-10T10:50:17.257Z :recycle:

Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now instead of type system we get the same safety from the runtime state of the two gates

I don't get how we get any of the safety I asked for in #8229 (comment)

@koivunej
Copy link
Member Author

koivunej commented Jul 10, 2024

now instead of type system we get the same safety from the runtime state of the two gates

I don't get how we get any of the safety I asked for in #8229 (comment)

Earlier comment:

By simply taking out the in-memory layers, Timeline::get will be giving out wrong answers because we're not resetting last_record_lsn.

The only thing that prevents us from actually giving out the wrong answer is that we check for the cancellation token. Too brittle for my taste.

The two gates is the answer:

  1. Timeline::cancel is cancelled
  2. Timeline::gate is closed
    • everything except ephemeral files (should) use that
  3. After that we drop the in-memory layers and close the added LayerManager::ephmeral_files: Gate

I couldn't spot any dire implementation error on my part, so could you please be more specific in case your question was not of general nature...?

@problame
Copy link
Contributor

I think we just disagree on whether we need robustness in depth or not.
This PR isn't implementing the robustness in depth that I was asking for in #8229 (comment) (enum LayerMap { ... }).

Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Joonas and I had a call and he explained this PR to me.

It's a clever idea with few LoC but I would much prefer the robustness-in-depth that we gain from an explicitly tracked shutdown state.

See also my comment here: #8229 (comment)

@koivunej
Copy link
Member Author

koivunej commented Jul 15, 2024

Per Christians notes above, closing this down. Risk of no longer reading a should-be-present in-memory layer could cause us a lot of corruption.

@koivunej koivunej closed this Jul 15, 2024
@koivunej koivunej deleted the joonas/carry_inmem_gateguard_take2 branch July 15, 2024 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

timeline ancestor detach: race on restart with ephmeral file cleanup
2 participants