-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix and add test case for issue 7791 #7795
Conversation
3102 tests run: 2975 passed, 0 failed, 127 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
1a3337c at 2024-05-22T12:58:52.541Z :recycle: |
Blocked by minor version upgrade (should be solved by now). |
Ok, the test is now passing with this fix, and the fix from #7803. I verified that #7794 now also passes this test. #7794 does more or less the same as this. The differences are:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the fix overall looks good to me. But it leaves me questioning the design of locking pages and applying changes one at a time. I guess that's inside Postgres, not really under our control
I'd think that we'd just want these replays to happen atomically across all pages. In other words, what makes more sense to me is to lock all the pages first, then apply all changes, and then I think this sort of problem just isn't possible by construction? Definitely could be missing something though.
Postgres redo functions do two-phase locking of the pages, ie. they don't release any of the locks until all the pages have been locked and modified. So the changes do appear atomic to any other backend that's looking. We are violating that in neon with the hack that we skip the WAL replay of pages that are not currently in the buffer cache. That's why we need the XLogWaitForReplayOf() call. That feels risky to me. I don't see any immediate bug with that, but I think it would be better if the redo process inserted an entry to the buffer manager for the page, without doing the I/O to read it in, and held the lock on the buffer as usual until the whole record has been replayed. That would require more invasive changes to the Postgres buffer manager code, however. |
Sorry, I wrote that comment before I had actually pushed that change to this branch. Pushed it now. |
The code was working correctly, but was incorrectly using Buffer for a 0-based index into the BufferDesc array.
Don't set last-written LSN of a page when the record is replayed, only when the page is evicted from cache. For comparison, we don't update the last-written LSN on every page modification on the primary either, only when the page is evicted. Do update the last-written LSN when the page update is skipped in WAL redo, however. In neon_get_request_lsns(), don't be surprised if the last-written LSN is equal to the record being replayed. Use the LSN of the record being replayed as the request LSN in that case. Add a long comment explaining how that can happen. In neon_wallog_page, update last-written LSN also when Shutdown has been requested. We might still fetch and evict pages for a while, after shutdown has been requested, so we better continue to do that correctly. Enable the check that we don't evict a page with zero LSN also in standby, but make it a LOG message instead of PANIC Fixes issue #7791
Don't set last-written LSN of a page when the record is replayed, only when the page is evicted from cache. For comparison, we don't update the last-written LSN on every page modification on the primary either, only when the page is evicted. Do update the last-written LSN when the page update is skipped in WAL redo, however. In neon_get_request_lsns(), don't be surprised if the last-written LSN is equal to the record being replayed. Use the LSN of the record being replayed as the request LSN in that case. Add a long comment explaining how that can happen. In neon_wallog_page, update last-written LSN also when Shutdown has been requested. We might still fetch and evict pages for a while, after shutdown has been requested, so we better continue to do that correctly. Enable the check that we don't evict a page with zero LSN also in standby, but make it a LOG message instead of PANIC Fixes issue #7791
This is a test case to repro issue #7791, and a fix.
Sometimes this fails with this assertion, which seems consistent with my hypothesis at #7791 (comment):
But other times this hits the same assertion in the startup process: