Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_vm_bit_clear_on_heap_lock: AssertionError: assert '550000000000……0000000000000'=='0000000.…0000000000000 #7984

Closed
problame opened this issue Jun 6, 2024 · 2 comments · Fixed by #7989
Assignees
Labels
a/test/flaky Area: related to flaky tests a/test Area: related to testing c/storage Component: storage t/bug Issue Type: Bug

Comments

@problame
Copy link
Contributor

problame commented Jun 6, 2024

Problem

The test_vm_bit_clear_on_heap_lock test occasionally fails with

assert vm_page_at_pageserver == vm_page_in_cache
AssertionError: assert '550000000000……0000000000000'=='0000000.…0000000000000
...
test_runner/regress/test_vm_bits.py:185: AssertionError

image

Example Allure report that encountered the issue: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7979/9395161864/index.html#testresult/604af091fcc61d51/retries

Initial Analysis

(by @hlinnaka )

I think this is indeed an orthogonal issue. Hypothesis for how this happens:

  1. When we run the VACUUM (FREEZE, DISABLE_PAGE_SKIPPING true) vmtest_lock command, autovacuum/autoanalyze is running in background and has an active snapshot, which prevents VACUUM FREEZE from freezing the rows. That’s how you sometimes get an all-zeros VM page.

  2. Between the calls to get the page from cache, and from the pageserver, autovacuum runs and has vacuumed just a few of the pages.

If that hypothesis holds, this is a case of an unstable test, not a bug as such. We could disable autovacuum for this first part of the test (the second part requires it to be on, though) to make it more stable.

I think you can see this failure more often if you add a sleep between the get_raw_page and get_raw_page_at_lsn calls.

I think we should make the test more robust so that it actually sets all the VM bits in step 1. That would make the hack to ignore the page header unnecessary, too. Maybe check that the VM bits are set and issue the VACUUM again. Or maybe disabling autovacuum is enough to stabilize it.

Distinction From #6967

This flakiness was discovered while investigating #6967 but is believed to be orthogonal.
See https://www.notion.so/neondatabase/2024-06-05-vm-flakiness-investigation-ee3ed981aa974635bf475a7e5aebe19e?pvs=4#64d17080b88e4c3e8204f79dd337627f

@problame problame added t/bug Issue Type: Bug a/test Area: related to testing c/storage/wal c/storage Component: storage a/test/flaky Area: related to flaky tests labels Jun 6, 2024
@hlinnaka hlinnaka self-assigned this Jun 6, 2024
@hlinnaka
Copy link
Contributor

hlinnaka commented Jun 6, 2024

So, I believe the pageserver assertion failure is a pageserver bug. I don't know why this particular test triggers it, but I'm glad it did. Once we have hunted down that bug, hopefully we can have a more targeted regression test. But there are remaining flakiness issues with the test itself. In particular, in the first part that compares the in-cache page version with the version it gets from pageserver. Depending on timing, when autovacuum runs etc, it can legitimately get different versions. To make the test less flaky, some suggestions:

  • Split the first and second parts of the test to two separate tests
  • In the first test, disable the aggressive GC, compaction, and autovacuum. They are only needed by the second test.
  • I'd like to get the first test to a point that the VM page is never all-zeros. Disabling autovacuum in the first test is hopefully enough to accomplish that. But if not, we can explicitly check for that, and e.g. retry the VACUUM, until the VM bits are set.
  • Compare the full page images, don't skip the LSN and page header. After fixing the previous point, there should be no discrepancy.

hlinnaka added a commit that referenced this issue Jun 6, 2024
- Split the first and second parts of the test to two separate tests

- In the first test, disable the aggressive GC, compaction, and
  autovacuum. They are only needed by the second test. I'd like to get
  the first test to a point that the VM page is never all-zeros.
  Disabling autovacuum in the first test is hopefully enough
  to accomplish that.

- Compare the full page images, don't skip page header. After fixing
  the previous point, there should be no discrepancy. LSN still won't
  match, though, because of commit 387a368.

Fixes issue #7984
hlinnaka added a commit that referenced this issue Jun 6, 2024
- Split the first and second parts of the test to two separate tests

- In the first test, disable the aggressive GC, compaction, and
  autovacuum. They are only needed by the second test. I'd like to get
  the first test to a point that the VM page is never all-zeros.
  Disabling autovacuum in the first test is hopefully enough
  to accomplish that.

- Compare the full page images, don't skip page header. After fixing
  the previous point, there should be no discrepancy. LSN still won't
  match, though, because of commit 387a368.

Fixes issue #7984
hlinnaka added a commit that referenced this issue Jun 12, 2024
- Split the first and second parts of the test to two separate tests

- In the first test, disable the aggressive GC, compaction, and
autovacuum. They are only needed by the second test. I'd like to get the
first test to a point that the VM page is never all-zeros. Disabling
autovacuum in the first test is hopefully enough to accomplish that.

- Compare the full page images, don't skip page header. After fixing the
previous point, there should be no discrepancy. LSN still won't match,
though, because of commit 387a368.

Fixes issue #7984
@problame problame linked a pull request Jun 12, 2024 that will close this issue
@problame
Copy link
Contributor Author

PR #7989 got merged, we believe it fixed this flakiness.

Other flakiness is still prevalent: #6967

VladLazar pushed a commit that referenced this issue Jun 12, 2024
- Split the first and second parts of the test to two separate tests

- In the first test, disable the aggressive GC, compaction, and
autovacuum. They are only needed by the second test. I'd like to get the
first test to a point that the VM page is never all-zeros. Disabling
autovacuum in the first test is hopefully enough to accomplish that.

- Compare the full page images, don't skip page header. After fixing the
previous point, there should be no discrepancy. LSN still won't match,
though, because of commit 387a368.

Fixes issue #7984
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/test/flaky Area: related to flaky tests a/test Area: related to testing c/storage Component: storage t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants