error: could not read block xx in rel xx from pageserver at lsn xx #2257

Open
surquer opened this issue Aug 12, 2022 · 7 comments

Comments

@surquer

surquer commented Aug 12, 2022

After deploying Neon by following the steps in README.md, I ran pgbench to test performance and got the following error:
(screenshot: neon_error)

@SomeoneToIgnore
Contributor

It's hard to say for sure what happened without seeing the pageserver and safekeeper logs, ideally the etcd logs as well (you can find those in the .neon/ directory), and the workload details.

My guess is that the pgbench test you launched produced large writes that made the pageserver's walreceiver connection flicker, so it failed to receive WAL in time.
There's a patch, #2253, that aims to fix most such problems; it might be worth trying if you can and want to.

@petuhovskiy
Member

> I would guess, that the pgbench test you've launched might have caused big writes which caused pageserver's walreceiver connection to flicker, hence failing to receive WAL timely.

That's a plausible theory, but it could also be our usual backpressure issue (see #2028). We've already seen "Timed out while waiting for WAL record" errors when running benchmarks, so it seems to still be a relevant issue.

@hlinnaka
Contributor

Also for performance testing, make sure you're building in release mode:

BUILD_TYPE=release make -j`nproc`

and use the binaries from target/release instead of target/debug. The debug-mode binaries are significantly slower.

@surquer
Author

surquer commented Aug 15, 2022

> Also for performance testing, make sure you're building in release mode:
>
> BUILD_TYPE=release make -j`nproc`
>
> and use the binaries from target/release instead of target/debug. The debug-mode binaries are significantly slower.

Yes, it was tested with the release build.

@surquer
Author

surquer commented Aug 15, 2022

> > I would guess, that the pgbench test you've launched might have caused big writes which caused pageserver's walreceiver connection to flicker, hence failing to receive WAL timely.
>
> That's a plausible theory, but it also could be our usual backpressure issue (see #2028). We've already noticed Timed out while waiting for WAL record when running benchmarks, and it seems that it's still a relevant issue.

Yes, it happened during a mixed read/write test with many reads and writes running at the same time.

@surquer
Author

surquer commented Aug 16, 2022

> Hard to say for sure what happened without seeing pageserver and safekeeper logs, ideally etcd logs either (you can find those in .neon/ directory ) and the workload details.
>
> I would guess, that the pgbench test you've launched might have caused big writes which caused pageserver's walreceiver connection to flicker, hence failing to receive WAL timely. There's a patch #2253 aiming to fix most of such problems, might be good to check that if you can and want to.

I updated the code and tested again; the problem still exists. There are many "Timed out while waiting for WAL record at" messages in pageserver.log.

pageserver.zip
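
For anyone trying to reproduce this, a rough way to quantify "many" is to count the timeout lines in the pageserver log. The following is only a sketch: the helper name `count_wal_timeouts` is hypothetical, and the log path in the usage line is an assumption based on the .neon/ directory mentioned earlier in the thread.

```shell
# Hypothetical helper (not part of neon): count WAL-wait timeouts in a log file.
# $1: path to the pageserver log (the exact location is an assumption).
count_wal_timeouts() {
    # -F treats the message as a fixed string; -c prints the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps this usable under "set -e".
    grep -c -F "Timed out while waiting for WAL record" "$1" || true
}

# Usage (path is an assumption):
#   count_wal_timeouts .neon/pageserver.log
```

Comparing the count before and after applying a patch such as #2253 gives a quick, if crude, signal of whether the change helped.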

@knizhnik
Contributor

Which backpressure settings are you using?
If it is the default max_replication_write_lag=500MB, then such wait timeout errors are expected (or at least known) behavior.
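
One possible way to answer that question is to ask the compute node directly. This is a sketch under assumptions: it presumes `max_replication_write_lag` can be read with a plain `SHOW` on the compute Postgres, the helper name `show_write_lag` is hypothetical, and the connection string in the usage line is a placeholder to replace with your own.

```shell
# Hypothetical helper (not part of neon): print the current max_replication_write_lag.
# $1: libpq connection string for the compute node (placeholder; use your own).
show_write_lag() {
    # -A: unaligned output, -t: tuples only, so just the value is printed.
    psql -d "$1" -At -c "SHOW max_replication_write_lag;"
}

# Usage (connection details are assumptions):
#   show_write_lag "host=127.0.0.1 port=55432 dbname=postgres"
```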
