Add wal backpressure performance tests #1891
Conversation
Preliminary results:
There is a noticeable performance drop when …
First of all, I noticed that throttling is rarely triggered by this test (with the default parameters).
I never saw the "backpressure throttling" backend status (but I have to say that throttling does sometimes happen, because once I added the corresponding trace, I saw such messages in the log).
Maybe we should increase the write workload (produce more WAL) or adjust the default parameters for this test.
Second, I saw three warnings:
Command '['/home/knizhnik/zenith/tmp_install/bin/pgbench', '-s1000', '-i', 'host=localhost port=5432 dbname=postgres options=-cstatement_timeout=120s']' returned non-zero exit status 1.
I wonder if it indicates some error...
I saw that too. It's because a query's timeout is set to be … Edit: I only see that log when increasing the test's parameters. I guess the query times out because the test runs faster on my machine than on yours.
Yeah, we should adjust the default parameters to make the throttling happen more often.
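As an illustration of the kind of adjustment being discussed, here is a minimal sketch of a heavier pgbench write workload; the scale factor, client count, and duration are illustrative guesses, not values taken from this PR.

```python
import subprocess

def run_heavy_write_workload(connstr: str, pgbench: str = "pgbench"):
    # Bigger scale factor -> larger tables and more WAL produced during initialization.
    subprocess.run([pgbench, "-i", "-s", "50", connstr], check=True)
    # Many concurrent writers in simple-update mode for several minutes,
    # to keep the safekeepers ahead of the pageserver and trigger throttling.
    subprocess.run([pgbench, "-c", "16", "-j", "4", "-N", "-T", "300", connstr], check=True)
```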
Can you share the modifications you made? It would be helpful to use them as a reference when making changes.
It could be. Where do you run these tests? I suggest running some benchmarks to understand how fast your disk is.
I run them locally on my Mac M1 Air using a release build.
On my laptop the test does not complete in 20 minutes. Also, printing the avg/stddev of the read request delay does not seem to be enough. It would be much more interesting to see a histogram, or at least the maximal value.
I do not think so. With a large number of small update transactions (…). Actually, to trigger backpressure we should make the safekeepers work faster and the pageserver slower. It can be achieved by setting fsync=off for the safekeeper (disabling fsyncs at the pageserver is not currently supported; they are always performed). But if the safekeeper and the pageserver are working on the same node (as in the case of local tests), then fsyncs performed by the safekeeper may significantly affect total system performance. On my laptop, intensive disk writes may cause several-second delays for disk operations (reads and writes). On Mac OS X, fsync is very expensive, so here the picture can be even more complicated and obscure.
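A hedged sketch of what "fsync=off for the safekeeper" could look like in a test, assuming the fixture exposes a switch for safekeeper fsync; the fixture and attribute names below are assumptions, not necessarily the exact API in this repository.

```python
def test_backpressure_fast_safekeeper(neon_env_builder):
    # Assumption: the env builder has a knob to disable safekeeper fsync, making the
    # safekeepers faster relative to the pageserver so throttling triggers sooner.
    neon_env_builder.num_safekeepers = 1
    neon_env_builder.safekeepers_enable_fsync = False

    env = neon_env_builder.init_start()
    pg = env.postgres.create_start("main")
    # ... run the write-heavy workload against `pg` and measure read latency ...
```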
I just added the corresponding logging to the handling of replication lag in …
The size of the foo table is ~3MB. With shared_buffers=1MB (the default value) it doesn't fit in compute shared memory, and so …
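For reference, a minimal sketch of starting the compute with a larger shared_buffers so a ~3MB table fits in shared memory; the `config_lines` parameter is an assumption about the test fixture API rather than something stated in this thread.

```python
def start_compute_with_bigger_cache(env, branch: str = "main"):
    # Assumption: extra postgresql.conf lines can be passed via `config_lines`.
    return env.postgres.create_start(
        branch,
        config_lines=["shared_buffers=64MB"],  # test default is only 1MB
    )
```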
Looks like you're running the old version of the tests. Can you update the branch and try again? I have removed the vacuum step when doing … I also added the maximum read latency as a reported value. Agreed that it would be much better to have some kind of visualization.
That makes sense and also explains the performance difference when running with …
I also noticed a significant difference with fsync on/off for pgbench -i.
It looks like … New result (from https://app.circleci.com/pipelines/github/neondatabase/neon/7033/workflows/36dcb168-3213-45c5-9820-68434997d4a9/jobs/72231)
Yeah, looks like the only place where there is a significant difference between …
Important note: the new tests in this PR (with default parameters) will make …
Force-pushed from d5917b1 to d48141a.
The last version of test_wal_backpressure from the thang/add-wal-backpressure-tests branch
I wonder whether …
In this PR, I tuned down the parameters a bit and only made some structural changes. Now, most of the changes to the tests' behaviour will be in #1919. I've changed the CI workflows in that branch to have a better feedback loop. Also, I have made changes based on Heikki's suggestions and observed the write lag approach the limit quickly. The tests in that branch are quite broken now, so I will need to look into them first. Regarding the …
@knizhnik I think the tests work well now on #1919. You can try running them on that branch. An example result from https://app.circleci.com/pipelines/github/neondatabase/neon/7096/workflows/7160f1d4-5a97-40d7-9ebb-6d8b5b07c3b0/jobs/72866:
Most of the read queries have been changed to something simple (e.g., reading a single row using the primary key). So, the big read latency in some tests is probably because of the write lag.
test_wal_backpressure on the #1919 branch now consumes more than 200GB of disk space, so I can't run it locally.
But it is hard to say whether such a delay is actually caused by wait_lsn, and what the pageserver is doing during these 50 seconds. My original assumption was that a small (10MB) … The next problem, which made me look for a replacement for the current stop-and-wait throttling algorithm, was that with …
But even in this case it is not large enough. So, based on the results of these tests, there is no strong need to replace the current simple stop-and-wait throttling algorithm. What else we should try is a cache for the last written LSN (neondatabase/postgres#167).
The PR (neondatabase/postgres#167) cannot help reduce read latency in test_pgbench_simple_update_workload simply because the reader accesses the updated table. Actually, the results of this particular test on my laptop look really confusing:
Average and maximal delays with …
I understand the reasons for the large read latency in test_pgbench_simple_update_workload and other tests, and why backpressure doesn't help much:
With wal_log_hints=on
wal_log_hints=off
Is …
Good catch! I have updated the tests. In #1919, the max read latency is reduced significantly on my local machine.
I'm not so sure. In fact, with only the new WAL backpressure tests, the total test runtime is increased by ~40 min, which is extremely slow. It can probably be helped with parallelism. Edit: if a 500MB write lag is too much, we can remove those tests. However, given the fact that 500MB is the default setting in prod, we should probably wait.
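To make the trade-off above concrete, here is a hedged sketch of how the workload could be parametrized over max_replication_write_lag (the smaller test value versus the 500MB prod default); the fixture and `config_lines` API are assumptions, not a quote from this PR.

```python
import pytest

@pytest.mark.parametrize("write_lag", ["15MB", "500MB"])
def test_wal_backpressure_write_lag(neon_env_builder, write_lag):
    env = neon_env_builder.init_start()
    # Assumption: per-compute settings can be passed as extra postgresql.conf lines.
    pg = env.postgres.create_start(
        "main",
        config_lines=[f"max_replication_write_lag={write_lag}"],
    )
    # ... run the write workload and time read queries under this lag limit ...
```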
Hmm, I'm not sure I'm ready to jump to that conclusion. What is the speed at which the pageserver can process the incoming WAL? 10 MB/s? 100 MB/s? I remember we did some quick measurements a long time ago and got the number 30 MB/s. If we assume conservatively that it's actually 10 MB/s, then a 15 MB max_replication_write_lag should give a max 1.5 second delay. So why are we seeing read latency > 10 seconds for some queries? Is the pageserver getting stuck on a lock sometimes, so that it pauses processing any WAL for 5-10 seconds? Or does the query need to wait for multiple pages, so that the overall response time is > 10 seconds, even though the wait on each page is only 1.5 s? Or what's going on? It would be useful to print out the current last_received_lsn once a second during the test. You could then see the speed at which the pageserver is processing the WAL. Is it constant, or are there some pauses?
Yes
I do not think that we should perform tests with a 500MB max_replication_write_lag.
IMHO we should change the prod settings ASAP.
It greatly depends on the workload (i.e. FPI from COPY vs. individual insert records, especially when updating indexes).
We get 10 MB/sec if there is no other activity.
…then the speed can be decreased several times.
I have never seen the pageserver blocked on some lock. Most frequently I see threads blocked in write/fsync. Also, please notice that originally I saw the huge delay problem with a new client connection and the first execution of some query.
The backpressure tests are printing the lag now; adding it to the current flush LSN, we can monitor the speed at which the pageserver consumes WAL.
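Along the lines of Heikki's suggestion above, a minimal sketch of sampling the flush LSN and the pageserver-received LSN roughly once a second while the workload runs; the `backpressure_lsns()` query is an assumption about what the compute exposes, so adjust it to whatever the test framework actually provides.

```python
import threading
import time

def monitor_wal_consumption(cursor, stop: threading.Event, interval: float = 1.0):
    # Print the compute's flush LSN and the LSN(s) reported for the pageserver once
    # per interval; a stalling received LSN reveals pauses in WAL consumption.
    while not stop.is_set():
        cursor.execute("SELECT pg_current_wal_flush_lsn()")
        flush_lsn = cursor.fetchone()[0]
        cursor.execute("SELECT * FROM backpressure_lsns()")  # assumed helper function
        received = cursor.fetchone()
        print(f"flush_lsn={flush_lsn} backpressure_lsns={received}")
        time.sleep(interval)
```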
Got it. Will disable it in tests later.
I have added the code to print …
This is only true if the difference between produce speed and process speed is consistently 10MB/s. IIUC, as a read request requires the latest LSN, theoretically there can be a situation where produce speed = process speed and the lag is 14 MB. If that happens, the read request will never finish because the pageserver keeps falling behind.
A read request may require the latest LSN, but latest as of the moment the request was issued. If some WAL is produced after it, we should not wait for it. So the situation you have described is never possible. Achilles won't overtake the tortoise: the walreceiver may never catch up with the walsender without some kind of synchronization. But we are solving a different problem: read_page_at_lsn doesn't need to catch up. It just waits for the particular LSN. And even if the request has the "latest" flag, we still wait for …
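A conceptual sketch of the point above (not the real pageserver code; the names are illustrative): the LSN a read waits for is fixed when the request is issued, so WAL produced afterwards does not move the target.

```python
import time

def wait_for_lsn(get_last_received_lsn, request_lsn, poll_interval=0.01, timeout=30.0):
    # Block until the pageserver has received WAL up to request_lsn; newer WAL being
    # produced meanwhile does not raise request_lsn, so the wait is bounded by the
    # backlog that existed when the request was issued.
    deadline = time.monotonic() + timeout
    while get_last_received_lsn() < request_lsn:
        if time.monotonic() > deadline:
            raise TimeoutError(f"WAL up to {request_lsn} not received in time")
        time.sleep(poll_interval)
```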
Yeah, you're right. The requested page stream only waits for at most … Not sure if the delay is because the page service processes page stream messages sequentially (doesn't it?).
@knizhnik I think I quite get the problem now. It's true that a single … Please correct me if I misunderstand something here. In the case of … Compared to the previous commit, the read latency average for this test is reduced by 10x. However, the read latency and runtime for the …
Actually, after rerunning the test, there is not much of a difference between the two runs. I guess the previous anomaly happened because CircleCI was under a heavy load. @knizhnik if the tests' behaviours defined in #1919 look good to you, I can merge it back to this branch and we can start working on reviewing this PR and merging it into …
Or we can rebase #1919 and start from there.
Close in favour of #1919.
Resolves #1889.
This PR adds new tests to measure WAL backpressure performance under different workloads.
Changes
test_wal_backpressure.py
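For orientation, a hedged outline of what such a test does: a write-heavy workload runs in the background while a cheap read query is timed repeatedly, and the latency distribution is reported. The fixture names and queries below are illustrative, not the exact contents of test_wal_backpressure.py.

```python
import statistics
import threading
import time

def measure_read_latency(run_read_query, duration_s: float = 60.0):
    # Time a simple read roughly once per second while the writers are running.
    latencies = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        started = time.monotonic()
        run_read_query()  # e.g. fetch a single row by primary key
        latencies.append(time.monotonic() - started)
        time.sleep(1.0)
    return {
        "avg": statistics.mean(latencies),
        "stddev": statistics.pstdev(latencies),
        "max": max(latencies),
    }

def test_wal_backpressure_sketch(pg, run_write_workload):
    # Hypothetical fixtures: `pg` offers safe_psql(), `run_write_workload` drives pgbench.
    writer = threading.Thread(target=run_write_workload, daemon=True)
    writer.start()
    report = measure_read_latency(lambda: pg.safe_psql("SELECT * FROM t WHERE pk = 1"))
    print(report)
```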