
Add wal backpressure performance tests #1891

Closed · wants to merge 19 commits

Conversation

@aome510 (Contributor) commented Jun 6, 2022

Resolves #1889.

This PR adds new tests that measure WAL backpressure performance under different workloads.

Changes

  • Add new performance tests in test_wal_backpressure.py (a rough sketch of the test shape is shown below)
  • Allow the safekeeper's fsync setting to be configured when running tests
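For orientation only, here is a minimal sketch of what such a test can look like; it is not the code in this PR, and names like `pg_compare`, `safekeeper_fsync`, and `env.report` are illustrative assumptions rather than the actual fixtures.

```python
# Hypothetical sketch of one WAL-backpressure performance test.
# The fixture `pg_compare` and the `safekeeper_fsync` knob are assumptions.
import threading
import time

import pytest


@pytest.mark.parametrize("fsync", [False, True])
def test_heavy_write_workload(pg_compare, fsync):
    env = pg_compare(safekeeper_fsync=fsync)  # assumed way to toggle safekeeper fsync
    conn = env.connect()
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (key int PRIMARY KEY, value text)")

    def write_rows():
        # Writer thread: keep generating WAL so that backpressure can kick in.
        wconn = env.connect()
        wconn.autocommit = True
        wcur = wconn.cursor()
        for i in range(10):
            wcur.execute(
                "INSERT INTO t SELECT s, repeat('x', 100) FROM generate_series(%s, %s) s",
                (i * 100_000, (i + 1) * 100_000 - 1),
            )

    writer = threading.Thread(target=write_rows)
    writer.start()

    # Reader: periodically time a point read while the writer is running.
    latencies = []
    while writer.is_alive():
        t0 = time.time()
        cur.execute("SELECT * FROM t WHERE key = 0")
        cur.fetchall()
        latencies.append(time.time() - t0)
        time.sleep(1.0)
    writer.join()

    env.report("read_latency_max", max(latencies))  # assumed reporting helper
```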

@hlinnaka hlinnaka requested a review from knizhnik June 6, 2022 22:04
@aome510 (Contributor, Author) commented Jun 7, 2022

Preliminary results:

| test | run_duration (s) | read_latency_max (s) | read_latency_avg (s) | read_latency_stdev (s) |
| --- | --- | --- | --- | --- |
| heavy_write[vanilla] | 8.438 | 0.114 | 0.046 | 0.029 |
| heavy_write[neon, fsync=false] | 34.892 | 2.047 | 1.137 | 0.564 |
| heavy_write[neon, fsync=true] | 60.459 | 3.562 | 1.951 | 0.851 |
| pgbench_simple_update[vanilla] | 2.488 | 0.404 | 0.236 | 0.238 |
| pgbench_simple_update[neon, fsync=false] | 6.158 | 0.578 | 0.528 | 0.039 |
| pgbench_simple_update[neon, fsync=true] | 48.186 | 11.407 | 1.465 | 2.826 |
| other_table[vanilla] | 21.407 | 0.036 | 0.012 | 0.006 |
| other_table[neon, fsync=false] | 46.768 | 0.479 | 0.135 | 0.130 |
| other_table[neon, fsync=true] | 105.960 | 20.600 | 0.630 | 2.589 |

There is a noticeable performance drop when fsync is enabled in the safekeeper config (disabled by default for tests). For example, I cannot reproduce the issue in #1763 if fsync is disabled. My best guess is that fsync is the bottleneck here.

@knizhnik (Contributor) left a comment

First of all, I noticed that throttling is rarely triggered by this test (with the default parameters).
I never saw the "backpressure throttling" backend status (but I have to say that throttling does sometimes happen, because once I added the corresponding trace, I saw such messages in the log).

Maybe we should increase the write workload (produce more WAL) or adjust the default parameters for this test.

Second, I saw three warnings:

 Command '['/home/knizhnik/zenith/tmp_install/bin/pgbench', '-s1000', '-i', 'host=localhost port=5432 dbname=postgres options=-cstatement_timeout=120s']' returned non-zero exit status 1.

I wonder if it indicates some error...

@aome510 (Contributor, Author) commented Jun 7, 2022

Second, I saw three warnings:

 Command '['/home/knizhnik/zenith/tmp_install/bin/pgbench', '-s1000', '-i', 'host=localhost port=5432 dbname=postgres options=-cstatement_timeout=120s']' returned non-zero exit status 1.

I wonder if it indicates some error...

I saw that too. It's because the query timeout is set to 120s. That's one reason why I didn't increase the workload too much.

Edit: I only see that log when increasing the test's parameters. I guess the query times out because the test runs faster on my machine than on yours.

Maybe we should increase the write workload (produce more WAL) or adjust the default parameters for this test.

Yeah, we should adjust the default parameters to make throttling happen more often.

I never saw the "backpressure throttling" backend status (but I have to say that throttling does sometimes happen, because once I added the corresponding trace, I saw such messages in the log).

Can you share the modifications you made? It would be helpful to have them as a reference when making changes.

@petuhovskiy (Member) commented:

My best guess is that fsync is the bottleneck here.

It could be. Where do you run these tests? I suggest running some benchmarks to understand how fast your disk is. pg_test_fsync should be enough to get basic fsync performance, but you can also run fio and look at https://github.com/neondatabase/perftest-sk/blob/master/common/disk-perf.md for examples.
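As a minimal, hedged example of the first suggestion (pg_test_fsync ships with PostgreSQL; the two-seconds-per-test setting is just an arbitrary choice to keep the run short):

```python
# Quick fsync baseline; runnable wherever PostgreSQL's binaries are on PATH.
import subprocess

# -s 2 limits each sub-test to two seconds so the whole run stays short.
result = subprocess.run(
    ["pg_test_fsync", "-s", "2"], capture_output=True, text=True, check=True
)
print(result.stdout)
```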

@aome510 (Contributor, Author) commented Jun 7, 2022

It could be. Where do you run these tests?

I run them locally on my MacBook Air (M1) using a release build.

@knizhnik (Contributor) commented Jun 8, 2022

On my laptop the test did not complete in 20 minutes.

Also, printing the avg/stddev of the read request delay does not seem to be enough. It would be much more interesting to see a histogram, or at least the maximal value.

My best guess is that fsync is the bottleneck here.

I do not think so.
First of all, in the case of bulk load (COPY, i.e. pgbench -i), all changes are inserted inside one transaction, so fsync is rarely performed.
Performance is limited by disk write speed.

With a large number of small update transactions (pgbench -N) the situation is quite different. Please notice that to commit a transaction, the compute node has to wait until it is synced by all safekeepers. If a safekeeper is slow, then we get a low transaction rate, but there should not be a big replication lag. Assume that you run pgbench -c 10 -N. In this case there can be up to ten concurrently executing transactions, and the LSN lag between the compute node and the safekeeper should not be larger than 10 * transaction_size, where the transaction size in this case is very small (~100 bytes).

Actually, to trigger backpressure we should make the safekeepers work faster and the pageserver slower. That can be achieved by setting fsync=off for the safekeeper (disabling fsyncs at the pageserver is not currently supported; they are always performed).
In this case transactions will be committed faster and appended to the safekeeper log, but the pageserver cannot consume them at the same rate.

But if the safekeeper and pageserver are running on the same node (as in the case of local tests), then fsyncs performed by the safekeeper may significantly affect total system performance. On my laptop, intensive disk writes may cause delays of several seconds for disk operations (reads and writes). On macOS, fsync is very expensive. So here the picture can be even more complicated and obscure.

@knizhnik (Contributor) commented Jun 8, 2022

Can you share the modifications you made? It would be helpful to have them as a reference when making changes.

I just added corresponding logging to the replication-lag handling in ProcessInterrupts. These messages are printed to pg.log, which is not very convenient to monitor from tests. Maybe we should just periodically check the backend status (pg_stat_activity) for backpressure throttling and also output the current replication lag using the backpressure_lsns() and pg_current_wal_flush_lsn() functions.
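A rough polling sketch along these lines; the connection string and the exact place where the throttling status shows up in pg_stat_activity are assumptions, and backpressure_lsns() is the Neon SQL function mentioned above, whose result shape is not relied on here:

```python
# Count (assumed) throttled backends and print lag-related LSNs once a second.
import time

import psycopg2

conn = psycopg2.connect("host=localhost port=5432 dbname=postgres")  # placeholder endpoint
conn.autocommit = True
cur = conn.cursor()

for _ in range(60):
    # Assumption: the throttling status is visible as a wait event containing "backpressure".
    cur.execute(
        "SELECT count(*) FROM pg_stat_activity WHERE wait_event ILIKE %s",
        ("%backpressure%",),
    )
    throttled = cur.fetchone()[0]
    cur.execute("SELECT backpressure_lsns(), pg_current_wal_flush_lsn()")
    bp_lsns, flush_lsn = cur.fetchone()
    print(f"throttled backends: {throttled}, backpressure_lsns: {bp_lsns}, flush_lsn: {flush_lsn}")
    time.sleep(1.0)
```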

@knizhnik (Contributor) commented Jun 8, 2022

The size of the foo table is ~3MB. With shared_buffers=1MB (the default value) it doesn't fit in the compute's shared memory, so select count(*) from foo causes pages to be fetched from the pageserver. But with a large shared_buffers value (e.g. the 128MB used in staging and production), this query may not access the pageserver at all, so our test would not notice the delay caused by the replication lag.
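A small illustration of that point, assuming a placeholder connection string and the foo table from the test: when the table does not fit in shared_buffers, every run of the query has to go to the pageserver, so the timing reflects the replication lag.

```python
# Time `select count(*) from foo`; with shared_buffers ~1MB every run fetches
# pages from the pageserver, while 128MB (staging/prod) may serve it from cache.
import time

import psycopg2

conn = psycopg2.connect("host=localhost port=5432 dbname=postgres")  # placeholder
conn.autocommit = True
cur = conn.cursor()

cur.execute("SHOW shared_buffers")
print("shared_buffers =", cur.fetchone()[0])

t0 = time.time()
cur.execute("SELECT count(*) FROM foo")
rows = cur.fetchone()[0]
print(f"count(*) over {rows} rows took {time.time() - t0:.3f} s")
```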

@aome510 (Contributor, Author) commented Jun 8, 2022

On my laptop the test did not complete in 20 minutes.

Also, printing the avg/stddev of the read request delay does not seem to be enough. It would be much more interesting to see a histogram, or at least the maximal value.

Looks like you're running an old version of the tests. Can you update the branch and try again? I have removed the vacuum step when doing pgbench -s100 -i. Without that change, it ran for a really long time (>5 min on my laptop). I'm not sure why it's that slow compared to vanilla; a 4x slower read time doesn't seem to justify the scale, so maybe there are other factors.

I also added the maximum read latency as a reported value. Agreed that it would be much better to have some kind of visualization.

@aome510 (Contributor, Author) commented Jun 8, 2022

But if the safekeeper and pageserver are running on the same node (as in the case of local tests), then fsyncs performed by the safekeeper may significantly affect total system performance. On my laptop, intensive disk writes may cause delays of several seconds for disk operations (reads and writes). On macOS, fsync is very expensive. So here the picture can be even more complicated and obscure.

That makes sense and also explains the performance difference when running with fsync on/off. I'll revert the change (which enables fsync when running the tests) and do some other experiments.

@knizhnik (Contributor) commented Jun 8, 2022

I also noticed a significant fsync on/off difference for pgbench -i.
Looks like I was wrong: 128kB (the maximal size of a walsender message) is not large enough to eliminate the overhead of fsyncing. Since the copy is done inside one transaction, the sync replication wait cannot slow down the walsender. So in this case the bottleneck causing replication lag is the safekeeper, not the pageserver.

test_measure_read_latency_heavy_write_workload[vanilla-10-1].run_duration: 13.895 s
test_measure_read_latency_heavy_write_workload[vanilla-10-1].read_latency_max: 0.306 s
test_measure_read_latency_heavy_write_workload[vanilla-10-1].read_latency_avg: 0.139 s
test_measure_read_latency_heavy_write_workload[vanilla-10-1].read_latency_stdev: 0.112 s
test_measure_read_latency_heavy_write_workload[neon_fsync_off-10-1].run_duration: 36.132 s
test_measure_read_latency_heavy_write_workload[neon_fsync_off-10-1].read_latency_max: 1.774 s
test_measure_read_latency_heavy_write_workload[neon_fsync_off-10-1].read_latency_avg: 1.096 s
test_measure_read_latency_heavy_write_workload[neon_fsync_off-10-1].read_latency_stdev: 0.485 s
test_measure_read_latency_heavy_write_workload[neon_fsync_on-10-1].run_duration: 61.670 s
test_measure_read_latency_heavy_write_workload[neon_fsync_on-10-1].read_latency_max: 4.779 s
test_measure_read_latency_heavy_write_workload[neon_fsync_on-10-1].read_latency_avg: 2.171 s
test_measure_read_latency_heavy_write_workload[neon_fsync_on-10-1].read_latency_stdev: 1.352 s
test_measure_read_latency_pgbench_simple_update_workload[vanilla-1000-10].run_duration: 12.488 s
test_measure_read_latency_pgbench_simple_update_workload[vanilla-1000-10].read_latency_max: 6.474 s
test_measure_read_latency_pgbench_simple_update_workload[vanilla-1000-10].read_latency_avg: 1.490 s
test_measure_read_latency_pgbench_simple_update_workload[vanilla-1000-10].read_latency_stdev: 2.793 s
test_measure_read_latency_pgbench_simple_update_workload[neon_fsync_off-1000-10].run_duration: 7.886 s
test_measure_read_latency_pgbench_simple_update_workload[neon_fsync_off-1000-10].read_latency_max: 0.642 s
test_measure_read_latency_pgbench_simple_update_workload[neon_fsync_off-1000-10].read_latency_avg: 0.569 s
test_measure_read_latency_pgbench_simple_update_workload[neon_fsync_off-1000-10].read_latency_stdev: 0.049 s
test_measure_read_latency_pgbench_simple_update_workload[neon_fsync_on-1000-10].run_duration: 26.541 s
test_measure_read_latency_pgbench_simple_update_workload[neon_fsync_on-1000-10].read_latency_max: 4.353 s
test_measure_read_latency_pgbench_simple_update_workload[neon_fsync_on-1000-10].read_latency_avg: 1.988 s
test_measure_read_latency_pgbench_simple_update_workload[neon_fsync_on-1000-10].read_latency_stdev: 1.658 s
test_measure_read_latency_other_table[vanilla-100].run_duration: 130.723 s
test_measure_read_latency_other_table[vanilla-100].read_latency_max: 1.199 s
test_measure_read_latency_other_table[vanilla-100].read_latency_avg: 0.067 s
test_measure_read_latency_other_table[vanilla-100].read_latency_stdev: 0.164 s
test_measure_read_latency_other_table[neon_fsync_off-100].run_duration: 115.215 s
test_measure_read_latency_other_table[neon_fsync_off-100].read_latency_max: 8.140 s
test_measure_read_latency_other_table[neon_fsync_off-100].read_latency_avg: 0.377 s
test_measure_read_latency_other_table[neon_fsync_off-100].read_latency_stdev: 1.081 s
test_measure_read_latency_other_table[neon_fsync_on-100].run_duration: 242.563 s
test_measure_read_latency_other_table[neon_fsync_on-100].read_latency_max: 40.526 s
test_measure_read_latency_other_table[neon_fsync_on-100].read_latency_avg: 0.890 s
test_measure_read_latency_other_table[neon_fsync_on-100].read_latency_stdev: 3.844 s

@aome510 changed the title from "Add wal backpressure performance tests (WIP)" to "Add wal backpressure performance tests" on Jun 8, 2022
@aome510 marked this pull request as ready for review June 8, 2022 21:38
@aome510 (Contributor, Author) commented Jun 8, 2022

It looks like fsync in CI is quite fast compared to my local machine, so I increased the tests' default parameters to get more interesting results.

New result (from https://app.circleci.com/pipelines/github/neondatabase/neon/7033/workflows/36dcb168-3213-45c5-9820-68434997d4a9/jobs/72231)

test_heavy_write_workload[vanilla-10-5].run_duration: 10.666 s
test_heavy_write_workload[vanilla-10-5].read_latency_max: 0.123 s
test_heavy_write_workload[vanilla-10-5].read_latency_avg: 0.061 s
test_heavy_write_workload[vanilla-10-5].read_latency_stdev: 0.039 s
test_heavy_write_workload[neon_fsync_off-10-5].lsn_write_lag_max: 4296 kB
test_heavy_write_workload[neon_fsync_off-10-5].lsn_write_lag_avg: 1862 kB
test_heavy_write_workload[neon_fsync_off-10-5].lsn_write_lag_stdev: 824 kB
test_heavy_write_workload[neon_fsync_off-10-5].run_duration: 34.765 s
test_heavy_write_workload[neon_fsync_off-10-5].read_latency_max: 8.415 s
test_heavy_write_workload[neon_fsync_off-10-5].read_latency_avg: 3.313 s
test_heavy_write_workload[neon_fsync_off-10-5].read_latency_stdev: 2.660 s
test_heavy_write_workload[neon_fsync_on-10-5].lsn_write_lag_max: 3296 kB
test_heavy_write_workload[neon_fsync_on-10-5].lsn_write_lag_avg: 1729 kB
test_heavy_write_workload[neon_fsync_on-10-5].lsn_write_lag_stdev: 734 kB
test_heavy_write_workload[neon_fsync_on-10-5].run_duration: 36.363 s
test_heavy_write_workload[neon_fsync_on-10-5].read_latency_max: 10.264 s
test_heavy_write_workload[neon_fsync_on-10-5].read_latency_avg: 3.497 s
test_heavy_write_workload[neon_fsync_on-10-5].read_latency_stdev: 3.366 s
test_pgbench_simple_update_workload[vanilla-1000-100].run_duration: 19.652 s
test_pgbench_simple_update_workload[vanilla-1000-100].read_latency_max: 12.901 s
test_pgbench_simple_update_workload[vanilla-1000-100].read_latency_avg: 3.909 s
test_pgbench_simple_update_workload[vanilla-1000-100].read_latency_stdev: 6.084 s
test_pgbench_simple_update_workload[neon_fsync_off-1000-100].lsn_write_lag_max: 19431 kB
test_pgbench_simple_update_workload[neon_fsync_off-1000-100].lsn_write_lag_avg: 1871 kB
test_pgbench_simple_update_workload[neon_fsync_off-1000-100].lsn_write_lag_stdev: 4491 kB
test_pgbench_simple_update_workload[neon_fsync_off-1000-100].run_duration: 99.212 s
test_pgbench_simple_update_workload[neon_fsync_off-1000-100].read_latency_max: 44.542 s
test_pgbench_simple_update_workload[neon_fsync_off-1000-100].read_latency_avg: 18.813 s
test_pgbench_simple_update_workload[neon_fsync_off-1000-100].read_latency_stdev: 17.174 s
test_pgbench_simple_update_workload[neon_fsync_on-1000-100].lsn_write_lag_max: 19802 kB
test_pgbench_simple_update_workload[neon_fsync_on-1000-100].lsn_write_lag_avg: 2446 kB
test_pgbench_simple_update_workload[neon_fsync_on-1000-100].lsn_write_lag_stdev: 4973 kB
test_pgbench_simple_update_workload[neon_fsync_on-1000-100].run_duration: 154.156 s
test_pgbench_simple_update_workload[neon_fsync_on-1000-100].read_latency_max: 55.226 s
test_pgbench_simple_update_workload[neon_fsync_on-1000-100].read_latency_avg: 24.500 s
test_pgbench_simple_update_workload[neon_fsync_on-1000-100].read_latency_stdev: 23.025 s
test_pgbench_intensive_init_workload[vanilla-1000].run_duration: 122.360 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_max: 0.020 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_avg: 0.007 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_stdev: 0.002 s
test_pgbench_intensive_init_workload[neon_fsync_off-1000].lsn_write_lag_max: 15460 kB
test_pgbench_intensive_init_workload[neon_fsync_off-1000].lsn_write_lag_avg: 2287 kB
test_pgbench_intensive_init_workload[neon_fsync_off-1000].lsn_write_lag_stdev: 3236 kB
test_pgbench_intensive_init_workload[neon_fsync_off-1000].run_duration: 193.829 s
test_pgbench_intensive_init_workload[neon_fsync_off-1000].read_latency_max: 2.508 s
test_pgbench_intensive_init_workload[neon_fsync_off-1000].read_latency_avg: 0.323 s
test_pgbench_intensive_init_workload[neon_fsync_off-1000].read_latency_stdev: 0.468 s
test_pgbench_intensive_init_workload[neon_fsync_on-1000].lsn_write_lag_max: 15467 kB
test_pgbench_intensive_init_workload[neon_fsync_on-1000].lsn_write_lag_avg: 12968 kB
test_pgbench_intensive_init_workload[neon_fsync_on-1000].lsn_write_lag_stdev: 3889 kB
test_pgbench_intensive_init_workload[neon_fsync_on-1000].run_duration: 123.336 s
test_pgbench_intensive_init_workload[neon_fsync_on-1000].read_latency_max: 19.415 s
test_pgbench_intensive_init_workload[neon_fsync_on-1000].read_latency_avg: 9.231 s
test_pgbench_intensive_init_workload[neon_fsync_on-1000].read_latency_stdev: 8.565 s

Since the copy is done inside one transaction, the sync replication wait cannot slow down the walsender. So in this case the bottleneck causing replication lag is the safekeeper, not the pageserver.

Yeah, it looks like the only place where there is a significant difference between fsync=true and fsync=false is test_pgbench_intensive_init_workload. In the other two, the performance with fsync on is still worse than with fsync off, but not by much.

@aome510 (Contributor, Author) commented Jun 8, 2022

Important note: the new tests in this PR (with default parameters) make the benchmark workflow 1.5 times slower, from ~20 min to ~30 min.

@aome510 aome510 force-pushed the thang/add-wal-backpressure-tests branch from d5917b1 to d48141a Compare June 8, 2022 22:27
@knizhnik (Contributor) commented:

test_measure_read_latency_heavy_write_workload[vanilla-1].run_duration: 1,503.690 s
test_measure_read_latency_heavy_write_workload[vanilla-1].read_latency_avg: 0.047 s
test_measure_read_latency_heavy_write_workload[vanilla-1].read_latency_stdev: 0.192 s
test_measure_read_latency_heavy_write_workload[neon-1].run_duration: 1,350.481 s
test_measure_read_latency_heavy_write_workload[neon-1].read_latency_avg: 6.367 s
test_measure_read_latency_heavy_write_workload[neon-1].read_latency_stdev: 3.747 s
test_measure_read_latency_pgbench_simple_update_workload[vanilla-100000-10].run_duration: 122.362 s
test_measure_read_latency_pgbench_simple_update_workload[vanilla-100000-10].read_latency_avg: 0.026 s
test_measure_read_latency_pgbench_simple_update_workload[vanilla-100000-10].read_latency_stdev: 0.110 s
test_measure_read_latency_pgbench_simple_update_workload[neon-100000-10].run_duration: 127.898 s
test_measure_read_latency_pgbench_simple_update_workload[neon-100000-10].read_latency_avg: 0.020 s
test_measure_read_latency_pgbench_simple_update_workload[neon-100000-10].read_latency_stdev: 0.010 s
test_measure_read_latency_other_table[vanilla-1000].run_duration: 143.355 s
test_measure_read_latency_other_table[vanilla-1000].read_latency_avg: 0.088 s
test_measure_read_latency_other_table[vanilla-1000].read_latency_stdev: 0.206 s
test_measure_read_latency_other_table[neon-1000].run_duration: 138.211 s
test_measure_read_latency_other_table[neon-1000].read_latency_avg: 1.302 s
test_measure_read_latency_other_table[neon-1000].read_latency_stdev: 2.775 s

@knizhnik (Contributor) commented:

The latest version of test_wal_backpressure from the thang/add-wal-backpressure-tests branch does not show large read delays, but 2 tests failed with an error:

FAILED test_runner/performance/test_wal_backpressure.py::test_pgbench_simple_update_workload[neon_off_500MB-10000-10] - psycopg2.DatabaseError: no commit timestamp...
FAILED test_runner/performance/test_wal_backpressure.py::test_pgbench_simple_update_workload[neon_on_500MB-10000-10] - psycopg2.DatabaseError: no commit timestamps...

I wonder whether find_lsn_for_timestamp should bail out if it doesn't find any timestamp (maybe because the database is empty)?

@aome510 (Contributor, Author) commented Jun 10, 2022

The latest version of test_wal_backpressure from the thang/add-wal-backpressure-tests branch does not show large read delays, but 2 tests failed with an error:

FAILED test_runner/performance/test_wal_backpressure.py::test_pgbench_simple_update_workload[neon_off_500MB-10000-10] - psycopg2.DatabaseError: no commit timestamp...
FAILED test_runner/performance/test_wal_backpressure.py::test_pgbench_simple_update_workload[neon_on_500MB-10000-10] - psycopg2.DatabaseError: no commit timestamps...

I wonder whether find_lsn_for_timestamp should bail out if it doesn't find any timestamp (maybe because the database is empty)?

In this PR, I tuned down the parameters a bit and only made some structural changes.

Now, most of the changes to the tests' behaviour will be in #1919. I've changed the CI workflows in that branch to get a better feedback loop. I have also applied Heikki's suggestions and observed the write lag approaching the limit quickly. The tests in that branch are quite broken at the moment, so I will need to look into them first.

Regarding the psycopg2.DatabaseError: no commit timestamps..., I haven't seen that before. I'll try to reproduce it locally.

@aome510 (Contributor, Author) commented Jun 10, 2022

@knizhnik I think the tests work well now on #1919. You can try running them on that branch.

An example result from https://app.circleci.com/pipelines/github/neondatabase/neon/7096/workflows/7160f1d4-5a97-40d7-9ebb-6d8b5b07c3b0/jobs/72866:

test_heavy_write_workload[vanilla-10-5-5].run_duration: 31.004 s
test_heavy_write_workload[vanilla-10-5-5].read_latency_max: 0.014 s
test_heavy_write_workload[vanilla-10-5-5].read_latency_avg: 0.005 s
test_heavy_write_workload[vanilla-10-5-5].read_latency_stdev: 0.002 s
test_heavy_write_workload[neon_off_15MB-10-5-5].run_duration: 186.006 s
test_heavy_write_workload[neon_off_15MB-10-5-5].lsn_write_lag_max: 16936 kB
test_heavy_write_workload[neon_off_15MB-10-5-5].lsn_write_lag_avg: 14737 kB
test_heavy_write_workload[neon_off_15MB-10-5-5].lsn_write_lag_stdev: 3102 kB
test_heavy_write_workload[neon_off_15MB-10-5-5].read_latency_max: 12.382 s
test_heavy_write_workload[neon_off_15MB-10-5-5].read_latency_avg: 4.346 s
test_heavy_write_workload[neon_off_15MB-10-5-5].read_latency_stdev: 3.121 s
test_heavy_write_workload[neon_off_500MB-10-5-5].run_duration: 183.251 s
test_heavy_write_workload[neon_off_500MB-10-5-5].lsn_write_lag_max: 466755 kB
test_heavy_write_workload[neon_off_500MB-10-5-5].lsn_write_lag_avg: 137283 kB
test_heavy_write_workload[neon_off_500MB-10-5-5].lsn_write_lag_stdev: 97640 kB
test_heavy_write_workload[neon_off_500MB-10-5-5].read_latency_max: 21.312 s
test_heavy_write_workload[neon_off_500MB-10-5-5].read_latency_avg: 15.360 s
test_heavy_write_workload[neon_off_500MB-10-5-5].read_latency_stdev: 5.996 s
test_heavy_write_workload[neon_on_15MB-10-5-5].run_duration: 208.458 s
test_heavy_write_workload[neon_on_15MB-10-5-5].lsn_write_lag_max: 17541 kB
test_heavy_write_workload[neon_on_15MB-10-5-5].lsn_write_lag_avg: 14406 kB
test_heavy_write_workload[neon_on_15MB-10-5-5].lsn_write_lag_stdev: 3418 kB
test_heavy_write_workload[neon_on_15MB-10-5-5].read_latency_max: 15.585 s
test_heavy_write_workload[neon_on_15MB-10-5-5].read_latency_avg: 6.489 s
test_heavy_write_workload[neon_on_15MB-10-5-5].read_latency_stdev: 4.034 s
test_heavy_write_workload[neon_on_500MB-10-5-5].run_duration: 205.617 s
test_heavy_write_workload[neon_on_500MB-10-5-5].lsn_write_lag_max: 513389 kB
test_heavy_write_workload[neon_on_500MB-10-5-5].lsn_write_lag_avg: 187969 kB
test_heavy_write_workload[neon_on_500MB-10-5-5].lsn_write_lag_stdev: 145143 kB
test_heavy_write_workload[neon_on_500MB-10-5-5].read_latency_max: 23.683 s
test_heavy_write_workload[neon_on_500MB-10-5-5].read_latency_avg: 16.557 s
test_heavy_write_workload[neon_on_500MB-10-5-5].read_latency_stdev: 6.815 s
test_pgbench_simple_update_workload[vanilla-45-100].run_duration: 65.209 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_max: 13.288 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_avg: 0.347 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_stdev: 1.927 s
test_pgbench_simple_update_workload[neon_off_15MB-45-100].run_duration: 131.969 s
test_pgbench_simple_update_workload[neon_off_15MB-45-100].lsn_write_lag_max: 25241 kB
test_pgbench_simple_update_workload[neon_off_15MB-45-100].lsn_write_lag_avg: 1618 kB
test_pgbench_simple_update_workload[neon_off_15MB-45-100].lsn_write_lag_stdev: 4360 kB
test_pgbench_simple_update_workload[neon_off_15MB-45-100].read_latency_max: 35.336 s
test_pgbench_simple_update_workload[neon_off_15MB-45-100].read_latency_avg: 1.701 s
test_pgbench_simple_update_workload[neon_off_15MB-45-100].read_latency_stdev: 6.336 s
test_pgbench_simple_update_workload[neon_off_500MB-45-100].run_duration: 139.366 s
test_pgbench_simple_update_workload[neon_off_500MB-45-100].lsn_write_lag_max: 90112 kB
test_pgbench_simple_update_workload[neon_off_500MB-45-100].lsn_write_lag_avg: 1106 kB
test_pgbench_simple_update_workload[neon_off_500MB-45-100].lsn_write_lag_stdev: 7601 kB
test_pgbench_simple_update_workload[neon_off_500MB-45-100].read_latency_max: 30.133 s
test_pgbench_simple_update_workload[neon_off_500MB-45-100].read_latency_avg: 1.377 s
test_pgbench_simple_update_workload[neon_off_500MB-45-100].read_latency_stdev: 5.386 s
test_pgbench_simple_update_workload[neon_on_15MB-45-100].run_duration: 183.948 s
test_pgbench_simple_update_workload[neon_on_15MB-45-100].lsn_write_lag_max: 18951 kB
test_pgbench_simple_update_workload[neon_on_15MB-45-100].lsn_write_lag_avg: 1523 kB
test_pgbench_simple_update_workload[neon_on_15MB-45-100].lsn_write_lag_stdev: 3703 kB
test_pgbench_simple_update_workload[neon_on_15MB-45-100].read_latency_max: 49.567 s
test_pgbench_simple_update_workload[neon_on_15MB-45-100].read_latency_avg: 2.419 s
test_pgbench_simple_update_workload[neon_on_15MB-45-100].read_latency_stdev: 9.258 s
test_pgbench_simple_update_workload[neon_on_500MB-45-100].run_duration: 156.513 s
test_pgbench_simple_update_workload[neon_on_500MB-45-100].read_latency_max: 35.752 s
test_pgbench_simple_update_workload[neon_on_500MB-45-100].read_latency_avg: 1.854 s
test_pgbench_simple_update_workload[neon_on_500MB-45-100].read_latency_stdev: 7.063 s
test_pgbench_simple_update_workload[neon_on_500MB-45-100].lsn_write_lag_max: 512040 kB
test_pgbench_simple_update_workload[neon_on_500MB-45-100].lsn_write_lag_avg: 16648 kB
test_pgbench_simple_update_workload[neon_on_500MB-45-100].lsn_write_lag_stdev: 76113 kB
test_pgbench_intensive_init_workload[vanilla-1000].run_duration: 119.182 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_max: 0.106 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_avg: 0.013 s
test_pgbench_intensive_init_workload[vanilla-1000].read_latency_stdev: 0.016 s
test_pgbench_intensive_init_workload[neon_off_15MB-1000].run_duration: 66.136 s
test_pgbench_intensive_init_workload[neon_off_15MB-1000].lsn_write_lag_max: 5734 kB
test_pgbench_intensive_init_workload[neon_off_15MB-1000].lsn_write_lag_avg: 1687 kB
test_pgbench_intensive_init_workload[neon_off_15MB-1000].lsn_write_lag_stdev: 1614 kB
test_pgbench_intensive_init_workload[neon_off_15MB-1000].read_latency_max: 2.846 s
test_pgbench_intensive_init_workload[neon_off_15MB-1000].read_latency_avg: 0.865 s
test_pgbench_intensive_init_workload[neon_off_15MB-1000].read_latency_stdev: 0.488 s
test_pgbench_intensive_init_workload[neon_off_500MB-1000].run_duration: 62.151 s
test_pgbench_intensive_init_workload[neon_off_500MB-1000].read_latency_max: 16.530 s
test_pgbench_intensive_init_workload[neon_off_500MB-1000].read_latency_avg: 1.507 s
test_pgbench_intensive_init_workload[neon_off_500MB-1000].read_latency_stdev: 3.165 s
test_pgbench_intensive_init_workload[neon_off_500MB-1000].lsn_write_lag_max: 455565 kB
test_pgbench_intensive_init_workload[neon_off_500MB-1000].lsn_write_lag_avg: 70809 kB
test_pgbench_intensive_init_workload[neon_off_500MB-1000].lsn_write_lag_stdev: 132950 kB
test_pgbench_intensive_init_workload[neon_on_15MB-1000].run_duration: 122.887 s
test_pgbench_intensive_init_workload[neon_on_15MB-1000].lsn_write_lag_max: 15465 kB
test_pgbench_intensive_init_workload[neon_on_15MB-1000].lsn_write_lag_avg: 12467 kB
test_pgbench_intensive_init_workload[neon_on_15MB-1000].lsn_write_lag_stdev: 4096 kB
test_pgbench_intensive_init_workload[neon_on_15MB-1000].read_latency_max: 20.566 s
test_pgbench_intensive_init_workload[neon_on_15MB-1000].read_latency_avg: 7.853 s
test_pgbench_intensive_init_workload[neon_on_15MB-1000].read_latency_stdev: 7.159 s
test_pgbench_intensive_init_workload[neon_on_500MB-1000].run_duration: 115.814 s
test_pgbench_intensive_init_workload[neon_on_500MB-1000].lsn_write_lag_max: 512105 kB
test_pgbench_intensive_init_workload[neon_on_500MB-1000].lsn_write_lag_avg: 492815 kB
test_pgbench_intensive_init_workload[neon_on_500MB-1000].lsn_write_lag_stdev: 64286 kB
test_pgbench_intensive_init_workload[neon_on_500MB-1000].read_latency_max: 115.978 s
test_pgbench_intensive_init_workload[neon_on_500MB-1000].read_latency_avg: 58.409 s
test_pgbench_intensive_init_workload[neon_on_500MB-1000].read_latency_stdev: 81.415 s

Most of the read queries have been changed to something simple (e.g., reading a single row by primary key). So, the big read latency in some tests is probably caused by the write lag.

@knizhnik (Contributor) commented:

test_wal_backpressure on the #1919 branch now consumes more than 200GB of disk space, so I can't run it locally.
So we now have a lot of tests for backpressure. But what's next?
These tests prove the well-known fact that a 500MB write lag is too much and may cause long delays (minutes).
With a 15MB write lag the delays are much smaller, except for this one:

test_pgbench_simple_update_workload[neon_on_15MB-45-100].read_latency_max: 49.567 s

But it is hard to say whether such a delay is actually caused by wait_lsn, and what the pageserver is doing during these 50 seconds.
Maybe the node was just overloaded with work.

My original assumption was that a small (10MB) max_replication_write_lag would eliminate huge delays (>10 sec).
It's a pity if that is not true.

The other problem, which made me look for a replacement for the current stop-and-wait throttling algorithm, was that with a small max_replication_write_lag the write speed was about 3 times lower.
These tests do not confirm that: only in one case is there a noticeable difference:

test_pgbench_simple_update_workload[neon_on_15MB-45-100].run_duration: 183.948 s
test_pgbench_simple_update_workload[neon_on_500MB-45-100].run_duration: 156.513 s

But even in this case it is not large enough. So based on the results of these tests, there is no strong need to replace the current simple stop-and-wait throttling algorithm.

What else we should try is the cache for the last-written LSN (neondatabase/postgres#167).
It is an alternative (to backpressure) way to fix the read latency problem.

@knizhnik (Contributor) commented:

PR neondatabase/postgres#167 cannot help reduce the read latency in test_pgbench_simple_update_workload, simply because the reader accesses the updated table.

Actually, the results of this particular test on my laptop look really confusing:

test_pgbench_simple_update_workload[vanilla-45-100].run_duration: 87.705 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_max: 33.317 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_avg: 0.752 s
test_pgbench_simple_update_workload[vanilla-45-100].read_latency_stdev: 4.714 s
test_pgbench_simple_update_workload[neon_off_15MB-45-100].run_duration: 178.398 s
test_pgbench_simple_update_workload[neon_off_15MB-45-100].read_latency_max: 56.333 s
test_pgbench_simple_update_workload[neon_off_15MB-45-100].read_latency_avg: 2.029 s
test_pgbench_simple_update_workload[neon_off_15MB-45-100].read_latency_stdev: 8.736 s
test_pgbench_simple_update_workload[neon_off_15MB-45-100].lsn_write_lag_max: 16322 kB
test_pgbench_simple_update_workload[neon_off_15MB-45-100].lsn_write_lag_avg: 1055 kB
test_pgbench_simple_update_workload[neon_off_15MB-45-100].lsn_write_lag_stdev: 3006 kB
test_pgbench_simple_update_workload[neon_off_500MB-45-100].run_duration: 164.120 s
test_pgbench_simple_update_workload[neon_off_500MB-45-100].read_latency_max: 38.964 s
test_pgbench_simple_update_workload[neon_off_500MB-45-100].read_latency_avg: 1.783 s
test_pgbench_simple_update_workload[neon_off_500MB-45-100].read_latency_stdev: 6.944 s
test_pgbench_simple_update_workload[neon_off_500MB-45-100].lsn_write_lag_max: 65924 kB
test_pgbench_simple_update_workload[neon_off_500MB-45-100].lsn_write_lag_avg: 1294 kB
test_pgbench_simple_update_workload[neon_off_500MB-45-100].lsn_write_lag_stdev: 6521 kB
test_pgbench_simple_update_workload[neon_on_15MB-45-100].lsn_write_lag_max: 15809 kB
test_pgbench_simple_update_workload[neon_on_15MB-45-100].lsn_write_lag_avg: 2040 kB
test_pgbench_simple_update_workload[neon_on_15MB-45-100].lsn_write_lag_stdev: 4684 kB
test_pgbench_simple_update_workload[neon_on_15MB-45-100].read_latency_max: 107.208 s
test_pgbench_simple_update_workload[neon_on_15MB-45-100].read_latency_avg: 50.435 s
test_pgbench_simple_update_workload[neon_on_15MB-45-100].read_latency_stdev: 53.878 s
test_pgbench_simple_update_workload[neon_on_500MB-45-100].lsn_write_lag_max: 512448 kB
test_pgbench_simple_update_workload[neon_on_500MB-45-100].lsn_write_lag_avg: 76596 kB
test_pgbench_simple_update_workload[neon_on_500MB-45-100].lsn_write_lag_stdev: 178892 kB
test_pgbench_simple_update_workload[neon_on_500MB-45-100].read_latency_max: 128.584 s
test_pgbench_simple_update_workload[neon_on_500MB-45-100].read_latency_avg: 55.517 s
test_pgbench_simple_update_workload[neon_on_500MB-45-100].read_latency_stdev: 66.059 s

The average and maximal delays with max_replication_write_lag=15MB are even larger than with max_replication_write_lag=500MB, although the lag in the latter case is really much larger.
So it looks like the lag limit by itself is not enough to guarantee small latency, at least on my laptop.

@knizhnik (Contributor) commented:

I understand the reasons for the large read latency in test_pgbench_simple_update_workload and the other tests, and why backpressure doesn't help much:

  1. By default wal_log_hints=on, so hint-bit records are inserted into the WAL without an XID (by vacuum or select). Backpressure is not applied to backends that have no XID, so it does not help in this case.
  2. This test performs the "SELECT * from pgbench_accounts where aid = 1" query in a separate thread, but this query is also executed BEFORE the index is constructed, so it requires a sequential scan. This is why read_latency_max is high.

With wal_log_hints=on:

test_pgbench_simple_update_workload[neon_on_15MB-45-100].read_latency_avg: 50.435 s

With wal_log_hints=off:

test_pgbench_simple_update_workload[neon_on_15MB-45-100].read_latency_avg: 2.340 s

@aome510 (Contributor, Author) commented Jun 13, 2022

  1. By default wal_log_hints=on, so hint-bit records are inserted into the WAL without an XID (by vacuum or select). Backpressure is not applied to backends that have no XID, so it does not help in this case.

Is wal_log_hints=on enabled in prod? Should we disable wal_log_hints for the WAL backpressure tests?

  2. This test performs the "SELECT * from pgbench_accounts where aid = 1" query in a separate thread, but this query is also executed BEFORE the index is constructed, so it requires a sequential scan. This is why read_latency_max is high.

Good catch! I have updated the tests in #1919; the max read latency is reduced significantly on my local machine.

@aome510 (Contributor, Author) commented Jun 13, 2022

So we now have a lot of tests for backpressure. But what's next?
These tests prove the well-known fact that a 500MB write lag is too much and may cause long delays (minutes).

I'm not so sure. In fact, with just the new WAL backpressure tests, the total test runtime increases by ~40 min, which is extremely slow. It can probably be helped with parallelism.

Edit: if a 500MB write lag is too much, we can remove those tests. However, given that 500MB is the default setting in prod, we should probably wait.

@hlinnaka (Contributor) commented:

@hlinnaka I tried your test with the required relcache improvements on my laptop. Its execution takes a long time (it is actually based on an old version of test_wal_backpressure), but the latency is not so horrible, although a >10 second delay does not seem acceptable. It means that even 15MB is too large a value for max_replication_write_lag.

Hmm, I'm not sure I'm ready to jump to that conclusion. What is the speed at which the pageserver can process the incoming WAL? 10 MB/s? 100 MB/s? I remember we did some quick measurements a long time ago and got about 30 MB/s. If we assume conservatively that it's actually 10 MB/s, then a 15 MB max_replication_write_lag should give at most a 1.5 second delay. So why are we seeing read latency > 10 seconds for some queries?

Is the pageserver getting stuck on a lock sometimes, so that it pauses processing any WAL for 5-10 seconds? Or does the query need to wait for multiple pages, so that the overall response time is > 10 seconds even though the wait for each page is only 1.5 s? Or what's going on?

It would be useful to print out the current last_received_lsn once a second during the test. You could then see the speed at which the pageserver is processing the WAL. Is it constant, or are there pauses?
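A sketch of that kind of once-a-second sampling, using only stock PostgreSQL functions on the compute side; the connection string is a placeholder, and comparing the same series against the pageserver's last_received_lsn (obtained however the test exposes it, e.g. the pageserver API) would give the consumption rate and lag.

```python
# Sample the compute's flush LSN once a second and print the WAL production rate.
import time

import psycopg2

conn = psycopg2.connect("host=localhost port=5432 dbname=postgres")  # placeholder
conn.autocommit = True
cur = conn.cursor()

cur.execute("SELECT pg_current_wal_flush_lsn()")
prev = cur.fetchone()[0]
for _ in range(30):
    time.sleep(1.0)
    cur.execute(
        "SELECT pg_current_wal_flush_lsn(), pg_wal_lsn_diff(pg_current_wal_flush_lsn(), %s)",
        (prev,),
    )
    prev, produced_bytes = cur.fetchone()
    print(f"flush_lsn={prev}  produced {float(produced_bytes) / 1e6:.1f} MB in the last second")
```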

@knizhnik (Contributor) commented:

Is wal_log_hints=on enabled in prod?

Yes

In fact, with just the new WAL backpressure tests, the total test runtime increases by ~40 min, which is extremely slow.

I do not think we should run tests with a 500MB max_replication_write_lag.

However, given that 500MB is the default setting in prod, we should probably wait.

IMHO we should change the prod settings ASAP.

Hmm, I'm not sure I'm ready to jump to that conclusion. What is the speed at which the pageserver can process the incoming WAL?

It greatly depends on the workload (i.e. FPIs from COPY vs. individual insert records, especially ones updating indexes) and on disk I/O speed. But 10 MB/sec looks like a good estimate of the WAL consumption speed.

So why are are we seeing read latency > 10 seconds for some queries?

We get 10 MB/sec if there is no other activity.
If the following are also active:

  • autovacuum
  • checkpointing (writing frozen layers)
  • flushing layers
  • compaction

then the speed can decrease several times.
I tried to find out which of the activities above has the most impact on performance.
But it looks like on my laptop the main showstopper is the SSD itself: under intensive writes, I/O operations (fsyncs, writes and even reads) are sometimes blocked for dozens of seconds. On server hardware this effect is less noticeable.

Is the pageserver getting stuck on a lock sometimes, so that it pauses processing any WAL for 5-10 seconds? Or does the query need to wait for multiple pages, so that the overall response time is > 10 seconds even though the wait for each page is only 1.5 s? Or what's going on?

I have never seen the pageserver blocked on locks. Most frequently I see threads blocked in write/fsync.

Also please notice that originally I saw the huge-delay problem with new client connections and the first execution of some query.
In both cases the client has to read much more than just one page: it actually reads a substantial part of the system catalog, which is hundreds or maybe even thousands of pages. This is why the total time can be much larger than 1 second. But with a 10MB max_replication_write_lag, or with the cached last-written LSN patch, I rarely see delays larger than a few seconds.

It would be useful to print out the current last_received_lsn once a second during the test. You could then see the speed at which the pageserver is processing the WAL. Is it constant, or are there pauses?

The backpressure tests print the lag now; combining it with the current flush LSN, we can monitor the speed at which the pageserver consumes WAL.

@aome510 (Contributor, Author) commented Jun 13, 2022

In fact, with just the new WAL backpressure tests, the total test runtime increases by ~40 min, which is extremely slow.

I do not think we should run tests with a 500MB max_replication_write_lag.

However, given that 500MB is the default setting in prod, we should probably wait.

IMHO we should change the prod settings ASAP.

Got it. I will disable it in the tests later.

It would be useful to print out the current last_received_lsn once a second during the test. You could then see the speed at which the pageserver is processing the WAL. Is it constant, or are there pauses?

The backpressure tests print the lag now; combining it with the current flush LSN, we can monitor the speed at which the pageserver consumes WAL.

I have added code to print lsn_process_speed and lsn_produce_speed in #1919. I also disabled wal_log_hints in the latest commit to see if that makes any difference.

@aome510 (Contributor, Author) commented Jun 14, 2022

If we assume conservatively that it's actually 10 MB/s, then a 15 MB max_replication_write_lag should give at most a 1.5 second delay. So why are we seeing read latency > 10 seconds for some queries?

This is only true if the difference between the produce speed and the process speed is consistently 10 MB/s. IIUC, since a read request requires the latest LSN, theoretically there can be a situation where the produce speed equals the process speed and the lag is 14 MB. If that happens, the read request will never finish because the pageserver keeps falling behind.

@knizhnik (Contributor) commented:

IIUC, since a read request requires the latest LSN, theoretically there can be a situation where the produce speed equals the process speed and the lag is 14 MB. If that happens, the read request will never finish because the pageserver keeps falling behind.

A read request may require the latest LSN, but that is the latest LSN at the moment of issuing the request. If some WAL is produced after it, we should not wait for it. So the situation you have described is not possible. Achilles won't overtake the tortoise: the walreceiver may never catch up with the walsender without some kind of synchronization. But we are solving a different problem: read_page_at_lsn doesn't need to catch up; it just waits for a particular LSN. And even if the request has the "latest" flag, we still wait only for the last_record_lsn as of the moment of receiving the request.

@knizhnik knizhnik closed this Jun 14, 2022
@aome510 (Contributor, Author) commented Jun 14, 2022

IIUC, since a read request requires the latest LSN, theoretically there can be a situation where the produce speed equals the process speed and the lag is 14 MB. If that happens, the read request will never finish because the pageserver keeps falling behind.

A read request may require the latest LSN, but that is the latest LSN at the moment of issuing the request. If some WAL is produced after it, we should not wait for it. So the situation you have described is not possible. Achilles won't overtake the tortoise: the walreceiver may never catch up with the walsender without some kind of synchronization. But we are solving a different problem: read_page_at_lsn doesn't need to catch up; it just waits for a particular LSN. And even if the request has the "latest" flag, we still wait only for the last_record_lsn as of the moment of receiving the request.

Yeah, you're right. The requested page stream only waits for at most req.lsn (the LSN at the time of making the read request). And the wait once the page service receives such a page request is not long. What I don't understand is why, for requests with big read latency, there is a delay between the time the client starts the request and the time the page service receives the corresponding page stream message.

Not sure if the delay is because the page service processes page stream messages sequentially (doesn't it?).

@aome510 aome510 reopened this Jun 14, 2022
@aome510 (Contributor, Author) commented Jun 15, 2022

@knizhnik I think I understand the problem now. It's true that a single get_page_at_lsn request is fast, but when the page service receives multiple get_page_at_lsn requests, the total runtime is slow because the page service handles them sequentially and needs to wait for the latest LSN specified by each request. During the wait, the latest LSN for the next request advances as PG produces and the pageserver processes more WAL. For example, assume 10 MB/s WAL processing and a 15 MB lag -> 15/10 = 1.5 s maximum wait time per request. A select count(*) from t with 100,000 rows touches around 375 blocks, so in the worst case the total wait time is about (375 * 1.5) s.
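A back-of-the-envelope version of that estimate; the 10 MB/s processing speed, 15 MB lag, and 375-block table size are the assumed figures from the discussion above:

```python
# Worst-case wait estimate: every get_page_at_lsn request waits for the full lag
# to be consumed before the next page can be served.
wal_processing_mb_per_s = 10   # assumed pageserver WAL processing speed
max_write_lag_mb = 15          # assumed steady write lag (max_replication_write_lag)
blocks = 375                   # ~100,000 small rows at roughly 267 rows per 8 kB page

wait_per_request_s = max_write_lag_mb / wal_processing_mb_per_s   # 1.5 s
worst_case_total_s = blocks * wait_per_request_s                  # 562.5 s
print(f"per-request wait ~{wait_per_request_s} s, worst case ~{worst_case_total_s} s")
```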

Please correct me if I misunderstand something here.

In the case of test_heavy_write_workload, SELECT * from t0 where key = 0 should involve a single get_page_at_lsn request. The reason for the large latency (~4-5 s) is that after establishing a new connection, handling a read request requires PG to also read other relations. Looking at the logs, these seem to include pg_* tables such as pg_authid_rolname_index, pg_authid_oid_index, etc. This scales up the number of get_page_at_lsn requests for each read-latency record. I have tried to address this problem by reusing a PG cursor when recording read latencies in 041dced (see the sketch below).
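A hedged sketch of the cursor-reuse idea (not the actual diff in 041dced; the connection string and query are placeholders): measuring each latency on a fresh connection drags the catalog reads into every sample, whereas reusing one connection and cursor pays that cost only once.

```python
import time

import psycopg2

connstr = "host=localhost port=5432 dbname=postgres"  # placeholder


def read_latencies_reconnecting(n: int):
    # New connection per sample: each sample also pays for the catalog reads.
    latencies = []
    for _ in range(n):
        conn = psycopg2.connect(connstr)
        cur = conn.cursor()
        t0 = time.time()
        cur.execute("SELECT * FROM t0 WHERE key = 0")
        cur.fetchall()
        latencies.append(time.time() - t0)
        conn.close()
    return latencies


def read_latencies_reusing(n: int):
    # One connection and cursor reused: catalog pages are read (and cached) once.
    conn = psycopg2.connect(connstr)
    cur = conn.cursor()
    latencies = []
    for _ in range(n):
        t0 = time.time()
        cur.execute("SELECT * FROM t0 WHERE key = 0")
        cur.fetchall()
        latencies.append(time.time() - t0)
    conn.close()
    return latencies
```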

Compared to the previous commit, the average read latency for this test is reduced by 10x. However, the read latency and runtime for the test_pgbench_intensive_init_workload test increased significantly. I'll need to look into it a bit more deeply to find the cause. My quick take on the slow read latency is that the select count(*) from t query involves reading multiple blocks (mentioned above), hence issuing multiple get_page_at_lsn requests. But it's not clear to me why it's much slower now.

@knizhnik (Contributor) commented:

Please correct me if I misunderstand something here.

Yes, your understanding is correct.
PR #1919 addresses this problem, because an updated relation does not increase the last-written LSN for all other tables, so we can access catalog relations with an old LSN and do not have to wait each time for the pageserver to catch up. In that PR the cache is very small (just 4 entries), so if many tables (and their indexes) are updated, the cache will not help. As an alternative, we could think about a much larger cache, maybe not for the whole relation but for some relation chunk (e.g. 1MB) or even individual pages. But such a large cache would have to be shared and so would require synchronized access, and it could become a bottleneck.

@aome510 (Contributor, Author) commented Jun 17, 2022

However, the read latency and runtime for the test_pgbench_intensive_init_workload test increased significantly.

Actually, after rerunning the test, there is not much of a difference between the two runs. I guess the previous anomaly happened because CircleCI was under heavy load.

@knizhnik if the tests' behaviour defined in #1919 looks good to you, I can merge it back into this branch and we can start reviewing this PR and merging it into main.

@aome510 (Contributor, Author) commented Jun 17, 2022

Or we can rebase #1919 and start from there.

@knizhnik (Contributor) commented:

@knizhnik if the tests' behaviour defined in #1919 looks good to you, I can merge it back into this branch and we can start reviewing this PR and merging it into main.

Maybe we should just merge #1919 into main instead?

@aome510 (Contributor, Author) commented Jun 17, 2022

Closing in favour of #1919.


Successfully merging this pull request may close these issues: Add tests for #1763