Big insert effectively prevents any getPage requests #1207

Open
Tracked by #2028
kelvich opened this issue Feb 4, 2022 · 9 comments
Labels
c/storage/pageserver Component: storage: pageserver m/bug-bash-feb22 Moment: during the Bug Bash in Feb 22 p/cloud Product: Neon Cloud
Comments

@kelvich
Contributor

kelvich commented Feb 4, 2022

If we run an insert in one session:

create table t2 as select generate_series(1,10000000000);

and concurrently try to get some uncached pages from the pageserver:

\d+

the second backend will fail with:

ERROR:  could not read block 0 in rel 1663/16385/1255.0 from page server at lsn 0/3E5FC238
DETAIL:  page server returned error: Timed out while waiting for WAL record at LSN 0/3E5FC238 to arrive, last_record_lsn 0/1043E8D8 disk consistent LSN=0/1695D48
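
For context, a hedged sketch of how the growing WAL lag could be observed from the compute side while the insert runs. pg_current_wal_flush_lsn() is standard Postgres; backpressure_lsns() is assumed to be provided by the neon extension, and its exact shape may differ by version:

-- run in a third session while the big insert is in progress
select pg_current_wal_flush_lsn();   -- LSN the compute has generated so far
select * from backpressure_lsns();   -- assumed neon helper exposing the pageserver's
                                     -- received / disk-consistent / remote LSNs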
@kelvich kelvich added the m/bug-bash-feb22 Moment: during the Bug Bash in Feb 22 label Feb 4, 2022
@stepashka stepashka added the p/cloud Product: Neon Cloud label Feb 4, 2022
@knizhnik
Contributor

knizhnik commented Feb 4, 2022

All back-pressure thresholds are disabled.
Isn't this the expected behavior in that case?

@kelvich
Contributor Author

kelvich commented Feb 4, 2022

@ololobus
Member

ololobus commented Feb 4, 2022

Yeah, will do it asap

@ololobus
Member

ololobus commented Feb 6, 2022

Backpressure should now be enabled on both staging and prod, so it's worth trying a similar test again.
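
For anyone re-testing, a small hedged check of the compute-side backpressure settings; only max_replication_write_lag is named in this thread, and the other two GUC names are assumptions about the neon extension's configuration:

show max_replication_write_lag;
show max_replication_flush_lag;   -- assumed companion setting
show max_replication_apply_lag;   -- assumed companion setting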

@hlinnaka
Contributor

I can still reproduce this. My hunch is lock starvation in the pageserver, but needs to be investigated.

@hlinnaka hlinnaka added the c/storage/pageserver Component: storage: pageserver label Feb 22, 2022
@knizhnik
Contributor

knizhnik commented Mar 1, 2022

As written in the comment on the backpressure settings, WAL replay speed is about 10 MB/sec.
The current default for "max_replication_write_lag" is 500 MB, which should give a maximum wait-for-LSN time of about 50 sec (< 60 sec). But in practice, with a safekeeper on my notebook, replay is slower, so a 500 MB write lag really does lead to timeout expiration and hence read errors. So to avoid noticeable delays (> 1 second) we should not use a write_lag greater than 10 MB, at least in production.
I can create a PR for this, changing the default value in compute.rs. But the console has its own limits...
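
To make the arithmetic concrete: at ~10 MB/s of replay, a 500 MB allowed write lag means up to roughly 500 / 10 = 50 seconds of waiting, uncomfortably close to the 60-second wait timeout, whereas a 10 MB cap keeps the wait around one second. A hedged sketch of lowering the threshold on a compute node; whether this GUC accepts a unit suffix or can be changed via ALTER SYSTEM here is an assumption:

alter system set max_replication_write_lag = '10MB';  -- or plain 10 if the GUC is counted in MB
select pg_reload_conf();
show max_replication_write_lag;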

@stepashka
Member

Could you help reproduce this and propose a solution, @MMeent?

@aome510 aome510 mentioned this issue Jun 7, 2022
@neondatabase-bot neondatabase-bot bot added this to the 2022/07 milestone Jul 5, 2022
@neondatabase-bot neondatabase-bot bot modified the milestones: 2022/07, 2022/08 Jul 25, 2022
@neondatabase-bot neondatabase-bot bot modified the milestones: 2022/08, 2023/03 Dec 20, 2022
@shanyp
Contributor

shanyp commented Dec 26, 2023

@jcsp is this still the case after the batch ingestion contributor PR?

@jcsp
Contributor

jcsp commented Jan 2, 2024

The original report is from almost two years ago, so I don't think we can say much without re-testing this.
