Persist pg_stat information in PS #6560

Open · wants to merge 31 commits into main

Conversation

knizhnik
Contributor

@knizhnik knizhnik commented Feb 1, 2024

Problem

Statistics are saved in a local file and so are lost on compute restart.

Persist them in the page server using the same AUX file mechanism used for replication slots.

See more about motivation in https://neondb.slack.com/archives/C04DGM6SMTM/p1703077676522789

Summary of changes

Persist the pgstat file using the AUX file mechanism

Postgres PRs:
neondatabase/postgres#446
neondatabase/postgres#445

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested a review from a team as a code owner February 1, 2024 06:35
@knizhnik knizhnik requested review from conradludgate and removed request for a team February 1, 2024 06:35

github-actions bot commented Feb 1, 2024

5274 tests run: 5054 passed, 0 failed, 220 skipped (full report)


Flaky tests (2)

Postgres 17

Code coverage* (full report)

  • functions: 31.4% (7700 of 24505 functions)
  • lines: 48.8% (60455 of 123828 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
8ff6618 at 2024-10-25T13:18:17.300Z :recycle:

@jcsp
Collaborator

jcsp commented Feb 21, 2024

Thanks for the ping. Can we enforce a size limit of this WAL write on the postgres side? (i.e. check size in wallog_file_descriptor, and if it exceeds a threshold, log a warning and don't write it).

I'm thinking that since pgstat file is meant to be small, a limit shouldn't affect it, but it will protect us in case this generic "write any file to the WAL" function is misused in future.
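The suggested guard could look roughly like this — a sketch in Rust for illustration only (the actual check would be C code in wallog_file_descriptor on the Postgres side, and the 8MiB threshold is an assumption, not a value from the PR):

```rust
/// Assumed cap on the size of an aux file WAL-logged from compute.
const MAX_WALLOG_FILE_SIZE: usize = 8 * 1024 * 1024;

/// Returns true if the file may be written to the WAL; logs a warning
/// and refuses when the file exceeds the threshold.
fn should_wallog_file(path: &str, len: usize) -> bool {
    if len > MAX_WALLOG_FILE_SIZE {
        eprintln!(
            "aux file {path} is {len} bytes, over the {MAX_WALLOG_FILE_SIZE} byte limit; \
             not writing it to WAL"
        );
        return false;
    }
    true
}
```

A 4MB pgstat file passes such a check, while an unexpectedly large file from some future misuse of the generic "write any file to the WAL" path would be rejected.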

@knizhnik
Contributor Author

I just checked how large the statistics file can be: I created 10,000 tables, performed some manipulations with them, and ran vacuum analyze. Is that enough to force generation of a large statistics file?
The size of pgstat.stat is 4MB.

@jcsp
Collaborator

jcsp commented Feb 26, 2024

To factor in the issues we've had with capacity/replay scaling, what do you think about having separate stores for "big values" (such as pgstats) and "small values" (such as logical replication state)?

Whatever the implementation on the pageserver evolves to, I think it will be helpful to store the ~< 8KiB items separately from the ~<4MiB items. This doesn't have to be a whole separate API, we just need some recognizable bit that the pageserver can use to store them separately.
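The "recognizable bit" could be as simple as a size class attached to each aux value — a hypothetical sketch, where the path prefixes and the classification rule are assumptions for illustration, not the actual design:

```rust
/// Size class the compute could attach to each aux file write, so the
/// pageserver can store small, frequently updated values apart from
/// large, rarely updated ones.
#[derive(Debug, PartialEq)]
enum AuxSizeClass {
    Small, // frequently updated, ~<8KiB (e.g. logical replication state)
    Large, // rarely updated, ~<4MiB (e.g. pgstat)
}

/// Assumed path-based classification; the real rule would be fixed
/// when the API is designed.
fn size_class(path: &str) -> AuxSizeClass {
    if path.starts_with("pg_stat/") {
        AuxSizeClass::Large
    } else {
        AuxSizeClass::Small
    }
}
```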

@jcsp
Collaborator

jcsp commented Feb 26, 2024

Cross referencing with #6874 (review) -- since that PR is writing image values regularly for the AuxFilesDirectory structure, it's important that we don't put the multi-megabyte stats value in the same structure, to avoid using unreasonable amounts of storage when logical replication is used on small tenants. The scenario described in that PR was 70,000 writes, so if we were writing a 4MB image every 1000 writes, that would be 280MB of storage for a tenant that could be as small as 30MB -- the rule of thumb I'm using here is that logical replication metadata should never account for the majority of storage in a tenant.
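The arithmetic in that scenario can be checked directly (a sketch; the 1,000-write image interval and 4MB image size are the figures quoted above):

```rust
/// Back-of-envelope storage estimate: number of full directory images
/// materialized, times the size of each image.
fn image_storage_mb(writes: u64, writes_per_image: u64, image_mb: u64) -> u64 {
    (writes / writes_per_image) * image_mb
}

// 70,000 LR writes with a 4MB image every 1,000 writes -> 280MB.
```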

@knizhnik
Contributor Author

Cross referencing with #6874 (review) -- since that PR is writing image values regularly for the AuxFilesDirectory structure, it's important that we don't put the multi-megabyte stats value in the same structure, to avoid using unreasonable amounts of storage when logical replication is used on small tenants. The scenario described in that PR was 70,000 writes, so if we were writing a 4MB image every 1000 writes, that would be 280MB of storage for a tenant that could be as small as 30MB -- the rule of thumb I'm using here is that logical replication metadata should never account for the majority of storage in a tenant.

pgstat information is saved only on system shutdown and, unlike snapshot files, it is a single key.
So even if the compute is restarted every 5 minutes and there are 10k tables, it still could not cause storage bloat.

Please also notice that:

  1. 10k relations is really an extreme case. It is very unlikely that any client will have such a large number of relations at all. And for a typical number of relations (~100), the size of the statistics is a few kilobytes.
  2. These statistics are naturally versioned: a static RO replica should take the statistics relevant at that moment in time. Providing an alternative versioning mechanism is IMHO overkill.
  3. We in any case have DBDIR, which contains a serialised hash of all database relations. So with a large number of relations, storage bloat will be caused not by the statistics but by DBDIR rewriting.

@jcsp
Collaborator

jcsp commented Feb 27, 2024

So even if the compute is restarted every 5 minutes and there are 10k tables, it still could not cause storage bloat.

It doesn't matter how often the compute is restarted, if its large stats object is written to the same AuxFileDirectory as all the logical replication keys: every 1024 writes to LR slots will trigger a large (>4MiB) image value being written.

10k relations is really an extreme case. It is very unlikely that any client will have such a large number of relations at all.

Let's design defensively so that we are not relying on good luck to avoid a situation where someone creates a lot of relations (big stats) and then enables logical replication (frequent updates).

We in any case have DBDIR, which contains a serialised hash of all database relations. So with a large number of relations, storage bloat will be caused not by the statistics but by DBDIR rewriting.

dbdir is indeed an issue, but the dbdir key isn't re-written continuously as a result of logical replication slot updates, and each value in the dbdir is only a couple of bytes. The difference is that the AuxFilesDirectory is subject to frequent, continuous writes, so we must not put anything big in that structure because the frequency of writes will lead to repetition of that big object.

@knizhnik
Contributor Author

dbdir is indeed an issue, but the dbdir key isn't re-written continuously as a result of logical replication slot updates, and each value in the dbdir is only a couple of bytes. The difference is that the AuxFilesDirectory is subject to frequent, continuous writes, so we must not put anything big in that structure because the frequency of writes will lead to repetition of that big object.

Yes, that is true. The size of each hash map element is only 9 bytes. But still, the storage size for creating 10k relations is 4.4GB.
I do not think that is unacceptable, but it is 1000x larger than the size of the pgstat file, which is written only once at compute shutdown.

Member

@skyzh skyzh left a comment


Generally LGTM for the storage part of the change, but for now, AUX file v2 is not enabled for all regions, so I don't plan to get this merged for now, unless there could be a flag to control whether or not to enable this feature.

I see the Postgres side of the pull request writes pg_stat data during the shutdown checkpoint, so I assume that's not a lot of write traffic and the pageserver should be able to handle it relatively easily. Please correct me if I'm wrong :)

Also, I didn't see pg_stat-related files getting encoded in the aux file v2 mapping. What are the paths of the files produced by pg_stat? We probably need to assign a prefix ID to them (see encode_aux_file_key in aux_file.rs).

@knizhnik
Contributor Author

Generally LGTM for the storage part of the change, but for now, AUX file v2 is not enabled for all regions, so I don't plan to get this merged for now, unless there could be a flag to control whether or not to enable this feature.

I see the Postgres side of the pull request writes pg_stat data during the shutdown checkpoint, so I assume that's not a lot of write traffic and the pageserver should be able to handle it relatively easily. Please correct me if I'm wrong :)

Yes, pgstat data is written on the shutdown checkpoint.
The normal size of this file is about 64kb.
For 10k relations it is 4MB.
As computes are suspended relatively rarely, I also do not expect any significant extra load on the PS.

Also, I didn't see pg_stat-related files getting encoded in the aux file v2 mapping. What are the paths of the files produced by pg_stat? We probably need to assign a prefix ID to them (see encode_aux_file_key in aux_file.rs).

Sorry, I missed it.
Fixed.
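For readers unfamiliar with the mapping being discussed, a prefix-ID scheme of the kind encode_aux_file_key implements could look roughly like this — the IDs and paths below are invented for illustration and are not the real values in aux_file.rs:

```rust
/// Hypothetical mapping from known aux file path prefixes to compact
/// numeric prefix IDs, so keys can be encoded in fixed space.
fn prefix_id(path: &str) -> u8 {
    if path.starts_with("pg_logical/") {
        0x01
    } else if path.starts_with("pg_replslot/") {
        0x02
    } else if path.starts_with("pg_stat/") {
        0x03 // a new prefix ID assigned for pgstat files
    } else {
        0xFF // unrecognized paths fall back to a catch-all prefix
    }
}
```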

@knizhnik knizhnik force-pushed the persist_pgstat_file branch 3 times, most recently from 229076b to ddeb08a Compare June 28, 2024 13:17
@knizhnik knizhnik force-pushed the persist_pgstat_file branch 2 times, most recently from f6b620b to bf5c7fb Compare June 29, 2024 08:34
skyzh added a commit that referenced this pull request Jul 1, 2024
Extracted from #6560, currently
we include multiple copies of aux files in the basebackup.

## Summary of changes

Fix the loop.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
VladLazar pushed a commit that referenced this pull request Jul 8, 2024
Extracted from #6560, currently
we include multiple copies of aux files in the basebackup.

## Summary of changes

Fix the loop.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
5 participants