
Refactor stats to use atomics #375

Merged: 5 commits from atomic-stats into postgresml:main on Mar 28, 2023

Conversation

@magec (Collaborator) commented Mar 20, 2023

When we are dealing with a high number of connections, the generated stats cannot be consumed fast enough by the stats collector loop. This makes the stats subsystem inconsistent, and a lot of warning messages are thrown due to unregistered servers/clients.

This change refactors the stats subsystem so it uses atomics:

  • Counters are now handled using u64 atomics (AtomicU64).
  • The event system is dropped; averages are calculated by a loop that runs every 15 seconds.
  • Instead of snapshots being generated every second, we now keep track of the servers/clients that have registered. Each pool/server/client has its own instance of the counters and updates them directly, instead of emitting an event that gets processed later (a minimal sketch of the idea follows below).
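A minimal sketch of the idea, assuming Tokio for the background loop; the names here (`ClientStats`, `record_query`, `averages_loop`) are illustrative and not the actual PgCat identifiers:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

// Hypothetical per-client counter block: every field is a plain AtomicU64,
// so the client bumps it directly instead of emitting an event.
#[derive(Default)]
pub struct ClientStats {
    pub query_count: AtomicU64,
    pub bytes_sent: AtomicU64,
    pub total_query_time_us: AtomicU64,
}

impl ClientStats {
    pub fn record_query(&self, duration_us: u64, bytes: u64) {
        // Relaxed ordering is enough for monotonically increasing counters
        // that are only read for reporting.
        self.query_count.fetch_add(1, Ordering::Relaxed);
        self.bytes_sent.fetch_add(bytes, Ordering::Relaxed);
        self.total_query_time_us.fetch_add(duration_us, Ordering::Relaxed);
    }
}

// A background task derives averages every 15 seconds from the raw counters,
// replacing the old per-event channel.
pub async fn averages_loop(stats: Arc<ClientStats>) {
    let mut interval = tokio::time::interval(Duration::from_secs(15));
    let (mut last_count, mut last_time) = (0u64, 0u64);
    loop {
        interval.tick().await;
        let count = stats.query_count.load(Ordering::Relaxed);
        let time = stats.total_query_time_us.load(Ordering::Relaxed);
        let queries = count.saturating_sub(last_count);
        let avg_us = if queries > 0 {
            time.saturating_sub(last_time) / queries
        } else {
            0
        };
        println!("avg query time over last 15s: {} us ({} queries)", avg_us, queries);
        last_count = count;
        last_time = time;
    }
}
```

With a layout like this, the hot path is a handful of `fetch_add` calls rather than a channel send, which is what removes the collector-loop bottleneck described above.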

Note that the system now has some contention when registering/deregistering clients, since we need to write to a hash map behind a lock. Given that this only happens on registration/deregistration, I assumed performance won't be compromised, especially since we no longer copy snapshots every second (see the registry sketch below).
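To make the contention point concrete, here is a hedged sketch of that kind of registry, reusing the illustrative `ClientStats` type from the sketch above; `ClientRegistry`, `register` and `deregister` are likewise made-up names, not necessarily what this PR uses:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical registry: the write lock is only taken when a client connects
// or disconnects, never on the per-query hot path.
#[derive(Default)]
pub struct ClientRegistry {
    clients: RwLock<HashMap<i32, Arc<ClientStats>>>,
}

impl ClientRegistry {
    pub fn register(&self, client_id: i32) -> Arc<ClientStats> {
        let stats = Arc::new(ClientStats::default());
        self.clients.write().unwrap().insert(client_id, stats.clone());
        stats
    }

    pub fn deregister(&self, client_id: i32) {
        self.clients.write().unwrap().remove(&client_id);
    }

    // The 15-second reporting loop only needs a read lock to walk the
    // currently registered clients.
    pub fn client_count(&self) -> usize {
        self.clients.read().unwrap().len()
    }
}
```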

Test coverage was good enough so I didn't add new ruby tests here.

I had to modify one test because maxwait is no longer cleared every second. Do we still need that behavior?

@magec requested a review from levkk on March 20, 2023 13:28
@levkk requested a review from drdrsh on March 20, 2023 15:27
Review thread on src/stats/address.rs (outdated, resolved)
@magec requested a review from drdrsh on March 23, 2023 09:53
@levkk (Contributor) commented Mar 23, 2023

Hey @magec , the PR looks amazing, thank you so much for writing it. I've tagged @drdrsh to review it because he's running PgCat in production at Instacart, so I wanted to make sure that the changes looked good to him as well.

On another note, I think it would be great for us 3 (the main production users of PgCat that I know of) to open up a more direct communication channel (e.g. Slack, Discord, whatever), so we can:

  1. share knowledge and battle stories
  2. talk through bugs, fixes, features
  3. share test plans, results

What do you think?

@levkk (Contributor) left a comment

Good to merge! Let me know what the production metrics look like! Our main concerns are the speed of atomics vs. Tokio channels and the accuracy of the new stats implementation (it took us a few tries to get the stats right the first time, just because our understanding of them differed from PgBouncer's, for which we were writing a drop-in replacement).

@magec (Collaborator, Author) commented Mar 24, 2023

Hey! Yes, another communication channel would be great; Slack works for me. As for the PR, I have noticed that we are leaking connections in cl_idle. I think this was already happening before, but I don't really know; yesterday I disconnected everything abruptly and the stats kept saying that I had several clients connected (several as in 20k). I will try to reproduce and fix it. The rest works like a charm and I have no issues with load anymore.

Review thread on src/client.rs (resolved)
@levkk (Contributor) commented Mar 27, 2023

> Hey! Yes, another communication channel would be great; Slack works for me. […]

@magec So we ended up settling on Discord for our public chat platform. You can join here: https://discord.com/invite/DmyJP3qJ7U (you can also invite anyone you want, it's public and free for anyone). There is a channel called #pgcat that's been pretty idle, but hopefully we can get it going!

@magec (Collaborator, Author) commented Mar 28, 2023

So, in the end I was testing badly 🤦. In the test I spawn a psql process that connects, then kill it and check the cl_idle counter. I do it this way because with Ruby's pg gem I cannot close the socket without closing the session (which is what I want to test).

Funny thing (I didn't know this): psql also forks, and the socket is opened in the child, so I was killing the parent and the socket didn't disconnect because the child was left running. Now I use pkill to kill both (I add a random string to the command so I don't kill more than I intend).

After that, I could see that I do receive the event of the client disconnecting, and I could pinpoint the issue, which in the end was only stats-related.

That said, this is ready to be merged.

@magec magec merged commit 58ce76d into postgresml:main Mar 28, 2023
@magec magec deleted the atomic-stats branch March 28, 2023 15:19