Add database pooling using Pgbouncer #2645
Codecov Report

```
@@           Coverage Diff           @@
##            master    #2645  +/-  ##
========================================
  Coverage   100.00%  100.00%
========================================
  Files          189      189
  Lines         3473     3473
  Branches       960      960
========================================
  Hits          3473     3473
```
Thanks for the very detailed document 👍
For #2639
The problem
Since our app is serverless, multiple Prisma instances can be running at any time, depending on function invocations by Vercel. Each Prisma instance holds its own Postgres connection, so all available connections to the database can end up occupied, and as traffic increases this becomes more of a problem. If the maximum number of database connections is occupied, new Prisma instances will be unable to connect to the database, and the requests those instances were created for will fail. The Postgres connection limit can be raised, but that is a stopgap and not ideal, because Postgres connections are expensive. The typical solution is to place a connection pooler between the database and the serverless app. This PR documents the process, the performance testing and recommendation, and the change required to make the app use the pooler.
Pooling 101
Pgbouncer offers three pooling modes; this write-up is specific to transaction pooling.
Pgbouncer dynamically maps a connection between the client (here, the app on Vercel) and itself to a connection between itself and Postgres. A Postgres connection is much heavier to set up, tear down, and keep open than a connection to Pgbouncer, so we would prefer to have more concurrent Pgbouncer (pgb) connections than Postgres (pg, db) connections.

Pgb maintains a pool of connections to the db that it opens and closes as needed. These connections are identified by username and database. If I request a connection as user `prod_user` to the database `prod_db`, pgb checks whether it already has a connection open to pg with those parameters. If such a connection is open and not currently in use, my request is served through that connection; otherwise pgb spawns a new connection to the db and uses that. If the max number of connections to pg are in use (20 for now), pgb puts the request in a queue and fulfils it when a connection frees up; this way we won't face exhaustion of db connections.

In transaction pooling mode, a db connection is held by a client only during a transaction. As soon as the transaction finishes, the connection can be returned to the pool by Pgbouncer and reused. Prisma wraps queries in transactions; the following scenario explains the advantage of pgb in the case where N function invocations happen in a short while but not necessarily concurrently.
Without pooling, the N invocations open up to `min(max_connections, N)` database connections; connections past `max_connections` (100 right now, but fewer in practice due to other services using the db as well) will fail. With pooling, pgb uses at most `pool_size` db connections (currently 20). If the requests were truly concurrent, up to `pool_size` connections will be used and the other requests will be put in a queue. If the requests are not concurrent (the typical case), fewer db connections will be used, as they'll be freed up before new transactions are requested. Even if all the connections happen in parallel, Pgbouncer will be able to queue them and fulfil them eventually, whereas Postgres would reject connections.
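As a back-of-envelope sketch of the scenario above (hypothetical helper names for illustration, not Pgbouncer internals):

```typescript
// Without pooling: each of N invocations wants its own Postgres connection;
// anything past max_connections is rejected by Postgres.
function directConnections(n: number, maxConnections: number): number {
  return Math.min(n, maxConnections);
}

// With transaction pooling: at most pool_size server connections are used;
// excess clients wait in Pgbouncer's queue instead of failing.
function pooledConnections(n: number, poolSize: number): number {
  return Math.min(n, poolSize);
}

console.log(directConnections(150, 100)); // 100 open; the other 50 invocations fail
console.log(pooledConnections(150, 20)); // 20 open; the rest are queued, none fail
```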
Installing and configuring Pgbouncer
```
sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
apt-get update
apt-get install pgbouncer
su - postgres
```

- Perform further actions using the postgres user.
- Place the configuration in `/etc/pgbouncer` (configuration given below).
- Run `pgbouncer -d /etc/pgbouncer/pgbouncer.ini` as a daemon (runs in the background).

Pgbouncer configuration
Relevant options in `pgbouncer.ini`:

- `[databases]` section: `db = host=/var/lib/postgresql dbname=db auth_user=db_user pool_mode=transaction` (configure access)
- `listen_addr = *` (to enable connections from outside)
- `auth_file = /etc/pgbouncer/userlist.txt` (list of authorised users)
- `admin_users = db_user` (users that can access pgbouncer's virtual management database)
- `max_client_conn = 200` (max number of clients pgbouncer will allow to connect, currently 200; not the same as connections to Postgres, and can be increased if needed with far fewer resources than raising the Postgres connection limit)
- `default_pool_size = 20` (max number of concurrent connections to Postgres that pgbouncer will use; requests beyond this are queued and processed as connections free up)
- `min_pool_size = 5` (number of connections kept in the pool even without traffic, useful for responding quickly after a quiet period since some open connections will always be available)

`userlist.txt`:

```
"db_user" "md5c6eff08c433a190ac15326e048403089"
```
Performance testing
I temporarily created a copy of the production database and a copy of the site deployment to load test and measure the effect Pgbouncer has. I used k6 (k6.io) to perform the load testing; the script used is available in the engineering channel. The script performs the following steps:

k6 runs copies of the script in parallel in a loop for a given period of time. For this test I ran 5, 10 ... 100 copies in parallel for 1 minute, with and without Pgbouncer, with the following results. Note that Vercel doesn't always recreate Prisma instances: a running function can be reused (a warm invocation) if it is called again while Vercel has a pre-existing running instance of it, in which case the existing Prisma instance is also reused rather than a new one created. From observation, Postgres times out connections after 10 minutes, so if multiple connection requests are made within a 10-minute window, the number of open connections will keep growing.
Testing regime:
I have performed two kinds of tests
Performance testing results
Pooling/pgbouncer adds overhead to the request time/throughput of the app, so pooling is not without cons. Pgbouncer is a single-threaded app, so our small server with its limited vCPUs is not the ideal environment for it, but it is still useful. If the number of connections starts to approach the point where database connections get exhausted, pooling will become necessary; we can wait to see whether there is a need before enabling pooling (by setting the `USE_POOLED_DB` env var).

The effect on throughput is greater under constant heavy load (atypical for us), where requests are made at a constant but high pace. While raw database connections will serve more requests per second in this case, the requests will start to fail once the connections are exhausted. Pgbouncer reduces requests per second by 30% to 80%, but when connections surpass the db's capacity, requests simply take longer to fulfil (waiting in Pgbouncer's queue) instead of failing. Under bursty load, Pgbouncer catches up to raw connections in terms of performance. While the average response time and throughput do diminish under load with Pgbouncer, the median response time is similar for both; because of the similar median performance, we should not notice a large difference when deployed.

While 100 concurrent requests is a lot of traffic, you do not need 100 concurrent requests constantly to reach 100 open db connections. Since db connections live for some time, it is possible to exhaust database connections with far less traffic due to the serverless nature of our app: a search-engine bot crawling the website, or a sudden burst of traffic due to a study meet, can exhaust the db connections, and requests will start failing.
Some examples of the current deployment's tendencies:
These observations show that even with our current amount of traffic, different patterns of traffic can exhaust the connection limit. A serverless deployment makes database connection pooling necessary.
Performance testing data
Prisma specific change
While Prisma can work with Pgbouncer in transaction mode to make queries, it cannot perform migrations through Pgbouncer in transaction mode, so a direct connection to the database is needed during migration; this PR adds that. Prisma is given a database connection URL in the following places:
c0d3-app/prisma/schema.prisma
Line 7 in d8a1aca
c0d3-app/scripts/prismaUtils.js
Line 66 in d8a1aca
c0d3-app/prisma/index.ts
Line 11 in d8a1aca
This PR modifies the third of these: when the app is running on Vercel in the production environment and the env var to use pooling (`USE_POOLED_DB`) is set, the URL is switched out for the Pgbouncer URL; in all other cases (e.g. preview deployment, local development, migration, etc.) a direct connection to the database is made.

The alternative would be to use Pgbouncer's session pooling mode, but that would negate the benefit of pooling for us. The problem faced right now is that when requests happen rapidly (e.g. due to higher traffic), many database connections are made that only slowly time out and disconnect; the timeout is around 10 minutes. This lag leads to a buildup of simultaneously open connections. In session pooling mode there would be a 1-to-1 mapping between Pgbouncer and Postgres connections, which would not alleviate the connection exhaustion problem, so the more aggressive transaction pooling is needed.
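The URL-switching logic can be sketched as follows. `USE_POOLED_DB` is from this PR; `VERCEL_ENV`, `DATABASE_URL`, and `POOLED_DATABASE_URL` are assumed names for illustration and may differ from the PR's actual code:

```typescript
// Sketch of selecting the connection URL, in the spirit of prisma/index.ts.
// Env var names other than USE_POOLED_DB are illustrative assumptions.
type Env = Record<string, string | undefined>;

function databaseUrl(env: Env): string | undefined {
  const onVercelProd = env.VERCEL_ENV === "production";
  const usePool = env.USE_POOLED_DB === "true";
  // Only production deployments with pooling enabled go through Pgbouncer;
  // preview deploys, local dev, and migrations use a direct connection.
  return onVercelProd && usePool ? env.POOLED_DATABASE_URL : env.DATABASE_URL;
}

console.log(
  databaseUrl({
    VERCEL_ENV: "production",
    USE_POOLED_DB: "true",
    DATABASE_URL: "postgres://db-host:5432/app",
    POOLED_DATABASE_URL: "postgres://pgbouncer-host:6432/app",
  })
); // -> postgres://pgbouncer-host:6432/app
```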