
Add database pooling using Pgbouncer #2645

Merged
merged 4 commits into from
Dec 27, 2022

Conversation

JasirZaeem (Member) commented Dec 25, 2022

For #2639

The problem

Since our app is serverless, there can be multiple instances of prisma running at any time, depending on function invocations by vercel. Each prisma instance holds its own Postgres connection, so traffic can occupy every available connection to the database, and this problem will grow as traffic increases. Once the maximum number of database connections is in use, new prisma instances cannot connect, and the requests those instances were created for fail. The Postgres connection limit can be raised, but that is a stopgap and not ideal, because Postgres connections are expensive. The typical solution is to place a connection pooler between the database and the serverless app. This PR documents the process, the performance testing and recommendation, and the change required to make the app use the pooler.

Pooling 101

Pgbouncer offers three pooling modes; this write-up is specific to transaction pooling.

Pgbouncer dynamically maps a connection between the client (here, the app on vercel) and itself to a connection between itself and Postgres. A Postgres connection is much heavier to set up, tear down, and keep open than a connection to pgbouncer, so we would prefer to have more concurrent pgbouncer (pgb) connections than Postgres (pg, db) connections.

(Figure: pooling_diag) Database connections are preferably fewer than pgbouncer connections.

Pgb maintains a pool of connections to the db that it opens and closes as needed. These connections are identified by username and database. If I request a connection as user prod_user to the database prod_db, pgb checks whether it already has a connection open to pg with those parameters. If such a connection is open and not currently in use, my request is communicated to the db through that connection; otherwise pgb spawns a new connection to the db and uses that. If the maximum number of connections to pg is in use (20 for now), pgb puts the request in a queue and fulfils it when a connection frees up. This way we won't face exhaustion of db connections.

In transaction pooling mode, a db connection is only held by a client for the duration of a transaction. As soon as the transaction finishes, the connection is put back in the pool by pgbouncer and can be reused. Prisma wraps queries in transactions; the following scenario shows the advantage of pgb when N function invocations happen in a short while but not necessarily concurrently.

  • Without pgbouncer
    • Vercel invokes a function N times, one after the other, in a short interval.
    • N instances of prisma are created.
    • Each instance creates a connection to the database while the previous connections are still open.
    • N connections to Postgres are created to fulfil the queries, even though the queries were not concurrent.
  • DB connections used: min(max_connections, N). Connections past max_connections (100 right now, but fewer in practice because other services use the db as well) will fail.
(Figure: pooling-1) Without pgbouncer, 5 sequential transactions happen but 5 database connections are used.
  • With pgbouncer
    • Vercel invokes a function N times, one after the other, in a short interval.
    • N instances of prisma are created.
    • Each instance creates a connection to pgbouncer.
    • Assuming the requests were not concurrent, pgbouncer reuses the database connection it already has.
  • DB connections used: as few as 1, at most pool_size (currently 20). If the requests were truly concurrent, up to pool_size connections will be used and further requests will be queued. If the requests are not concurrent (the typical case), fewer db connections will be used, as they are freed up before new transactions are requested.
(Figure: pooling-1-pgb) With pgbouncer, 5 sequential transactions happen and only 1 database connection is used (5 connections to pgbouncer are used, which are far less resource intensive).

Even if all the connections happen in parallel, pgbouncer will queue them and fulfil them eventually, whereas Postgres will reject connections.
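The scenario above can be sketched as a toy model (illustrative only, not pgbouncer's actual implementation): a transaction "borrows" a backend connection and returns it immediately afterwards, so sequential transactions share one Postgres connection.

```typescript
// Toy model of transaction pooling. Not pgbouncer's code; it only illustrates
// why 5 sequential transactions need just 1 backend (Postgres) connection.
class TransactionPool {
  private idle: number[] = []; // ids of idle backend connections
  private opened = 0;          // total backend connections ever opened
  constructor(private poolSize: number) {}

  // Borrow a backend connection for the duration of one transaction.
  runTransaction<T>(work: (conn: number) => T): T {
    let conn = this.idle.pop();
    if (conn === undefined) {
      if (this.opened >= this.poolSize) {
        // A real pooler queues the client here instead of failing.
        throw new Error("pool exhausted");
      }
      conn = this.opened++; // open a new backend connection
    }
    try {
      return work(conn);
    } finally {
      this.idle.push(conn); // transaction done: connection back to the pool
    }
  }

  get backendConnections(): number {
    return this.opened;
  }
}

const pool = new TransactionPool(20); // pool_size = 20, as in our config
// 5 sequential transactions, as in the diagram above:
for (let i = 0; i < 5; i++) {
  pool.runTransaction(() => {
    /* run a query */
  });
}
console.log(pool.backendConnections); // 1 backend connection, not 5
```

Without pooling, each of the 5 clients would have opened its own Postgres connection; here the single connection is reused because each transaction releases it before the next one starts.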

Installing and configuring Pgbouncer

  • Add the Postgres apt repository to install an up-to-date pgbouncer: `sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'`
  • `wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -`
  • `apt-get update`
  • `apt-get install pgbouncer`
  • `su - postgres` to perform further actions as the postgres user
  • Configure pgbouncer.ini and userlist.txt in /etc/pgbouncer (configuration given below)
  • Start pgbouncer as a daemon (runs in the background): `pgbouncer -d /etc/pgbouncer/pgbouncer.ini`
  • You can connect to the pgbouncer virtual database to manage pgbouncer

Pgbouncer configuration

  • Relevant options in pgbouncer.ini

    • under [databases] configure access
      • e.g. db = host=/var/lib/postgresql dbname=db auth_user=db_user pool_mode=transaction
    • under [pgbouncer]
      • listen_addr = * (to enable connections from outside)
      • auth_file = /etc/pgbouncer/userlist.txt (list of authorised users)
      • admin_users = db_user (users that can access the virtual management database, pgbouncer)
      • max_client_conn = 200 (max number of clients pgbouncer will allow to connect, currently 200; not the same as connections to Postgres, and can be increased if needed with far fewer resource requirements than raising the Postgres connection limit)
      • default_pool_size = 20 (maximum number of concurrent connections to Postgres that pgbouncer will use; requests beyond this are queued and processed when connections free up)
      • min_pool_size = 5 (number of connections kept in the pool even without traffic; useful for responding quickly after a lull, since some open connections are always in the pool)
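Put together, the options above correspond to a pgbouncer.ini along these lines (a sketch assembled from the values listed above; the socket path and names come from the example entry and are illustrative):

```ini
[databases]
db = host=/var/lib/postgresql dbname=db auth_user=db_user pool_mode=transaction

[pgbouncer]
listen_addr = *
auth_file = /etc/pgbouncer/userlist.txt
admin_users = db_user
max_client_conn = 200
default_pool_size = 20
min_pool_size = 5
```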
  • userlist.txt

    • each line contains a pair of user and credential
    • e.g. "db_user" "credential"
    • Multiple formats for the credential are supported, including a plain password.
    • The format used in production is:
      • "md5" + md5(password + user)
      • if the password is db_password, the credential is
      • "md5" + md5("db_passworddb_user") -> "md5" + "c6eff08c433a190ac15326e048403089"
      • the entry becomes -> "db_user" "db_user" "md5c6eff08c433a190ac15326e048403089"
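A userlist.txt entry in this format can be generated with Node's built-in crypto module (a small helper sketch; `userlistEntry` is not part of the codebase, and the user/password values are the placeholders from the example above):

```typescript
import { createHash } from "crypto";

// Build a pgbouncer userlist.txt entry: "md5" + md5(password + username),
// wrapped in the quoted user/credential pair pgbouncer expects.
function userlistEntry(user: string, password: string): string {
  const credential =
    "md5" + createHash("md5").update(password + user).digest("hex");
  return `"${user}" "${credential}"`;
}

console.log(userlistEntry("db_user", "db_password"));
```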

Performance testing

I have temporarily created a copy of the production database and a copy of the site deployment, to load test and measure the effect pgbouncer has. I used k6 (k6.io) to perform the load testing; the script used for testing is available in the engineering channel. The script performs the following steps:

  1. Get a csrf token from the app
  2. Use the csrf token and given username/password to login
  3. Use the returned session token cookie to make a request to the session endpoint
  4. Request the curriculum page

K6 runs copies of the script in parallel, in a loop, for a given period of time. For this test I ran 5, 10 ... 100 copies in parallel for 1 minute, with and without pgbouncer, with the following results. Note that vercel does not always recreate prisma instances: a running function can be reused (a warm invocation) if it is called again while vercel has a pre-existing running instance of it, in which case the existing prisma instance is reused rather than a new one created. From observation, Postgres times out connections after 10 minutes, so if multiple connection requests are made within 10 minutes, the number of open connections grows.
Testing regime:

  1. Run the virtual users.
  2. Sleep for some time and run again; repeat 1 and 2 until the 1 minute is up.
  3. After one set of runs, end all lingering connections to Postgres before starting again.
  4. Repeat 1 to 3 for 5, 10, 15, ... 100 simultaneous workers.

I have performed two kinds of tests

  • A load test: workers are executed with 10-second breaks in between, to simulate a more typical bursty load.
  • A stress test: workers are executed with 1-second breaks in between, to simulate a constant heavy load (not typical for us).

Performance testing results

Pooling with pgbouncer adds overhead to the request time/throughput of the app, so it is not without cons. Pgbouncer is a single-threaded app, so our small server with few vCPUs is not the ideal environment for it, but it is still useful. If the number of connections starts to approach the point where database connections get exhausted, pooling becomes necessary; we can wait to see whether the need arises before enabling pooling (by setting the USE_POOLED_DB env var). The effect on throughput is greater under constant heavy load (atypical for us), where requests arrive at a constant but high pace. Raw database connections serve more requests per second in that case, but requests start to fail once connections are exhausted. Pgbouncer reduces requests per second by 30% to 80%, but when connections surpass the db's capacity, requests simply take longer to fulfil (waiting in pgbouncer's queue) instead of failing. Under bursty load, pgbouncer catches up to raw connections in performance. While the average response time and throughput do diminish under load with pgbouncer, the median response time is similar for both; because of that similar median performance we should not notice a large difference when deployed.

While 100 concurrent requests is a lot of traffic, you do not need a constant 100 concurrent requests to reach 100 open db connections. Since db connections live for some time, it is possible to exhaust them with far less traffic, due to the serverless nature of our app: a search engine bot crawling the website, or a sudden burst of traffic due to a study meet, can exhaust the db connections, and requests will start failing.
Some examples of the current deployment's tendencies:

  1. Opening the home page without being logged in can result in 3-4 database connections.
  2. Opening the curriculum page while logged in can result in up to 10 open database connections.
  3. Opening the curriculum page while logged in as an admin can result in up to 20 open connections.

These observations show that even with our current amount of traffic, different traffic patterns can exhaust the connection limit. A serverless deployment makes database connection pooling necessary.

Performance testing data

| Stress Testing (1 second between iterations) | Load Testing (10 seconds between iterations) |
| --- | --- |
| Median response time is similar with pooling, while preserving reliability | Median response time is similar with pooling, while preserving reliability |
| Average response time/throughput decreases with pooling, but requests don't fail with more load | Pooling overhead is less significant with bursty load |
| Requests start to fail without pooling as load increases; this does not happen with pooling | Requests fail earlier with bursty load; this does not happen with pooling |
| Raw connections provide greater throughput, but start failing. Pooling performance can be improved by letting pgbouncer open more connections to the database; the red line is with a 50 db connection limit at peak load | With more time between iterations in bursty load, pooling is able to catch up |
| Many more connections to Postgres are used without pooling, until no more are possible. Pgbouncer caps the connections to the database while letting more prisma instances from vercel connect to the server, holding extra requests in a queue | Many more connections to Postgres are used without pooling, until no more are possible. Pgbouncer caps the connections to the database while letting more prisma instances from vercel connect to the server, holding extra requests in a queue |

(The charts themselves are attached as images in the PR.)

Prisma specific change

While prisma can work with pgbouncer in transaction mode to make queries, it cannot perform migrations through pgbouncer in transaction mode, so a direct connection to the database is needed during migrations; this PR adds that. Prisma is given a database connection url in the following places:

  1. In prisma schema, used when actions like prisma deploy, prisma migrate happen
    url = env("DB_URL")
  2. In prismaUtils DB_URL is written to file which is then used by prisma, used during build phase
    writeFileSync(file, `DB_URL=${url}`, 'utf-8')
  3. In prisma/index.ts , this is used when the app is running
    url: `postgresql://${process.env.DB_USER}:${process.env.DB_PW}@${process.env.DB_HOST}:${process.env.DB_PORT}/${process.env.DB_NAME}?connection_limit=1`

This PR modifies (3): when the app is running on vercel in the production environment and the env var to use pooling (USE_POOLED_DB) is set, the url is switched to the pgbouncer url; in all other cases (e.g. preview deployments, local development, migrations) a direct connection to the database is made.
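The switch can be sketched roughly as follows. USE_POOLED_DB and the DB_* variables appear in this PR; POOLED_DB_HOST/POOLED_DB_PORT are placeholder names for wherever pgbouncer listens, and using VERCEL_ENV for the production check is an assumption about how the environment is detected, not necessarily what the PR does:

```typescript
// Sketch of the connection-url selection in prisma/index.ts.
// POOLED_DB_HOST/POOLED_DB_PORT are hypothetical variable names.
function databaseUrl(env: Record<string, string | undefined>): string {
  const usePooled =
    env.VERCEL_ENV === "production" && Boolean(env.USE_POOLED_DB);
  // Pooled connections go to pgbouncer; everything else goes straight to pg.
  const host = usePooled ? env.POOLED_DB_HOST : env.DB_HOST;
  const port = usePooled ? env.POOLED_DB_PORT : env.DB_PORT;
  // connection_limit=1: each serverless prisma instance keeps one connection.
  return `postgresql://${env.DB_USER}:${env.DB_PW}@${host}:${port}/${env.DB_NAME}?connection_limit=1`;
}
```

Keeping the selection in one function means migrations and local development never see the pooler: they simply run without USE_POOLED_DB set.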

The alternative would be pgbouncer's session pooling mode, but that would negate the benefit of pooling for us. The problem right now is that when requests happen rapidly (e.g. due to higher traffic), many database connections are made that only slowly time out and disconnect (the timeout is around 10 minutes); this lag leads to a buildup of simultaneously open connections. In session pooling mode there would be a 1-to-1 mapping between pgbouncer connections and Postgres connections, which would not alleviate the connection exhaustion problem, so the more aggressive transaction pooling is needed.

@vercel

vercel bot commented Dec 25, 2022

@JasirZaeem is attempting to deploy a commit to the c0d3-prod Team on Vercel.

A member of the Team first needs to authorize it.

@codecov

codecov bot commented Dec 25, 2022

Codecov Report

Merging #2645 (45b571c) into master (1e5d4c0) will not change coverage.
The diff coverage is n/a.

❗ Current head 45b571c differs from pull request most recent head 908194d. Consider uploading reports for the commit 908194d to get more accurate results


@@            Coverage Diff            @@
##            master     #2645   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          189       189           
  Lines         3473      3473           
  Branches       960       960           
=========================================
  Hits          3473      3473           

@vercel

vercel bot commented Dec 25, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

c0d3-app — ✅ Ready — Dec 27, 2022 at 11:21AM (UTC)

flacial (Member) left a comment:
Thanks for the very detailed document 👍

@JasirZaeem JasirZaeem merged commit 9065dce into garageScript:master Dec 27, 2022
@JasirZaeem JasirZaeem deleted the add-pooled-db-url branch December 27, 2022 11:22
@flacial flacial changed the title Add databse pooling using Pgbouncer Add database pooling using Pgbouncer Jan 1, 2023