
Add database pooling using Pgbouncer #2645

Merged
merged 4 commits into from
Dec 27, 2022

Conversation

JasirZaeem (Member) commented Dec 25, 2022

For #2639

The problem

Since our app is serverless, there can be multiple instances of prisma running at any time, depending on function invocations by vercel. Each prisma instance holds its own Postgres connection, so traffic can occupy every available connection to the database, and this problem will grow as traffic increases. Once the maximum number of database connections is in use, new prisma instances cannot connect, and the requests those instances were created for fail. The Postgres connection limit can be raised, but that is a stopgap and not ideal, because Postgres connections are expensive. The typical solution is to place a connection pooler between the database and the serverless app. This PR documents the process, the performance testing and recommendation, and the change required to make the app use the pooler.

Pooling 101

Pgbouncer offers three pooling modes; this write-up is specific to transaction pooling.

Pgbouncer dynamically maps a connection between the client (here, the app on vercel) and itself to a connection between itself and Postgres. A Postgres connection is much heavier to set up, tear down, and keep open than a connection to pgbouncer, so we would prefer to have more concurrent pgbouncer (pgb) connections than Postgres (pg, db) connections.

(Figure: pooling_diag) Database connections are preferably fewer than pgbouncer connections.

Pgb maintains a pool of connections to the db that it opens and closes as needed. These connections are identified by username and database. If I request a connection as user prod_user to the database prod_db, pgb checks whether it already has a connection open to pg with those parameters. If such a connection is open and not currently in use, my request is communicated to the db through that connection; otherwise pgb spawns a new connection to the db and uses that. If the maximum number of connections to pg is in use (20 for now), pgb puts the request in a queue and fulfils it when a connection frees up. This way we won't face exhaustion of db connections.

In transaction pooling mode, a db connection is only held by a client for the duration of a transaction. As soon as the transaction finishes, the connection is put back in the pool by pgbouncer and can be reused. Prisma wraps queries in transactions; the following scenario shows the advantage of pgb when N function invocations happen in a short while but not necessarily concurrently.

  • Without pgbouncer
    • Vercel invokes a function N times, one after the other, in a short interval.
    • N instances of prisma are created.
    • Each instance creates a connection to the database while the previous connections are still open.
    • N connections to Postgres are created to fulfil the queries, even though the queries were not concurrent.
  • DB connections used: min(max_connections, N). Connections past max_connections (100 right now, but fewer in practice because other services use the db as well) will fail.
(Figure: pooling-1) Without pgbouncer, 5 sequential transactions happen but 5 database connections are used.
  • With pgbouncer
    • Vercel invokes a function N times, one after the other, in a short interval.
    • N instances of prisma are created.
    • Each instance creates a connection to pgbouncer.
    • Assuming the requests were not concurrent, pgbouncer reuses the database connection it already has.
  • DB connections used: as few as 1, at most pool_size (currently 20). If the requests were truly concurrent, up to pool_size connections will be used and further requests will be queued. If the requests are not concurrent (the typical case), fewer db connections will be used, as they are freed up before new transactions are requested.
(Figure: pooling-1-pgb) With pgbouncer, 5 sequential transactions happen and only 1 database connection is used (5 connections to pgbouncer are used, which are far less resource intensive).

Even if all the connections happen in parallel, pgbouncer will queue them and fulfil them eventually, whereas Postgres will reject connections.
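The scenario above can be sketched as a toy model (illustrative only, not pgbouncer's actual implementation): a transaction "borrows" a backend connection and returns it immediately afterwards, so sequential transactions share one Postgres connection.

```typescript
// Toy model of transaction pooling. Not pgbouncer's code; it only illustrates
// why 5 sequential transactions need just 1 backend (Postgres) connection.
class TransactionPool {
  private idle: number[] = []; // ids of idle backend connections
  private opened = 0;          // total backend connections ever opened
  constructor(private poolSize: number) {}

  // Borrow a backend connection for the duration of one transaction.
  runTransaction<T>(work: (conn: number) => T): T {
    let conn = this.idle.pop();
    if (conn === undefined) {
      if (this.opened >= this.poolSize) {
        // A real pooler queues the client here instead of failing.
        throw new Error("pool exhausted");
      }
      conn = this.opened++; // open a new backend connection
    }
    try {
      return work(conn);
    } finally {
      this.idle.push(conn); // transaction done: connection back to the pool
    }
  }

  get backendConnections(): number {
    return this.opened;
  }
}

const pool = new TransactionPool(20); // pool_size = 20, as in our config
// 5 sequential transactions, as in the diagram above:
for (let i = 0; i < 5; i++) {
  pool.runTransaction(() => {
    /* run a query */
  });
}
console.log(pool.backendConnections); // 1 backend connection, not 5
```

Without pooling, each of the 5 clients would have opened its own Postgres connection; here the single connection is reused because each transaction releases it before the next one starts.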

Installing and configuring Pgbouncer

  • Add the Postgres apt repository to install an up-to-date pgbouncer: `sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'`
  • `wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -`
  • `apt-get update`
  • `apt-get install pgbouncer`
  • `su - postgres` to perform further actions as the postgres user
  • Configure pgbouncer.ini and userlist.txt in /etc/pgbouncer (configuration given below)
  • Start pgbouncer as a daemon (runs in the background): `pgbouncer -d /etc/pgbouncer/pgbouncer.ini`
  • You can connect to the pgbouncer virtual database to manage pgbouncer

Pgbouncer configuration

  • Relevant options in pgbouncer.ini

    • under [databases] configure access
      • e.g. db = host=/var/lib/postgresql dbname=db auth_user=db_user pool_mode=transaction
    • under [pgbouncer]
      • listen_addr = * (to enable connections from outside)
      • auth_file = /etc/pgbouncer/userlist.txt (list of authorised users)
      • admin_users = db_user (users that can access the virtual management database, pgbouncer)
      • max_client_conn = 200 (max number of clients pgbouncer will allow to connect, currently 200; not the same as connections to Postgres, and can be increased if needed with far fewer resource requirements than raising the Postgres connection limit)
      • default_pool_size = 20 (maximum number of concurrent connections to Postgres that pgbouncer will use; requests beyond this are queued and processed when connections free up)
      • min_pool_size = 5 (number of connections kept in the pool even without traffic; useful for responding quickly after a lull, since some open connections are always in the pool)
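Put together, the options above correspond to a pgbouncer.ini along these lines (a sketch assembled from the values listed above; the socket path and names come from the example entry and are illustrative):

```ini
[databases]
db = host=/var/lib/postgresql dbname=db auth_user=db_user pool_mode=transaction

[pgbouncer]
listen_addr = *
auth_file = /etc/pgbouncer/userlist.txt
admin_users = db_user
max_client_conn = 200
default_pool_size = 20
min_pool_size = 5
```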
  • userlist.txt

    • each line contains a pair of user and credential
    • e.g. "db_user" "credential"
    • Multiple formats for the credential are supported, including a plain password.
    • The format used in production is:
      • "md5" + md5(password + user)
      • if the password is db_password, the credential is
      • "md5" + md5("db_passworddb_user") -> "md5" + "c6eff08c433a190ac15326e048403089"
      • the entry becomes -> "db_user" "db_user" "md5c6eff08c433a190ac15326e048403089"
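A userlist.txt entry in this format can be generated with Node's built-in crypto module (a small helper sketch; `userlistEntry` is not part of the codebase, and the user/password values are the placeholders from the example above):

```typescript
import { createHash } from "crypto";

// Build a pgbouncer userlist.txt entry: "md5" + md5(password + username),
// wrapped in the quoted user/credential pair pgbouncer expects.
function userlistEntry(user: string, password: string): string {
  const credential =
    "md5" + createHash("md5").update(password + user).digest("hex");
  return `"${user}" "${credential}"`;
}

console.log(userlistEntry("db_user", "db_password"));
```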

Performance testing

I have temporarily created a copy of the production database and a copy of the site deployment, to load test and measure the effect pgbouncer has. I used k6 (k6.io) to perform the load testing; the script used for testing is available in the engineering channel. The script performs the following steps:

  1. Get a csrf token from the app
  2. Use the csrf token and given username/password to login
  3. Use the returned session token cookie to make a request to the session endpoint
  4. Request the curriculum page

K6 runs copies of the script in parallel, in a loop, for a given period of time. For this test I ran 5, 10 ... 100 copies in parallel for 1 minute, with and without pgbouncer, with the following results. Note that vercel does not always recreate prisma instances: a running function can be reused (a warm invocation) if it is called again while vercel has a pre-existing running instance of it, in which case the existing prisma instance is reused rather than a new one created. From observation, Postgres times out connections after 10 minutes, so if multiple connection requests are made within 10 minutes, the number of open connections grows.
Testing regime:

  1. Run the virtual users.
  2. Sleep for some time and run again; repeat 1 and 2 until the 1 minute is up.
  3. After one set of runs, end all lingering connections to Postgres before starting again.
  4. Repeat 1 to 3 for 5, 10, 15, ... 100 simultaneous workers.

I have performed two kinds of tests

  • A load test: workers are executed with 10-second breaks in between, to simulate a more typical bursty load.
  • A stress test: workers are executed with 1-second breaks in between, to simulate a constant heavy load (not typical for us).

Performance testing results

Pooling with pgbouncer adds overhead to the request time/throughput of the app, so it is not without cons. Pgbouncer is a single-threaded app, so our small server with few vCPUs is not the ideal environment for it, but it is still useful. If the number of connections starts to approach the point where database connections get exhausted, pooling becomes necessary; we can wait to see whether the need arises before enabling pooling (by setting the USE_POOLED_DB env var). The effect on throughput is greater under constant heavy load (atypical for us), where requests arrive at a constant but high pace. Raw database connections serve more requests per second in that case, but requests start to fail once connections are exhausted. Pgbouncer reduces requests per second by 30% to 80%, but when connections surpass the db's capacity, requests simply take longer to fulfil (waiting in pgbouncer's queue) instead of failing. Under bursty load, pgbouncer catches up to raw connections in performance. While the average response time and throughput do diminish under load with pgbouncer, the median response time is similar for both; because of that similar median performance we should not notice a large difference when deployed.

While 100 concurrent requests is a lot of traffic, you do not need a constant 100 concurrent requests to reach 100 open db connections. Since db connections live for some time, it is possible to exhaust them with far less traffic, due to the serverless nature of our app: a search engine bot crawling the website, or a sudden burst of traffic due to a study meet, can exhaust the db connections, and requests will start failing.
Some examples of the current deployment's tendencies:

  1. Opening the home page without being logged in can result in 3-4 database connections.
  2. Opening the curriculum page while logged in can result in up to 10 open database connections.
  3. Opening the curriculum page while logged in as an admin can result in up to 20 open connections.

These observations show that even with our current amount of traffic, different traffic patterns can exhaust the connection limit. A serverless deployment makes database connection pooling necessary.

Performance testing data

| Stress Testing (1 second between iterations) | Load Testing (10 seconds between iterations) |
| --- | --- |
| Median response time is similar with pooling, while preserving reliability | Median response time is similar with pooling, while preserving reliability |
| Average response time/throughput decreases with pooling, but requests don't fail with more load | Pooling overhead is less significant with bursty load |
| Requests start to fail without pooling as load increases; this does not happen with pooling | Requests fail earlier with bursty load; this does not happen with pooling |
| Raw connections provide greater throughput, but start failing. Pooling performance can be improved by letting pgbouncer open more connections to the database; the red line is with a 50 db connection limit at peak load | With more time between iterations in bursty load, pooling is able to catch up |
| Many more connections to Postgres are used without pooling, until no more are possible. Pgbouncer caps the connections to the database while letting more prisma instances from vercel connect to the server, holding extra requests in a queue | Many more connections to Postgres are used without pooling, until no more are possible. Pgbouncer caps the connections to the database while letting more prisma instances from vercel connect to the server, holding extra requests in a queue |

(The charts themselves are attached as images in the PR.)

Prisma specific change

While prisma can work with pgbouncer in transaction mode to make queries, it cannot perform migrations through pgbouncer in transaction mode, so a direct connection to the database is needed during migrations; this PR adds that. Prisma is given a database connection url in the following places:

  1. In prisma schema, used when actions like prisma deploy, prisma migrate happen
    url = env("DB_URL")
  2. In prismaUtils DB_URL is written to file which is then used by prisma, used during build phase
    writeFileSync(file, `DB_URL=${url}`, 'utf-8')
  3. In prisma/index.ts , this is used when the app is running
    url: `postgresql://${process.env.DB_USER}:${process.env.DB_PW}@${process.env.DB_HOST}:${process.env.DB_PORT}/${process.env.DB_NAME}?connection_limit=1`

This PR modifies (3): when the app is running on vercel in the production environment and the env var to use pooling (USE_POOLED_DB) is set, the url is switched to the pgbouncer url; in all other cases (e.g. preview deployments, local development, migrations) a direct connection to the database is made.
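The switch can be sketched roughly as follows. USE_POOLED_DB and the DB_* variables appear in this PR; POOLED_DB_HOST/POOLED_DB_PORT are placeholder names for wherever pgbouncer listens, and using VERCEL_ENV for the production check is an assumption about how the environment is detected, not necessarily what the PR does:

```typescript
// Sketch of the connection-url selection in prisma/index.ts.
// POOLED_DB_HOST/POOLED_DB_PORT are hypothetical variable names.
function databaseUrl(env: Record<string, string | undefined>): string {
  const usePooled =
    env.VERCEL_ENV === "production" && Boolean(env.USE_POOLED_DB);
  // Pooled connections go to pgbouncer; everything else goes straight to pg.
  const host = usePooled ? env.POOLED_DB_HOST : env.DB_HOST;
  const port = usePooled ? env.POOLED_DB_PORT : env.DB_PORT;
  // connection_limit=1: each serverless prisma instance keeps one connection.
  return `postgresql://${env.DB_USER}:${env.DB_PW}@${host}:${port}/${env.DB_NAME}?connection_limit=1`;
}
```

Keeping the selection in one function means migrations and local development never see the pooler: they simply run without USE_POOLED_DB set.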

The alternative would be pgbouncer's session pooling mode, but that would negate the benefit of pooling for us. The problem right now is that when requests happen rapidly (e.g. due to higher traffic), many database connections are made that only slowly time out and disconnect (the timeout is around 10 minutes); this lag leads to a buildup of simultaneously open connections. In session pooling mode there would be a 1-to-1 mapping between pgbouncer connections and Postgres connections, which would not alleviate the connection exhaustion problem, so the more aggressive transaction pooling is needed.

@vercel

vercel bot commented Dec 25, 2022

@JasirZaeem is attempting to deploy a commit to the c0d3-prod Team on Vercel.

A member of the Team first needs to authorize it.

@codecov

codecov bot commented Dec 25, 2022

Codecov Report

Merging #2645 (45b571c) into master (1e5d4c0) will not change coverage.
The diff coverage is n/a.

❗ Current head 45b571c differs from pull request most recent head 908194d. Consider uploading reports for the commit 908194d to get more accurate results


@@            Coverage Diff            @@
##            master     #2645   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          189       189           
  Lines         3473      3473           
  Branches       960       960           
=========================================
  Hits          3473      3473           

@vercel

vercel bot commented Dec 25, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

c0d3-app — ✅ Ready — Dec 27, 2022 at 11:21AM (UTC)

flacial (Member) left a comment:
Thanks for the very detailed document 👍

@JasirZaeem JasirZaeem merged commit 9065dce into garageScript:master Dec 27, 2022
@JasirZaeem JasirZaeem deleted the add-pooled-db-url branch December 27, 2022 11:22
@flacial flacial changed the title Add databse pooling using Pgbouncer Add database pooling using Pgbouncer Jan 1, 2023