Incident Response

Angel Rey edited this page Dec 23, 2020 · 1 revision

When you get that call at 3am and you need to jump into the prod environment to save the day, what are the things to know, and how can you go about fixing the system?

Quick Checks

supervisorctl status   # see if any services are stopped or have errors

tail /opt/oddslingers.poker/data/logs/http-worker.log     # see errors when starting Django
supervisorctl restart all     # restart every service managed by supervisor
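
When the log is noisy or many services are listed, it can help to filter the quick checks down to the lines that matter. The following is a minimal sketch, not an existing tool in the repo: the function names recent_errors and not_running and the grep pattern are assumptions made for illustration, and the pattern may need adjusting to match what the app actually logs.

```shell
# Sketch only: function names and the error pattern are assumptions,
# not part of the existing oddslingers tooling.

# Print the error-looking lines among the last N lines of a log (default 200).
recent_errors() {
    local file="$1" lines="${2:-200}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # "|| true" so that finding no matches is not treated as a failure
    tail -n "$lines" "$file" | grep -iE 'error|exception|traceback|critical' || true
}

# Read `supervisorctl status` output on stdin and list every program
# whose state (second column) is anything other than RUNNING.
not_running() {
    awk '$2 != "RUNNING" { print $1 " (" $2 ")" }'
}
```

Typical use would be `recent_errors /opt/oddslingers.poker/data/logs/http-worker.log` and `supervisorctl status | not_running`. Both only read state, so they are safe to run before the backup step below.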

Response Checklist

  1. Back up the system. ALWAYS ALWAYS ALWAYS create a backup/snapshot/db_dump before running custom SSH commands on a production server, especially when under pressure to fix things quickly (see db backup instructions).
  2. Where are all the files? Get a lay of the land and figure out where these key things are:
    • the main code repo: /opt/oddslingers.poker
    • logs: /opt/oddslingers.poker/data/logs
    • config files: /opt/oddslingers.poker/env/prod.env
    • database: is it on the same machine or a separate server? (check env/{ODDSLINGERS_ENV}.env and env/secrets.env)
    • system resources:
      • check running processes with htop, systemctl status, and supervisorctl status
      • check disk space remaining with df -h, and find what is using it with ncdu / (note that ncdu -h only prints the help text)
      • check network connectivity with iftop and mtr
  3. Figure out exactly what you're trying to do, and explain it to another team member before proceeding, e.g.:
    • fix wrong code deployed or broken deploy -> deploy new code
    • fix slow server due to high resource consumption: processes, connections, cpu, disk, etc -> identify bad process with htop, stop it safely, and fix underlying issue
  4. See if there's a tool already built to help achieve the goal you want, e.g.:
    • if you need to redeploy, don't fuss with files manually; find the deploy command and run it
    • if you need to restart a service, use supervisorctl; don't just killall and run the process manually, or you'll end up conflicting with other services
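
The resource check in step 2 and the "restart through supervisor" rule in step 4 can be sketched as small shell helpers. This is an illustration under stated assumptions, not an existing script: the function names, the 90% threshold, and the SUPERVISORCTL override are all hypothetical, and none of it replaces the backup in step 1.

```shell
# Sketch only: function names, the 90% threshold, and the SUPERVISORCTL
# override are assumptions for illustration.

# Print the used-space percentage (an integer) of the filesystem holding PATH.
disk_used_pct() {
    # df -P gives one stable line per filesystem; column 5 is capacity, e.g. "42%"
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report disk usage; return non-zero when usage is at or above the limit.
warn_if_disk_full() {
    local path="${1:-/}" limit="${2:-90}" used
    used="$(disk_used_pct "$path")"
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $path is ${used}% full"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}

# Restart one service through supervisor, then show its resulting state.
# Set SUPERVISORCTL=echo to preview the commands without running them.
restart_service() {
    local ctl="${SUPERVISORCTL:-supervisorctl}"
    "$ctl" restart "$1" && "$ctl" status "$1"
}
```

For example, `warn_if_disk_full /opt/oddslingers.poker` checks the filesystem holding the repo, and `SUPERVISORCTL=echo restart_service some-service` prints the two supervisorctl commands that would run (some-service is a placeholder; take the real names from supervisorctl status). Routing the restart through one helper keeps supervisor as the only thing starting and stopping processes.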

See the Production article for more information.