Postgres connection surge
At 13:24 UTC, we noticed that the bot could not apply infractions and that pythondiscord.com was unavailable. The DevOps team started to investigate.
We discovered that Postgres was not accepting new connections because it had hit its limit of 100 clients. This made it unavailable to all services that depended on it.
Ultimately this was resolved by taking down Postgres, remounting the associated volume, and bringing it back up again.
Leadup
List the sequence of events that led to the incident
Bot infractions stopped working, and we started investigating.
Impact
Describe how internal and external users were impacted during the incident
Services were unavailable for both internal and external users.
- The Help Channel System was unavailable.
- Voice Gate and Server Gate were not working.
- Moderation commands were unavailable.
- Python Discord site & API were unavailable. CloudFlare automatically switched us to Always Online.
Detection
Report when the team detected the incident, and how we could improve detection time
We noticed HTTP 524s coming from CloudFlare; upon attempting a database connection, we observed that the maximum client limit had been reached.
We noticed this log in site:
django.db.utils.OperationalError: FATAL: sorry, too many clients already
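A quick manual check would have shown how close we were to the ceiling. The commands below are a sketch only: the pod name and database user are placeholders, and Postgres reserves a few superuser slots (superuser_reserved_connections, 3 by default), so a superuser may still get in when regular clients are being refused.
# Count current client connections
kubectl exec -it <postgres pod> -- psql -U <superuser> -c "SELECT count(*) FROM pg_stat_activity;"
# Show the configured connection ceiling (PostgreSQL's default is 100)
kubectl exec -it <postgres pod> -- psql -U <superuser> -c "SHOW max_connections;"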
We should be monitoring the number of clients, and the monitor should alert us when we are approaching the maximum. That would have allowed for earlier detection, and possibly allowed us to prevent the incident altogether.
We will look at wrouesnel/postgres_exporter for monitoring this.
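As a rough sketch of what that could look like (connection details below are placeholders, not real credentials), the exporter takes a connection string via its DATA_SOURCE_NAME environment variable and exposes metrics on port 9187 for Prometheus/Grafana to scrape:
# Trial run of the exporter against a test database
docker run -p 9187:9187 -e DATA_SOURCE_NAME="postgresql://<user>:<password>@<postgres host>:5432/postgres?sslmode=disable" wrouesnel/postgres_exporter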
Response
Who responded to the incident, and what obstacles did they encounter?
Joe Banks responded to the incident. The main obstacle was the lack of a clear response strategy.
We should document our recovery procedure so that we're not so dependent on Joe Banks should this happen again while he's unavailable.
Recovery
How was the incident resolved? How can we improve future mitigation?
- Delete the PostgreSQL deployment:
kubectl delete deployment/postgres
- Delete any remaining pods, WITH force:
kubectl delete pod <pod name> --force --grace-period=0
- Unmount the volume at Linode
- Remount the volume at Linode
- Reapply the deployment:
kubectl apply -f postgres/deployment.yaml
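Once the deployment is reapplied, it is worth verifying that Postgres is actually accepting connections again before declaring recovery. A sketch, with the pod and user as placeholders:
# Wait for the new pod to become ready
kubectl rollout status deployment/postgres
# Confirm connections are accepted and the client count is back to a sane level
kubectl exec -it <postgres pod> -- psql -U <superuser> -c "SELECT count(*) FROM pg_stat_activity;"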
Five Why's
Run a 5-whys analysis to understand the true cause of the incident.
- Postgres was unavailable, so our services died.
- Why? Postgres hit its maximum client limit and could not accept new connections.
- Why? Unknown, but we saw a number of connections from previous deployments of site. This indicates that database connections are not being terminated properly. Needs further investigation (see the query sketch below).
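To support that investigation, a query along these lines shows who is holding connections and how stale they are (pod and user are placeholders; pg_stat_activity is standard PostgreSQL):
# Group connections by origin and state, busiest first
kubectl exec -it <postgres pod> -- psql -U <superuser> -c "SELECT datname, usename, application_name, client_addr, state, count(*), max(now() - state_change) AS oldest FROM pg_stat_activity GROUP BY 1, 2, 3, 4, 5 ORDER BY count(*) DESC;"
If stale connections from old site deployments show up here, pg_terminate_backend(pid) can close them individually, which would be a far less disruptive mitigation than restarting Postgres.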
Blameless root cause
Note the final root cause and describe what needs to change to prevent recurrence
We're not sure what the root cause is, but we suspect that site is not terminating database connections properly in some cases. We were unable to reproduce this problem.
We've set up new telemetry on Grafana with alerts so that we can investigate this more closely. We will be alerted if the number of connections from site exceeds 32, or if the total number of connections exceeds 90.
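As a manual complement to the Grafana alerts, the site threshold can also be spot-checked directly; this assumes site connects with its own database role, shown here as a placeholder:
# Connections held by site (alert threshold: 32)
kubectl exec -it <postgres pod> -- psql -U <superuser> -c "SELECT count(*) FROM pg_stat_activity WHERE usename = '<site db user>';"
The total-connection threshold (90) corresponds to the plain count shown under Detection above.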
Lessons learned
What did we learn from this incident?
- We must ensure the DevOps team has access to Linode and other key services even if our Bitwarden is down.
- We need to ensure we're alerted to any risk factors that could make Postgres unavailable, since this causes a catastrophic outage of practically all services.
- We absolutely need backups for the databases so that this sort of problem carries less risk.
- We may need to consider something like PgBouncer to manage a connection pool, so that we don't exceed the 100-client limit with legitimate connections as we connect more services to the Postgres database.
Follow-up tasks
List any tasks we should complete that are relevant to this incident
- Back up all databases