A PostgreSQL connection spike (00:27 UTC), caused by Django, left a node
unresponsive (00:55 UTC). When we recycled the affected node, its volumes
were left in a state where they could not be mounted.
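One common contributor to connection spikes is that Django, by default, opens a new database connection for every request. Below is a minimal sketch of enabling persistent connections via `CONN_MAX_AGE`; the database name and host are illustrative placeholders, not our production configuration.

```python
# settings.py -- sketch only; NAME and HOST are hypothetical.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "pythondiscord",  # hypothetical database name
        "HOST": "postgres",       # hypothetical in-cluster service name
        "PORT": "5432",
        # Reuse each connection for up to 60 seconds instead of opening
        # a fresh one per request, which smooths out connection spikes.
        "CONN_MAX_AGE": 60,
    }
}
```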
At around 14:30 UTC on Saturday 30th January, we started experiencing
networking issues at the LoadBalancer level between Cloudflare and our
Kubernetes cluster. It seems this was due to memory and CPU pressure.
This post-mortem is preliminary; we are still awaiting word
from Linode's sysadmins on any problems they detected.
Update 2nd February 2021: Linode have migrated our NodeBalancer to a
different machine.
At 19:55 UTC, all services became unresponsive. The DevOps team were
already in a call, and immediately started to investigate.
Postgres was running at 100% CPU usage due to a VACUUM, which caused
all services that depended on it to stop working. The high CPU load left
the host unresponsive, and it shut down. Linode's Lassie watchdog noticed
this and triggered a restart.
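For future incidents of this kind, the progress of a running VACUUM can be inspected from a separate session via PostgreSQL's built-in `pg_stat_progress_vacuum` view (available since 9.6). A rough sketch using psycopg2; the DSN is a placeholder.

```python
import psycopg2

# DSN is a placeholder; point it at the affected instance.
conn = psycopg2.connect("postgresql://postgres@localhost:5432/postgres")
with conn.cursor() as cur:
    # One row per backend currently running VACUUM, with its phase
    # and how far through the table's heap blocks it has got.
    cur.execute("""
        SELECT pid, relid::regclass AS relation, phase,
               heap_blks_scanned, heap_blks_total
        FROM pg_stat_progress_vacuum
    """)
    for pid, relation, phase, scanned, total in cur.fetchall():
        print(f"pid={pid} {relation}: {phase} ({scanned}/{total} blocks)")
conn.close()
```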
The node did not recover gracefully from this restart: numerous core
services reported errors, so we had to manually restart core system
services using Lens in order to get things working again.
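Lens performs these restarts through the Kubernetes API, so the same recovery step can be scripted. A rough sketch using the official kubernetes Python client; the target deployment and namespace here are hypothetical examples, not necessarily the services we restarted.

```python
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Patching this annotation is what `kubectl rollout restart` does under
# the hood: it changes the pod template, forcing a rolling restart.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt":
                        datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment("coredns", "kube-system", patch)  # hypothetical target
```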
At 13:24 UTC, we noticed the bot was not able to infract, and
pythondiscord.com was unavailable. The
DevOps team started to investigate.
We discovered that Postgres was not accepting new connections because it
had hit 100 clients, the default max_connections limit. This made it
unavailable to all services that depended on it.
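A small sketch for keeping an eye on how close an instance is to that ceiling, again with psycopg2 and a placeholder DSN. Connecting as a superuser helps here, since PostgreSQL reserves a few slots for superusers even when ordinary clients are being rejected.

```python
import psycopg2

# DSN is a placeholder; point it at the affected instance.
conn = psycopg2.connect("postgresql://postgres@localhost:5432/postgres")
with conn.cursor() as cur:
    cur.execute("SHOW max_connections")
    limit = int(cur.fetchone()[0])
    # One row per client; grouping by user shows which services are
    # holding the most connections.
    cur.execute("SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename")
    rows = cur.fetchall()
    print(f"{sum(n for _, n in rows)}/{limit} connections in use")
    for user, n in rows:
        print(f"  {user}: {n}")
conn.close()
```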
Ultimately this was resolved by taking down Postgres, remounting the
associated volume, and bringing it back up again.
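In Kubernetes terms, "taking down Postgres" amounts to scaling its workload to zero so the volume is released, then scaling back up so the rescheduled pod mounts it afresh. A rough sketch with the kubernetes Python client, assuming Postgres runs as a StatefulSet named `postgres` in the `default` namespace (both names hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Scale to zero: the pod terminates and its PersistentVolume is unmounted.
apps.patch_namespaced_stateful_set_scale(
    "postgres", "default", {"spec": {"replicas": 0}})

# ...check/repair the volume here, then scale back up so the pod is
# rescheduled and mounts the volume again.
apps.patch_namespaced_stateful_set_scale(
    "postgres", "default", {"spec": {"replicas": 1}})
```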