A PostgreSQL connection spike (00:27 UTC), caused by Django, left a node
unresponsive (00:55 UTC). When we recycled the affected node, its volumes
were left in a state where they could not be mounted.
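One common contributor to connection spikes is that Django, by default, opens a new database connection for every request. Below is a minimal sketch of enabling persistent connections via `CONN_MAX_AGE`; the database name and host are illustrative placeholders, not our production configuration.

```python
# settings.py -- sketch only; NAME and HOST are hypothetical.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "pythondiscord",  # hypothetical database name
        "HOST": "postgres",       # hypothetical in-cluster service name
        "PORT": "5432",
        # Reuse each connection for up to 60 seconds instead of opening
        # a fresh one per request, which smooths out connection spikes.
        "CONN_MAX_AGE": 60,
    }
}
```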
At around 14:30 UTC on Saturday 30th January, we started experiencing
networking issues at the LoadBalancer level between Cloudflare and our
Kubernetes cluster. It seems this was due to memory and CPU pressure.
This post-mortem is preliminary; we are still awaiting word
from Linode's sysadmins on any problems they detected.
Update 2nd February 2021: Linode have migrated our NodeBalancer to a
different machine.
At 19:55 UTC, all services became unresponsive. The DevOps team were
already in a call, and immediately started to investigate.
Postgres was running at 100% CPU usage due to a VACUUM, which caused
all services that depended on it to stop working. The high CPU load left
the host unresponsive, and it shut down. Linode's Lassie watchdog noticed
this and triggered a restart.
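For future incidents of this kind, the progress of a running VACUUM can be inspected from a separate session via PostgreSQL's built-in `pg_stat_progress_vacuum` view (available since 9.6). A rough sketch using psycopg2; the DSN is a placeholder.

```python
import psycopg2

# DSN is a placeholder; point it at the affected instance.
conn = psycopg2.connect("postgresql://postgres@localhost:5432/postgres")
with conn.cursor() as cur:
    # One row per backend currently running VACUUM, with its phase
    # and how far through the table's heap blocks it has got.
    cur.execute("""
        SELECT pid, relid::regclass AS relation, phase,
               heap_blks_scanned, heap_blks_total
        FROM pg_stat_progress_vacuum
    """)
    for pid, relation, phase, scanned, total in cur.fetchall():
        print(f"pid={pid} {relation}: {phase} ({scanned}/{total} blocks)")
conn.close()
```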
The node did not recover gracefully from this restart: numerous core
services reported errors, so we had to manually restart core system
services using Lens in order to get things working again.
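Lens performs these restarts through the Kubernetes API, so the same recovery step can be scripted. A rough sketch using the official kubernetes Python client; the target deployment and namespace here are hypothetical examples, not necessarily the services we restarted.

```python
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Patching this annotation is what `kubectl rollout restart` does under
# the hood: it changes the pod template, forcing a rolling restart.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt":
                        datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}
apps.patch_namespaced_deployment("coredns", "kube-system", patch)  # hypothetical target
```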
At 13:24 UTC, we noticed the bot was not able to infract, and
pythondiscord.com was unavailable. The
DevOps team started to investigate.
We discovered that Postgres was not accepting new connections because it
had hit 100 clients, the default max_connections limit. This made it
unavailable to all services that depended on it.
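A small sketch for keeping an eye on how close an instance is to that ceiling, again with psycopg2 and a placeholder DSN. Connecting as a superuser helps here, since PostgreSQL reserves a few slots for superusers even when ordinary clients are being rejected.

```python
import psycopg2

# DSN is a placeholder; point it at the affected instance.
conn = psycopg2.connect("postgresql://postgres@localhost:5432/postgres")
with conn.cursor() as cur:
    cur.execute("SHOW max_connections")
    limit = int(cur.fetchone()[0])
    # One row per client; grouping by user shows which services are
    # holding the most connections.
    cur.execute("SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename")
    rows = cur.fetchall()
    print(f"{sum(n for _, n in rows)}/{limit} connections in use")
    for user, n in rows:
        print(f"  {user}: {n}")
conn.close()
```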
Ultimately this was resolved by taking down Postgres, remounting the
associated volume, and bringing it back up again.
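In Kubernetes terms, "taking down Postgres" amounts to scaling its workload to zero so the volume is released, then scaling back up so the rescheduled pod mounts it afresh. A rough sketch with the kubernetes Python client, assuming Postgres runs as a StatefulSet named `postgres` in the `default` namespace (both names hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Scale to zero: the pod terminates and its PersistentVolume is unmounted.
apps.patch_namespaced_stateful_set_scale(
    "postgres", "default", {"spec": {"replicas": 0}})

# ...check/repair the volume here, then scale back up so the pod is
# rescheduled and mounts the volume again.
apps.patch_namespaced_stateful_set_scale(
    "postgres", "default", {"spec": {"replicas": 1}})
```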