All services outage
At 19:55 UTC, all services became unresponsive. The DevOps team was already in a call and immediately started investigating.
Postgres was running at 100% CPU usage due to an AUTOVACUUM, which caused all services that depend on it to stop working. The high CPU load left the host unresponsive and it shut down; Linode Lassie noticed this and triggered a restart.
The node did not recover gracefully from this restart, with numerous core services reporting errors, so we had to manually restart core system services using Lens to get things working again.
Leadup
List the sequence of events that led to the incident
Postgres triggered an AUTOVACUUM, which led to a CPU spike. Postgres was pinned at 100% CPU and became unresponsive, which caused dependent services to stop responding. This led to a restart of the node, from which we did not recover gracefully.
Impact
Describe how internal and external users were impacted during the incident
All services went down. Catastrophic failure. We did not pass go, we did not collect $200.
- Help channel system unavailable, so people were not able to effectively ask for help.
- Gates unavailable, so people could not successfully get into the community.
- Moderation and raid prevention unavailable, which left us defenseless against attacks.
Detection
Report when the team detected the incident, and how we could improve detection time
We noticed that all PyDis services had stopped responding. Coincidentally, our DevOps team was already in a call at the time, which helped us respond quickly.
We may be able to improve detection time by adding monitoring of resource usage. To this end, we've added alerts for high CPU usage and low memory.
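As an illustration of the kind of rules we mean, here is a minimal sketch of resource alerts, assuming a Prometheus/Alertmanager stack scraping node-exporter metrics; the alert names and thresholds below are illustrative placeholders, not the exact rules we deployed.

```yaml
# Sketch of Prometheus alerting rules for node resource pressure.
# Assumes node-exporter metrics; names and thresholds are examples only.
groups:
  - name: node-resources
    rules:
      - alert: NodeHighCpuUsage
        # Average CPU utilisation across all cores above 90% for 10 minutes.
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage on {{ $labels.instance }} has been above 90% for 10 minutes."
      - alert: NodeLowMemory
        # Less than 10% of memory available for 10 minutes.
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% of memory is available on {{ $labels.instance }}."
```

Alerting before the node is fully saturated gives us a window to intervene (e.g. cancel or throttle the offending workload) before the host shuts down.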
Response
Who responded to the incident, and what obstacles did they encounter?
Joe Banks responded to the incident.
We noticed our node was entirely unresponsive, and within minutes Lassie had triggered a restart after a high-CPU shutdown.
The node came back up and we saw a number of core services offline (e.g. Calico, CoreDNS, Linode CSI).
Obstacle: no recent database backup was available.
Recovery
How was the incident resolved? How can we improve future mitigation times?
Through Lens, we restarted core services one by one until they stabilised. Once these core services were up, other services began to come back online.
Finally, we re-provisioned PostgreSQL, which had been removed as a component before the restart (though too late to prevent the CPU problems). Once PostgreSQL was up, we restarted any components that were still misbehaving (e.g. site and bot).
Five Whys
Run a 5-whys analysis to understand the true cause of the incident.
- Major service outage
- Why? Core service failures (e.g. Calico, CoreDNS, Linode CSI)
- Why? Kubernetes worker node restart
- Why? High CPU shutdown
- Why? Intensive PostgreSQL AUTOVACUUM caused a CPU spike
Blameless root cause
Note the final root cause and describe what needs to change to prevent recurrence
Lessons learned
What did we learn from this incident?
- We must ensure we have working database backups. We were lucky not to lose any data this time; if this problem had caused volume corruption, we would have had no way to recover the data (see the backup sketch after this list).
- Sentry is broken for the bot. It was missing a DSN secret, which we have now restored.
- The sentry.pydis.com redirect was never migrated to the cluster. We should do that.
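On the backups point above, here is a minimal sketch of one way to schedule regular logical backups in the cluster, assuming a Kubernetes CronJob running pg_dumpall against the in-cluster PostgreSQL service. The service hostname, secret name, and PVC name are hypothetical placeholders, not our actual configuration.

```yaml
# Sketch of a nightly logical backup job. All names below are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 4 * * *"  # daily at 04:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:13  # should match the server's major version
              command: ["/bin/sh", "-c"]
              args:
                - |
                  pg_dumpall -h postgres.default.svc.cluster.local -U postgres \
                    > /backups/dump-$(date +%F).sql
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials  # hypothetical secret
                      key: password
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: postgres-backups  # hypothetical PVC
```

A dump written to another in-cluster volume only protects us against database-level corruption; shipping the dumps off-cluster (e.g. to object storage) would also cover the volume-corruption scenario described above.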
Follow-up tasks
List any tasks we've created as a result of this incident
- Push forward with backup plans