Primary Kubernetes node outage
We had an outage of our highest-spec Kubernetes node due to CPU exhaustion. The outage lasted from around 20:20 to 20:46 UTC, but was not a full service outage.
Leadup
List the sequence of events that led to the incident
I ran a query on Prometheus to try to gather some statistics on the number of metrics we are holding. This ended up scanning a large amount of data in the time-series database (TSDB) that Prometheus uses.
This scan exhausted the node's CPU, which in turn caused issues with the Kubernetes node status.
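For context, cardinality questions like this can usually be answered from Prometheus's TSDB status endpoint, which reports head-block statistics without scanning samples. The sketch below is illustrative only: the hostname is a placeholder, and the commented query is an example of the full-scan style of query that can exhaust CPU, not the exact query that was run.

```sh
# Cheap approach (placeholder hostname): ask the TSDB status endpoint for
# head-block statistics such as series counts.
curl -s 'http://prometheus.example:9090/api/v1/status/tsdb' | jq '.data'

# Expensive approach: a PromQL query that matches every series, e.g.
#   count({__name__=~".+"})
# forces Prometheus to touch every series in the head block and can pin the CPU.
```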
Impact
Describe how internal and external users were impacted during the incident
This brought down the primary node, which meant there was some service outage. Most services transferred successfully to our secondary node, which kept up key services such as the Moderation bot, the Modmail bot, and MongoDB.
Detection
Report when the team detected the incident, and how we could improve detection time
This was first noticed when our Discord services started failing. The primary detection was through alerts, though! I was paged 1 minute after we started encountering CPU exhaustion issues.
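For completeness, a quick way to see which alerts are firing at a given moment is the Prometheus alerts API; a minimal sketch, assuming a placeholder hostname:

```sh
# List the currently firing alerts from Prometheus (placeholder hostname).
curl -s 'http://prometheus.example:9090/api/v1/alerts' \
  | jq '.data.alerts[] | select(.state == "firing") | .labels.alertname'
```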
Response
Who responded to the incident, and what obstacles did they encounter?
Joe Banks responded to the incident.
No major obstacles were encountered during this.
Recovery
How was the incident resolved? How can we improve future mitigation?
It was noted that in the output of `kubectl get nodes`, the primary node's status was reported as `NotReady`. Looking into the reason, this was because the node had stopped responding.
The quickest way to fix this was to trigger a node restart. This shifted a lot of pods over to node 2, which ran into some capacity issues since it is not as highly specced as the first node.
I brought the first node back by restarting it from Linode's end.
Once this node was reporting as `Ready` again, I drained the second node by running `kubectl drain lke13311-20304-5ffa4d11faab`. This command stops the node from being available for scheduling and moves existing pods onto other nodes.
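The commands below are a rough sketch of that sequence rather than a transcript of exactly what was run; the drain flags shown are assumptions based on what a node running DaemonSet pods typically requires, and the final uncordon re-enables scheduling on the drained node once things are stable.

```sh
# Check node status; the primary was showing NotReady at this point.
kubectl get nodes

# Drain the secondary node so pods reschedule onto the recovered primary.
# The flags are assumptions: they are usually needed when DaemonSet pods or
# emptyDir volumes are present on the node.
kubectl drain lke13311-20304-5ffa4d11faab --ignore-daemonsets --delete-emptydir-data

# Re-enable scheduling on the drained node once things have settled.
kubectl uncordon lke13311-20304-5ffa4d11faab
```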
Services gradually recovered as their dependencies came back up. The incident lasted around 26 minutes overall, though it was not a complete outage for the whole duration and the bot remained functional throughout (meaning systems like the help channels kept working).
Five Why's
Run a 5-whys analysis to understand the true cause of the incident.
- Why? Partial service outage.
- Why? We had a node outage.
- Why? The primary node's CPU was exhausted.
- Why? A large Prometheus query used a lot of CPU.
- Why? Prometheus had to scan millions of TSDB records, which consumed all the cores.
Blameless root cause
Note the final root cause and describe what needs to change to prevent recurrence
A large query was run on Prometheus, so the simplest mitigation is not to run such queries in the first place.
To protect against this more robustly, though, we should write resource constraints for services like this that are vulnerable to CPU exhaustion or excessive memory consumption, which were the causes of our two past outages as well.
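As a sketch of what such a constraint could look like, the command below caps CPU and memory for a Prometheus deployment; the namespace, deployment name, and values are assumptions and should ultimately live in the manifests rather than be applied imperatively.

```sh
# Hypothetical example: the namespace, deployment name, and request/limit
# values are assumptions; adjust them to the real manifests.
kubectl -n monitoring set resources deployment/prometheus \
  --requests=cpu=500m,memory=2Gi \
  --limits=cpu=2,memory=4Gi
```

With a CPU limit in place, a runaway query gets throttled inside the Prometheus container instead of starving the kubelet and every other workload on the node.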
Lessons learned
What did we learn from this incident?
- Don't run large queries; they consume a lot of CPU!
- Write resource constraints for our services.
Follow-up tasks
List any tasks we should complete that are relevant to this incident
- Write resource constraints for our services.