Partial API outage
Incident Report for Honeybadger.io
Postmortem

This report details the impact of our outage on August 26th, the cause of the outage, and the steps we have taken and will be taking to prevent a similar outage in the future. Before getting into the details, I want to apologize to everyone who was impacted by this outage – we have worked hard to build a resilient system, and it’s really disappointing when we let you down. On to the deets…

What happened?

A little after 7PM Pacific Time I received an alert from PagerDuty letting me know that one of our ingestion API endpoints was not responding to external monitoring. Reviewing our dashboards revealed that our primary Redis cluster was about to run out of available memory. This cluster stores the Sidekiq queue that we use for processing the payloads of the errors reported to our API, and it typically rests at 3% memory utilization. Our internal dashboard did show some outliers among our customers for inbound traffic, but a few minutes of research down that path did not lead to a cause for the memory consumption.

Running down our list of things to check in case of emergency led me to find that our database server was overloaded with slow-running queries. This caused our Sidekiq jobs to take much longer than usual (30-50x as long), which created a backlog large enough to consume all of the available memory. With our main Redis cluster effectively unusable, we had the following problems:

  • Our API endpoints were unable to receive error reports, source map uploads, deployment notifications, and check-in reports
  • Uptime checks were delayed
  • Some check-ins were marked as down because their reports couldn't be received
  • Some error notifications were lost as we had to juggle Redis instances (more on that below)
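
For anyone curious what that triage looks like, here's a rough sketch of the two checks involved – confirming the Redis memory pressure and measuring the Sidekiq backlog. This is illustrative only (our tooling is Ruby-based); the connection URL and queue names below are assumptions, not our actual configuration.

    # Rough triage sketch: check Redis memory pressure and Sidekiq queue depth.
    # The connection URL and queue names are assumptions, not our real config.
    import os
    import redis

    r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))

    mem = r.info("memory")
    used, peak = mem["used_memory"], mem["used_memory_peak"]
    print(f"used_memory: {used / 1024 / 1024:.1f} MiB (peak {peak / 1024 / 1024:.1f} MiB)")
    if mem.get("maxmemory"):
        print(f"utilization: {used / mem['maxmemory']:.1%} of maxmemory")

    # Sidekiq keeps each queue in a Redis list named "queue:<name>".
    for queue in ("default", "notices"):  # hypothetical queue names
        backlog = r.llen("queue:" + queue)
        print(f"queue:{queue} backlog: {backlog} jobs")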

When was it fixed?

Not long after 9PM our API was responsive again. By 10PM our backlog of error payloads was fully processed, and we were back to normal. We would have been back in business sooner, but a few things tripped us up:

  • As soon as it was clear we were going to run out of memory on our Redis cluster (which is hosted by AWS ElastiCache), and that I wouldn't be able to quickly free up some memory, I started a resize of the existing cluster. When it became clear that would not be quick (it ended up taking approximately two hours), I spun up a new, larger ElastiCache cluster. When it became clear that would not be quick either, I spun up an EC2 instance in our VPC to host Redis temporarily.
  • Unfortunately, though we use Ansible for automating all our EC2 provisioning, we did not have a playbook for quickly spinning up a Redis server. When I set up the first server manually, I didn't provision enough disk space on the instance to store the Redis snapshot as the backlog grew (I didn't have the fix in place for the slow queries yet).
  • When I spotted that problem with the Redis instance, I spun up another one with a large-enough disk.
  • We also didn't have an automated way to update the four places where our app, api, and worker instances are configured with the location of the Redis server, so each time the Redis server moved I had to run a few Ansible commands by hand to update the configurations and bounce the services.

Once that was all settled, though, and traffic was once again flowing into our new self-hosted Redis instance, I was able to turn my attention to the cause of the problem – the slow queries. It turned out that a single query was the cause of the slowness. This query loaded previously-uploaded source maps to be applied to JavaScript errors as they were being processed. Since the problem was so localized, I was able to get the database back to a good place by temporarily suspending the source map processing.
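
That "temporarily suspend" step was essentially a kill switch in front of the expensive lookup. Our pipeline is Ruby/Sidekiq, but here's a minimal sketch of the pattern in Python; the flag key, payload fields, and function names are hypothetical, not our actual code.

    # Sketch of a Redis-backed kill switch for an expensive processing step.
    # Flag key, payload fields, and function names are hypothetical; this only
    # illustrates the pattern, not our production worker.
    import os
    import redis

    r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
    FLAG_KEY = "flags:suspend_source_map_processing"  # hypothetical flag name

    def apply_source_maps(backtrace):
        # Placeholder for the real (expensive) source map lookup.
        return backtrace

    def process_error(payload):
        # Skip the slow source map query while the flag is set; errors are
        # still processed, just without source-mapped backtraces.
        if payload.get("language") == "javascript" and r.get(FLAG_KEY) is None:
            payload["backtrace"] = apply_source_maps(payload["backtrace"])
        return payload

    # During an incident, flip the switch with an expiry so it can't linger:
    # r.set(FLAG_KEY, "1", ex=3600)

Giving the flag an expiry is a small safeguard so a temporary suspension doesn't quietly become permanent.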

What's the remediation plan?

As you can imagine, there are a number of things we can do to help avoid or minimize this kind of situation in the future:

  1. Review that database table and/or the query to see how we can get past the tipping point we encountered that turned a 4-10ms query into a 400ms query for certain customers (in progress)
  2. Persist payloads to S3 sooner in our Sidekiq jobs so we can minimize memory pressure on Redis (see the sketch after this list)
  3. Increase the size of our Redis cluster (already done)
  4. Create an Ansible playbook to quickly provision a new Redis instance in case of emergency (done)
  5. Centralize the four app configurations for the URL of the Redis cluster and create an Ansible playbook that can quickly update those configurations (in progress)
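
For item 2, the idea is to hand the bulky payload to S3 up front so the job sitting in Redis is just a small pointer. Here's a rough sketch under assumed names – the bucket, key scheme, queue, and job shape are all placeholders (the real job format is Sidekiq's, which carries more fields):

    # Sketch of remediation #2: store the raw payload in S3 before enqueueing,
    # so the queued job carries a small pointer instead of the full payload.
    # The bucket, key scheme, queue, and job shape are assumptions.
    import json
    import os
    import uuid

    import boto3
    import redis

    s3 = boto3.client("s3")
    r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
    BUCKET = "example-error-payloads"  # hypothetical bucket name

    def enqueue_error(payload):
        key = f"payloads/{uuid.uuid4()}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode("utf-8"))

        # The job in Redis now holds only the S3 key, so the queue stays small
        # even if processing backs up.
        job = {"class": "ProcessErrorJob", "args": [key]}  # simplified job shape
        r.lpush("queue:default", json.dumps(job))
        return key

With that change, a processing stall grows the queue by bytes per job instead of whole payloads, which buys a lot more headroom on the same Redis cluster.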

We'll continue to improve our systems and processes to deliver the most reliable service we can. We truly appreciate that you've chosen us for your monitoring needs, and we are always eager to show our appreciation by working hard for you. As always, if you have any questions or comments, please reach out to us at support@honeybadger.io.

Posted Aug 27, 2019 - 12:26 PDT

Resolved
The backlog has been cleared, and our Redis cluster is happy once again. We'll be looking at ways we can better handle this scenario in the future.
Posted Aug 26, 2019 - 22:10 PDT
Update
We have a temporary fix in place for the impacted Redis cluster, and now we are working on the backlog.
Posted Aug 26, 2019 - 21:39 PDT
Identified
Our main Redis cluster is having issues, and we are attempting to work around them.
Posted Aug 26, 2019 - 20:17 PDT
Investigating
We are currently investigating this issue.
Posted Aug 26, 2019 - 19:21 PDT
This incident affected: API.