This report details the impact of our August 26th outage, its cause, and the steps we have taken and will be taking to prevent a similar outage in the future. Before getting into the details, I want to apologize to everyone who was impacted by this outage – we have worked hard to build a resilient system, and it’s really disappointing when we let you down. On to the deets…
A little after 7PM Pacific Time I received an alert from PagerDuty letting me know that one of our ingestion API endpoints was not responding to external monitoring. Reviewing our dashboards revealed that our primary Redis cluster was about to run out of available memory. This cluster stores the Sidekiq queue that we use for processing the payloads of the errors reported to our API, and it typically rests at 3% memory utilization. Our internal dashboard did show some outliers among our customers for inbound traffic, but a few minutes of research down that path did not reveal the cause of the memory consumption. Working through our emergency checklist, I found that our database server was overloaded with slow-running queries. That slowness caused our Sidekiq jobs to take much longer than usual (30-50x as long), which created a backlog large enough to consume all of the cluster's available memory. Having our main Redis cluster effectively unusable resulted in the following problems:
Not long after 9PM our API was responsive again. By 10PM our backlog of error payloads was fully processed, and we were back to normal. We would have been back in business sooner, but a few things tripped us up:
Once that was all settled, though, and traffic was once again flowing into our new self-hosted Redis instance, I was able to turn my attention to the cause of the problem – the slow queries. It turned out that a single query was responsible for the slowness. This query loaded previously-uploaded sourcemaps to be applied to JavaScript errors as they were being processed. Since the problem was so localized, I was able to get the database back to a good place by temporarily suspending the sourcemap processing.
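To give you an idea of what "temporarily suspending" a single processing step can look like, here's a minimal sketch in Python (our actual pipeline is Ruby/Sidekiq, and the flag name and helper functions below are hypothetical, not our real code): a shared flag in Redis lets every worker skip the expensive sourcemap lookup, so the slow query is never issued while the database recovers.

```python
# Minimal sketch of a kill switch for one expensive processing step.
# The key name and helpers are hypothetical; this is not Honeybadger's code.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

DISABLE_SOURCEMAPS = "flags:disable_sourcemap_processing"

def apply_sourcemaps(payload: dict) -> dict:
    # Placeholder for the step backed by the slow query: look up any
    # previously-uploaded sourcemaps and rewrite the JavaScript backtrace.
    payload["backtrace"] = ["<mapped frames would go here>"]
    return payload

def store_error(payload: dict) -> dict:
    # Placeholder for persisting the processed error.
    return payload

def process_error(payload: dict) -> dict:
    # While the flag is set, skip the sourcemap lookup entirely so the
    # overloaded database never sees the expensive query.
    if r.exists(DISABLE_SOURCEMAPS):
        return store_error(payload)
    return store_error(apply_sourcemaps(payload))

# During an incident, flip the switch with a TTL so it can't be forgotten:
# r.set(DISABLE_SOURCEMAPS, "1", ex=3600)
```

Keeping the flag in Redis means every worker sees it immediately, and giving it a TTL means it can't be accidentally left on after the incident is over.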
As you can imagine, there are a number of things we can do to help avoid or minimize this kind of situation in the future:
We'll continue to improve our systems and processes to deliver the most reliable service we can. We truly appreciate that you've chosen us for your monitoring needs, and we're always eager to show it by working hard for you. As always, if you have any questions or comments, please reach out to us at support@honeybadger.io.