Notes on the July 21 downtime

We had major service issues earlier today. We deployed some bad code at 04:53 UTC on Thursday, 21 July 2016, and pushed out a fix at 17:49 UTC. During this time, email importing and delivery were affected. We have now restored all services, and all emails should have been imported and sent out. This is one of the worst service outages in the history of the company, and we are truly sorry about it.

What happened?

Since launching the new design, we have been working on improving performance and stability. As part of that effort, we have been adding more instrumentation using New Relic, and yesterday we added instrumentation to understand our usage of Redis. Unfortunately, the library we used to monitor Redis usage ended up breaking our background jobs (jobs that happen behind the scenes, like importing or delivering emails). The web app itself was up and functional, and unfortunately our alerting system, Pingdom, failed to alert our devops engineer. We were eventually alerted to the error, but it took us a few hours to locate the issue since we had never seen anything like this before. Once we deployed the fix, it took us a few hours to clear out the backlog of emails.

What are we doing to fix this?

As I mentioned, we have been working on improving the stability of the system, and this issue has shown us some of our weak spots. Here are a few things we are going to improve over the next month:

  • Better error monitoring for our background jobs
  • A switch to PagerDuty so multiple team members can be alerted in case of service issues
  • A rewrite of parts of our codebase so we can clear out an email backlog much faster after downtime
  • A thorough review of our codebase and infrastructure to find issues and fix them proactively
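
For readers curious what the first two items could look like in practice, here is a minimal sketch in Python. It is only an illustration, not SupportBee's actual code: the names JobQueueSnapshot, check_background_jobs and alert_team, as well as the thresholds, are all hypothetical. The idea is to watch the job queue itself (how many jobs are waiting, and how long since one last succeeded) rather than only checking that the web app responds, and to notify more than one person when something looks wrong.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JobQueueSnapshot:
    """Hypothetical view of a background job queue at one moment."""
    pending_jobs: int        # jobs waiting to be processed
    last_success_at: float   # Unix timestamp of the last completed job


def check_background_jobs(
    snapshot: JobQueueSnapshot,
    now: float,
    max_pending: int = 1000,           # assumed threshold, tune per system
    max_silence_seconds: float = 600,  # assumed threshold, tune per system
) -> List[str]:
    """Return a list of human-readable problems; empty means healthy."""
    problems = []
    if snapshot.pending_jobs > max_pending:
        problems.append(
            f"backlog too large: {snapshot.pending_jobs} jobs pending"
        )
    silence = now - snapshot.last_success_at
    if silence > max_silence_seconds:
        problems.append(
            f"no job has succeeded for {int(silence)} seconds"
        )
    return problems


def alert_team(
    problems: List[str],
    notifiers: List[Callable[[str], None]],
) -> int:
    """Send every problem to every notifier; return notifications sent."""
    sent = 0
    for notify in notifiers:
        for problem in problems:
            notify(problem)
            sent += 1
    return sent


if __name__ == "__main__":
    # Example: a queue that has been stuck for an hour with a large backlog.
    snapshot = JobQueueSnapshot(
        pending_jobs=4200,
        last_success_at=time.time() - 3600,
    )
    found = check_background_jobs(snapshot, now=time.time())
    # print stands in for real notifiers (pager, email, chat).
    alert_team(found, notifiers=[print])
```

In a real setup each notifier would page or message a different team member, so a single missed alert does not leave the problem unnoticed.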

We know that you depend on SupportBee to run your business, and we are extremely sorry about this issue. We have let you down, and there isn't a good way to make up for it. We will, however, do our best to improve the stability of the system so that these issues are not repeated.



Hana Mohan
