A few months ago, we decided to massively overhaul our infrastructure. It was a huge project, and we've been busy at it ever since. We're finally approaching completion this month, and I'm excited to share a bunch of updates with you.
Before getting into what we've been working on for the past few months, let me give you some background. Back in July this year, SupportBee had a massive outage that lasted hours. While we had experienced outages before, most of them were minor and didn't affect users. July's outage, though, was the worst in SupportBee's history. Although we were able to identify and mitigate the outage in time, it took our system hours to import and process all the emails that had not been imported during the outage. Customer support is a time-critical process, and because the outage lasted hours, it affected our customers and, in turn, our customers' customers. Once the outage ended, we promised ourselves we would never let this happen again. We immediately began an investigation to identify the problem areas and started upgrading our infrastructure. Today, I'm happy to share the following updates with you.
We moved most of our background jobs, including email processing jobs, from Resque to Sidekiq. Resque is a popular background job processing system. We've been using Resque at SupportBee since our early days, and it has served us well. However, Resque can sometimes be very resource hungry (as it was in our case). Sidekiq is a popular alternative to Resque that promises to use resources more efficiently and is also optimized for performance. (For the technically curious, Sidekiq uses operating system threads instead of operating system processes to process background jobs.) We spent the last few months adding Sidekiq to our infrastructure, making background jobs Sidekiq compatible, and slowly moving each background job from Resque to Sidekiq, all while ensuring customers remained unaffected. Because Sidekiq's use of threads brings its own set of quirks, we also spent a considerable amount of time investigating and fixing these issues.
Because a lot of background jobs are now on Sidekiq, we're able to process many more emails in parallel than we could earlier with the same resources. This is incredibly useful during outages or at times of peak traffic.
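For the technically curious, here is a rough sketch of what thread-based job processing looks like in plain Ruby. This is not Sidekiq's code or ours; it uses only the standard library, and the names `process_email` and `run_jobs` are made up for illustration. It shows several worker threads inside one process pulling jobs off a shared queue, which is the general idea behind handling more work with the same resources.

```ruby
# Illustrative sketch only: a tiny thread-based worker pool built on the
# Ruby standard library. Sidekiq's real implementation is more involved.

# Hypothetical job: pretend to import one email and return a result string.
def process_email(id)
  "imported-#{id}"
end

# Process every job id using `worker_count` threads in a single process.
def run_jobs(job_ids, worker_count)
  queue = Queue.new            # thread-safe queue of pending jobs
  job_ids.each { |id| queue << id }
  queue.close                  # once drained, pop returns nil for every worker

  results = Queue.new          # thread-safe, so no explicit locking is needed

  workers = Array.new(worker_count) do
    Thread.new do
      # Each worker keeps taking jobs until the closed queue is empty.
      while (id = queue.pop)
        results << process_email(id)
      end
    end
  end

  workers.each(&:join)         # wait for all workers to finish
  Array.new(results.size) { results.pop }
end

imported = run_jobs((1..20).to_a, 4)
puts "Processed #{imported.size} emails with 4 threads"
```

Because the threads share memory, anything they touch together has to be thread-safe, which is why the sketch collects results in a `Queue` rather than a plain array. That kind of shared state is typical of the quirks threads bring with them.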
The API now responds in less than 250 milliseconds. This is almost twice as fast as before. It means that your SupportBee desk loads quicker, tickets load quicker, replies load quicker, and the whole experience feels a lot snappier.
We were able to achieve this significant improvement by deploying two major updates. Firstly, we upgraded our app to use Ruby 2. Ruby 2 is a newer version of the Ruby programming language that promises (and delivers!) more performance. Moving to Ruby 2 required a rewrite of a significant portion of our email processing infrastructure. Secondly, we moved the web interface to a server push architecture (the same architecture we use to show you new tickets and replies without a page refresh) and completely removed any remnants of polling. For example, to show all agents who are viewing the same ticket as you are, the web interface previously polled our servers every 15 seconds. It's very rare for two agents to have the same ticket open simultaneously, so this design sent an immense amount of mostly pointless traffic to our servers, especially during peak hours. Now, with the server push architecture, the server sends updates to the web interface only when necessary.
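To make the difference concrete: polling every 15 seconds works out to 4 requests a minute, or 240 an hour, for every open ticket, whether or not anything changed. The sketch below shows the push idea in miniature. It is only an illustration under simplifying assumptions: the `TicketPresence` class and its methods are hypothetical, everything lives in memory, and a Ruby block stands in for the browser connection. Subscribers hear from the server only when the set of viewers on a ticket actually changes.

```ruby
# Illustrative sketch only: an in-memory "who is viewing this ticket" tracker
# that pushes updates to subscribers instead of being polled.
class TicketPresence
  def initialize
    @viewers = Hash.new { |hash, key| hash[key] = [] }      # ticket id => agents
    @subscribers = Hash.new { |hash, key| hash[key] = [] }  # ticket id => callbacks
  end

  # Register a callback that receives the viewer list whenever it changes.
  def subscribe(ticket_id, &callback)
    @subscribers[ticket_id] << callback
  end

  # An agent opens a ticket. Push only if the viewer list changed.
  def join(ticket_id, agent)
    return if @viewers[ticket_id].include?(agent)
    @viewers[ticket_id] << agent
    push(ticket_id)
  end

  # An agent closes a ticket. Push only if the agent was actually viewing it.
  def leave(ticket_id, agent)
    push(ticket_id) if @viewers[ticket_id].delete(agent)
  end

  private

  # Send a copy of the current viewer list to every subscriber.
  def push(ticket_id)
    snapshot = @viewers[ticket_id].dup
    @subscribers[ticket_id].each { |callback| callback.call(snapshot) }
  end
end

presence = TicketPresence.new
presence.subscribe(42) { |viewers| puts "Ticket 42 viewers: #{viewers.join(', ')}" }
presence.join(42, "alice")   # pushes an update
presence.join(42, "alice")   # nothing changed, so nothing is sent
presence.join(42, "bob")     # pushes an update
```

In the common case where only one agent has a ticket open, a single update goes out when they open it and nothing more is sent until something changes.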
Like any large software system, SupportBee has a lot of moving components. We have multiple databases, web servers, mail servers, mail importers, etc. We decided very early on in our investigation that even the most stable components can fail (often in subtle ways) and that we have to monitor the system at every place possible. With this insight in mind, we set up monitoring and alerts for every component we have. In fact, we went as far as writing our own monitoring plugin when we weren't happy with what was available. We started seeing the benefits of such extensive monitoring in the same week that we introduced it. Here, for example, is our monitoring plugin alerting us to a faulty mail server in time. Notice the 450+ deferred emails?
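At its core, a check like this can be very small. The sketch below is not our actual plugin; the method name and both thresholds are made-up values for illustration, and it assumes the number of deferred emails has already been read from the mail server. It simply turns that count into a status and a message an alerting system could act on.

```ruby
# Illustrative sketch only: classify a deferred-email count into a status.
# Both thresholds are hypothetical; pick values that suit your own traffic.
WARN_DEFERRED = 100
CRIT_DEFERRED = 400

# Returns a [status, message] pair for the given number of deferred emails.
def deferred_mail_status(deferred_count, warn: WARN_DEFERRED, crit: CRIT_DEFERRED)
  if deferred_count >= crit
    [:critical, "CRITICAL: #{deferred_count} deferred emails"]
  elsif deferred_count >= warn
    [:warning, "WARNING: #{deferred_count} deferred emails"]
  else
    [:ok, "OK: #{deferred_count} deferred emails"]
  end
end

status, message = deferred_mail_status(450)
puts message
```

With these example thresholds, a count of 450 deferred emails would be reported as critical.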
Email conversations are very time sensitive, and because SupportBee is built around email, we care deeply about reliability and timely delivery. We hope you enjoy these performance improvements and the improved reliability. We'll be publishing in-depth blog posts about the performance improvements on our Dev Blog. Follow our Dev Team on Twitter to receive these updates.
PS: The July 12 outage was caused by a reliability improvement we deployed the day before. Isn't that ironic? :)