As you may remember, Toggl’s servers experienced some difficulties last Thursday. At the end of it, we promised to offer an explanation of what exactly had happened, and what we learned from it. The following is a full report on the server issues:
1. Managerial Overview
First things first: the main culprit was us, mainly for not being prepared enough for the start-of-the-year surge in new users. We did not anticipate an influx of users quite as large as it turned out to be, and were thus ill-prepared for the spikes in traffic. The load on our infrastructure grew to the point where it all broke down, simple as that. And we are really, really sorry about it.
The good news is that we were able to patch it by adding more servers and turning off some of the non-essential stuff. The bad news is that this isn’t a permanent solution, so we have to deal with the underlying issues in the (very near) future. So without further ado, here’s the shortlist of what we are doing:
- We are re-working our infrastructure to use technology able to process more requests;
- We are improving our monitoring of services, to catch probable bottlenecks before issues arise;
- We are beefing up our database servers to handle more load;
- We’ll be honing our communication skills so that you, our users, are never out of the loop about what is happening.
Again, we are very sorry this has happened and we’ll be working hard to prevent these issues in the future.
2. Technical Overview
When disaster struck, we initially thought our Go API was the culprit, as we started getting a lot of Internal Server Errors. We tried restarting the services, then the servers, double- and triple-checked everything, and it all seemed to be fine. Then our database load started falling, and we knew the problem lay somewhere upstream. We then started eliminating possibilities: pgbouncer? No. The API itself? No, running fine. Nginx? Not applicable, it only serves the web. Balancers?
There was our problem child. The API errors people saw were actually timeouts from Pound, our balancer, whose timeout was set to 30 seconds. We raised this to 5 minutes to see if anything changed and, lo and behold, Toggl started working again (albeit very slowly).
We then concluded that we were being swamped by incoming requests. We investigated further, looking for irregular patterns or DDoS-like activity, but most of the traffic was legitimate. What we did observe was that many requests were trying to post incorrect data, e.g. Time Entries for projects the user had no access to, very old Timeline data etc. Those were the easy targets, but unfortunately impossible to fix on the spot, as it would have meant releasing new versions of all clients and making everyone around the globe update theirs ASAP, which was not very reasonable. The rest was just the regular buzz of loads and loads of incoming requests.
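To illustrate what handling such rejections could look like on the client side, here is a minimal sketch in Go. It is not taken from any Toggl client: the `Action` type and the `classify` function are hypothetical names, and treating 408 and 429 as retryable is an assumption. The idea is that an item the server rejects with a 4xx status, such as a Time Entry for a project the user cannot access, is dropped from the sync queue instead of being resent unchanged.

```go
package main

import "net/http"

// Action is what a sync client does with a queued item after one attempt.
// (Hypothetical type, for illustration only.)
type Action int

const (
	Done  Action = iota // the server accepted the item
	Drop                // the server rejected the item itself; resending it unchanged cannot help
	Retry               // the failure looks transient; try again later, after a delay
)

// classify maps an HTTP status code to an Action.
func classify(status int) Action {
	switch {
	case status >= 200 && status < 300:
		return Done
	case status == http.StatusRequestTimeout || status == http.StatusTooManyRequests:
		// Assumption: these two 4xx codes describe the moment, not the payload.
		return Retry
	case status >= 400 && status < 500:
		// Bad data, no access, unknown resource: the payload is the problem.
		return Drop
	default:
		// 5xx and anything unexpected: keep the item and retry later.
		return Retry
	}
}
```

A client would call `classify(resp.StatusCode)` after each sync attempt and remove the item from its queue on `Done` or `Drop`, so rejected data stops generating traffic.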
It seems we had hit a limit with our old configuration of small Rackspace machines and Pound as an SSL terminator/balancer. It probably started as a brief network loss on one of our balancers, which triggered a lot of reconnects when it came back online. That created a lot of waiting connections that timed out, which in turn triggered more sync requests, and so on. It was a vicious circle. Relief came when we added another balancer to the pool, which gave it enough capacity to chew through the whole backlog of requests.
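A reconnect storm like this is a known failure mode: many clients retrying in lockstep keep the servers saturated. A common way to dampen it is exponential backoff with random jitter, so that retries spread out over time instead of arriving together. The following Go sketch shows the idea; the function names and the `baseDelay` and `maxDelay` values are illustrative assumptions, not settings from any Toggl client.

```go
package main

import (
	"math/rand"
	"time"
)

const (
	baseDelay = 1 * time.Second // assumed upper bound for the first retry wait
	maxDelay  = 5 * time.Minute // assumed cap, so waits do not grow forever
)

// backoffCeiling returns the upper bound of the wait before retry number
// attempt (0-based): baseDelay doubled once per attempt, capped at maxDelay.
func backoffCeiling(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// retryDelay picks a uniformly random wait in [0, backoffCeiling(attempt)).
// The randomness ("full jitter") keeps clients that failed at the same
// moment from retrying at the same moment.
func retryDelay(attempt int) time.Duration {
	return time.Duration(rand.Int63n(int64(backoffCeiling(attempt))))
}
```

A client would sleep for `retryDelay(n)` before its n-th retry and reset `n` to zero after the first success, which also gives a gentle ramp-up once the service recovers.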
Therefore we will be moving towards a new setup with nginx as a combined SSL terminator, balancer and frontend (static) file server. All our tools (web, desktop, mobile etc.) will be updated to back off on errors and to resume gently once the situation is resolved. We’ll add correct handling of HTTP 400 responses to the clients, because ignoring them creates unnecessary traffic that only builds up with time. There will be rate limiting to catch rogue clients that make too many requests. And we are going to fine-tune our monitoring to cover more of the things that matter.
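As an illustration of the rate limiting mentioned above, here is a minimal per-client token bucket in Go. It is a sketch under assumptions, not the mechanism Toggl will actually deploy: the `Limiter` type, its methods and the example limits are hypothetical. Time is passed in as a duration since some fixed starting point, which keeps the sketch deterministic; a real server could pass `time.Since(start)`.

```go
package main

import (
	"sync"
	"time"
)

// bucket holds the remaining request budget of one client.
type bucket struct {
	tokens float64
	last   time.Duration // when the bucket was last refilled
}

// Limiter lets each client send up to burst requests at once and
// rate requests per second on average. (Hypothetical type.)
type Limiter struct {
	mu      sync.Mutex
	rate    float64
	burst   float64
	buckets map[string]*bucket
}

// NewLimiter returns a Limiter with the given per-client rate and burst.
func NewLimiter(rate, burst float64) *Limiter {
	return &Limiter{rate: rate, burst: burst, buckets: make(map[string]*bucket)}
}

// Allow reports whether client may make a request at time now, where now
// is the duration elapsed since a fixed starting point.
func (l *Limiter) Allow(client string, now time.Duration) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	b, ok := l.buckets[client]
	if !ok {
		// A new client starts with a full bucket.
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[client] = b
	}
	// Refill in proportion to the time passed, never above burst.
	if elapsed := (now - b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * l.rate
		if b.tokens > l.burst {
			b.tokens = l.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false // over the limit: reject, e.g. with HTTP 429
	}
	b.tokens--
	return true
}
```

With `NewLimiter(1, 2)`, a client can send two requests back to back and then one per second; a well-behaved client never notices the limit, while one stuck in a retry loop is turned away cheaply.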
To make Toggl more resilient and provide more headroom for the future, we’ll be upgrading the database server. This will happen on February 14th, and all of Toggl will be unavailable for a couple of hours. There will be another maintenance window later to further optimize the new machine and our database.
We will announce all maintenance windows beforehand and in detail, so please stay tuned.