I have to start with an apology – we are very sorry we had more server issues yesterday. It seems our current infrastructure is giving up on us and we have to really work hard on beefing it up.
Here is the recap from the previous report:
1. We are re-working our infrastructure to use technology able to process more requests;
2. We are improving our monitoring of services, to catch probable bottlenecks before issues arise;
3. We are beefing up our Database Servers to handle more load;
4. We’ll be honing our communication skills so that you, our users, are never out of the loop about what is happening.
The first part is already well underway, and it helped us a lot in getting yesterday’s situation under control. Last time all of Toggl was offline, but this time the public web worked fine. Unfortunately, the same cannot be said of our API or databases.
Server maintenance on February 14.
The next step from the above list is to upgrade our database server on February 14. This means Toggl will be unavailable for 3 hours, from 03:00 to 06:00 EST (08:00 to 11:00 UTC).
More maintenance windows are coming in the future, but we will announce these in advance, along with explanations of why they’ll benefit you all.
The technical report:
Here’s a more detailed look at what happened. The good news is that the steps taken since the last downtime are working. We did not experience any issues with our balancers or front-end servers, and the public web was reachable at all times. I’d also like to stress that this outage was not related to our new web at all; it was just an unfortunate coincidence. There was also no loss of data from the servers, nor any breach of security.
This time our API gave up the ghost. We are still investigating the reasons behind it, but the chief culprit seems to be the Toggl Desktop and its Timeline feature. The feature generates excess traffic and does not conform to some responses from the API servers. We will therefore release new versions of all Toggl Desktops, and we urge everyone to upgrade.
Other factors at play were some external tools using our API. They sent excess requests and did not process server responses sufficiently, which led to multiple unnecessary repeated requests. To handle these situations, rate limiting will be enforced soon. We’ll be clarifying our policies on API usage and will be contacting all the developers.
And finally, the third reason is that our user base grew beyond our predictions. As I said the last time, we just weren’t ready for it. Unfortunately, “beefing up” our infrastructure will consist of small steps, and this takes time. Some of these steps also require taking Toggl offline, so they will have to be communicated in advance (which we intend to do).
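For developers wondering what rate limiting could look like in practice, here is a minimal token-bucket sketch in Python. It is an illustration only, not Toggl's actual implementation: the `TokenBucket` class, its `rate` and `capacity` parameters, and the injectable `clock` are all hypothetical names chosen for this example, and no real limits are implied.

```python
import time


class TokenBucket:
    """Hypothetical per-client limiter: refills `rate` tokens per second,
    holding at most `capacity` tokens (the allowed burst size)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if a request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request spends one token
            return True
        return False
```

A server using something like this would typically answer a rejected request with HTTP status 429 (Too Many Requests), and a well-behaved client would slow down rather than repeat the request immediately.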
But even together, all of these things shouldn’t be able to just kill the servers. At worst they should have made the servers slower than usual and triggered some early warnings, but that is not what happened. Looking at the situation now, we can say that one of the API servers went offline. That meant the remaining API servers had to manage all the load, and it turned out to be too much for them.
Our monitoring alerted us that Toggl was down, and naturally we first checked the balancers, as those were the problem last time. This time, though, they were OK, happily serving the web. Database load, on the other hand, was nonexistent, meaning we had problems with our API infrastructure. We quickly added another machine and the API started clearing up. Unfortunately, all sessions were lost during the outage, meaning all users were logged out of Toggl. And then they all started logging back in, all at once.
For the next hour or so, our server handling sessions got hammered really badly. Only once it had worked through the whole backlog did the situation improve, allowing people to use Toggl as normal. That is another part of Toggl that is going to get an overhaul in the coming weeks, making the system more resilient.
Again, I must stress how sorry we are that this happened, and so close to the last time. We are working on improving the situation so that Toggl will be an even more awesome time tracking tool than it is now.
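The arithmetic behind losing one server is worth spelling out. The sketch below uses made-up numbers and hypothetical function names (`load_per_server`, `survives_failure`); it only shows how the per-server load jumps when the same traffic is spread over fewer machines.

```python
def load_per_server(total_load, servers, failed=0):
    """Load each remaining server must carry if traffic is spread evenly."""
    remaining = servers - failed
    if remaining <= 0:
        raise ValueError("no servers left to carry the load")
    return total_load / remaining


def survives_failure(total_load, servers, capacity_per_server, failed=1):
    """True if the pool still has enough headroom after `failed` servers drop out."""
    return load_per_server(total_load, servers, failed) <= capacity_per_server


# Illustrative numbers only: 900 requests/s over 3 servers is 300 each,
# but 450 each once one server is gone.
print(load_per_server(900, 3))            # 300.0
print(load_per_server(900, 3, failed=1))  # 450.0
```

With these example figures, a pool of three servers that can each handle 400 requests per second runs comfortably until one fails, at which point the remaining two are overloaded. A fourth server would have absorbed the same failure.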
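One common way for clients to avoid this kind of stampede is to retry with exponential backoff and random jitter, so that reconnecting users are spread out over time instead of arriving together. The following is a minimal sketch under that assumption, not a description of how Toggl's apps behave; `backoff_delay`, `login_with_backoff`, and the `try_login` callback are hypothetical names.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Pick a random wait between 0 and an exponentially growing ceiling."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def login_with_backoff(try_login, max_attempts=5, sleep=time.sleep):
    """Call `try_login` until it returns True, waiting a jittered delay between tries."""
    for attempt in range(max_attempts):
        if try_login():
            return True
        if attempt < max_attempts - 1:
            # Randomised wait so many clients do not retry at the same instant.
            sleep(backoff_delay(attempt))
    return False
```

Because every client draws its own random delay, a burst of simultaneous logouts turns into a gradual trickle of logins rather than a single spike.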