General Toggl News

Notes on Yesterday’s Server Problems

Yesterday we had our second-longest unannounced downtime ever, and unlike last time (when our provider’s power network experienced a meltdown), this one was entirely our own fault.

We dropped the ball here, and we’re really sorry.

Below you’ll find notes on what caused the issues, and what lessons we learned from the event.

Timeline of events

Toggl was down for a little over an hour, with smaller performance issues occurring over the course of several hours.

  • At 14:45 UTC yesterday we spotted the first problems: queries were taking too long, and we were alerted.
  • By 15:00 UTC we were down and stayed down for over an hour. We then experienced intermittent connectivity for around 3 hours.
  • By 19:30 UTC, the service was back in limited mode.
  • The last issues were cleared by 20:45 UTC.

Cause of server issues

The reason behind the initial crash was really simple: a piece of code that was part of recent improvements was not optimized well enough, and thus took too long to complete.

As it was one of the most heavily used code paths, the situation gradually got worse and worse, with more and more requests waiting for data, until we ran out of resources at the database level. That’s when the initial crash happened.
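
To make that failure mode concrete, here is a minimal Go sketch of the general pattern. This is illustrative only, not our actual code: the driver choice, connection string and table name are all invented. The idea is that bounding the connection pool and giving every query a deadline turns a slow hot path into fast, visible errors instead of an ever-growing queue:

    // Sketch: fail fast on a slow hot-path query instead of letting
    // waiting requests exhaust the database.
    package main

    import (
        "context"
        "database/sql"
        "log"
        "time"

        _ "github.com/lib/pq" // hypothetical choice of PostgreSQL driver
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/example?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }

        // Without limits, every waiting request holds a connection, so one
        // slow query on a popular endpoint gradually starves the database.
        db.SetMaxOpenConns(50)                 // cap concurrent connections
        db.SetMaxIdleConns(10)                 // keep a small warm pool
        db.SetConnMaxLifetime(5 * time.Minute) // recycle stale connections

        // A per-query deadline makes the one slow code path fail loudly
        // instead of quietly degrading everything around it.
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        var count int
        err = db.QueryRowContext(ctx, "SELECT count(*) FROM time_entries").Scan(&count)
        if err != nil {
            log.Printf("query failed fast instead of queueing: %v", err)
            return
        }
        log.Printf("time entries: %d", count)
    }

The specific numbers don’t matter; the point is that limits and deadlines make overload show up as clear errors rather than a slow, silent pile-up.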

Technical details & our response

We spotted the issue and started mitigating, first by fixing the inefficiencies and then by adding resources. In the meantime, with the database end of the system malfunctioning, the API machines were experiencing problems of their own due to excessive traffic and hundreds of thousands of reconnects happening simultaneously. When it became too much to handle, the API servers dropped out of the pool. We countered that by adding more API resources.
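
For those curious what a defense against such a reconnect stampede can look like, below is a simplified Go sketch of exponential backoff with jitter, which spreads reconnect attempts out over time instead of letting every client retry in lockstep. The endpoint and retry limits are hypothetical, not our actual client code:

    // Sketch: reconnect with exponential backoff and jitter so that
    // thousands of clients do not retry at the same instant.
    package main

    import (
        "errors"
        "log"
        "math/rand"
        "net/http"
        "time"
    )

    // connect stands in for whatever call re-establishes a session.
    func connect() error {
        resp, err := http.Get("https://api.example.com/ping") // hypothetical endpoint
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return errors.New(resp.Status)
        }
        return nil
    }

    func reconnectWithBackoff() error {
        backoff := time.Second
        const maxBackoff = 2 * time.Minute

        for attempt := 1; attempt <= 10; attempt++ {
            err := connect()
            if err == nil {
                return nil
            }
            log.Printf("attempt %d failed: %v", attempt, err)

            // Full jitter: sleep a random duration up to the current backoff,
            // so clients drift apart instead of reconnecting in lockstep.
            time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
            backoff *= 2
            if backoff > maxBackoff {
                backoff = maxBackoff
            }
        }
        return errors.New("could not reconnect after 10 attempts")
    }

    func main() {
        if err := reconnectWithBackoff(); err != nil {
            log.Fatal(err)
        }
    }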

The fixes and added resources didn’t have a lasting effect, so we decided to revert the code to its last known good state from the previous week.

This did not improve matters. On the contrary – we suddenly couldn’t get the site running at all.

We struggled for quite a while with possible causes until we spotted a suspicious detail: all the new database resources we had added were sitting idle. Re-checking the entire setup revealed that most of the API servers had dropped out of the pool of available resources, and all the traffic was being served by the couple that remained.

Yes, you read that right: at one point we had only two operational API servers running, and we failed to notice that for the better part of an hour.

That’s what I meant when I said we really dropped the ball here.

It turns out the actual code issue was fixed within 30 minutes; most of the time was spent chasing ghosts, trying to fix things that had no effect and running circles around the real problem: API servers that were simply nonfunctional.

As soon as we figured that out, the service was back up – but we had lost several hours.

Aftermath & lessons learned

This has been a very painful but also very valuable lesson for us.

It was a case of not thinking clearly: trying to fix things that weren’t really broken, and ignoring things that did deserve attention.

That comes down to the processes by which we operate and the critical data we monitor. We plan to improve on both.
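
To give one example on the monitoring side: the kind of check that would have caught our blind spot watches the size of the healthy pool, not just individual machines. Here is a simplified Go sketch; the host names, threshold and interval are hypothetical:

    // Sketch: alert on the size of the healthy pool, not only on
    // individual machines.
    package main

    import (
        "log"
        "net/http"
        "time"
    )

    // Hypothetical health-check endpoints for the API pool.
    var apiServers = []string{
        "http://api-1.internal:8080/health",
        "http://api-2.internal:8080/health",
        "http://api-3.internal:8080/health",
        "http://api-4.internal:8080/health",
    }

    const minHealthy = 3 // alert as soon as the pool is thinner than this

    func healthyCount() int {
        client := &http.Client{Timeout: 2 * time.Second}
        healthy := 0
        for _, url := range apiServers {
            resp, err := client.Get(url)
            if err != nil {
                continue // unreachable counts as unhealthy
            }
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                healthy++
            }
        }
        return healthy
    }

    func main() {
        for {
            if n := healthyCount(); n < minHealthy {
                // In production this would page an engineer; a log line
                // stands in for the alert here.
                log.Printf("ALERT: only %d of %d API servers healthy", n, len(apiServers))
            }
            time.Sleep(30 * time.Second)
        }
    }

With an alert like this in place, “only two servers left in the pool” would have been the first thing we saw, not the last.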

I cannot promise that downtime will never happen again: in a world of ever more complex IT systems and ever-growing data, some errors are bound to happen at some point. But we will learn from this particular case and do our best to anticipate similar scenarios.

I’d like to apologize once again for the inconvenience the downtime may have caused you.

Thank you for your patience and trust in us.

Krister Haav

Toggl CEO

By Krister Haav on January 17, 2018

  1. Thanks so much for such an honest appraisal: of the cause of the problems, but mostly your statement “we will learn from this particular case and will do our best to anticipate similar scenarios.” We are all only human, and this is an honest, human acknowledgement of both the initial problem as well as our mutual human imperfections. Bravo!

  2. This is how you put out an apology. I wasn’t affected by the outage as I wasn’t using Toggl at the time but I just wanted to leave this here. Really impressed with the honesty and the clear determination to improve. Well done.

  3. Has this affected the desktop application’s ability to sync with the website? Most of my team is having issues with the desktop app today.

  4. Besides everything else I love about Toggl, your transparency and honesty keep me amazed. Kudos, Krister!

  5. It seems that when Toggl crashed yesterday and I was able to get back in, a message popped up that I cannot get rid of or find anything about. This is causing performance issues with Toggl, and I have not used it much today. The message is: “Missing callback. See the log file for details.” Not sure what callback or log file is being referenced. I tried uninstalling and reinstalling and am still receiving the message. Has anyone else experienced this after yesterday’s issue?

  6. Thanks for the report, this is really professional and deeply appreciated. No worries, website is great and appreciated!

  7. Great write-up! I only happened upon this because I logged in on the web interface and noticed that changes I made to data collected hadn’t persisted. I’d like to see an email (push notification) sent to my registered email alerting me to this having happened so that I can validate my data entries.

    At the same time, I’m on the free tier and have no room to ask for anything! (I’m more than willing to start paying for your WONDERFUL service, but apparently don’t need any of your advanced features.)

  8. No problem! I value when people admit their troubles… We’re humans after all! Let’s move forward, targeting bigger goals! Thank you for the free service for a lone, clueless developer! You’re helping me a lot!

  9. I’m a small customer so the issue didn’t affect me much but I really appreciate the straight-up explanation of the situation and hope that the situation doesn’t happen very much in the future. I’m sure it must have been a brutal day for a bunch of people there as well as for customers who rely more heavily on Toggl.