Service monitoring does not require being at your desk

This technical blog post is intended for system administrators. If you are not the person managing your servers, show a little help and appreciation to the one who is – pass this link on; maybe they will pick up some ideas from the following article.

5 tips in short

  1. monitor each of your server/OS vitals; do not worry about your own services at this step
  2. set up notifications for exceptional situations in your software, but only truly exceptional ones, to reduce noise
  3. add third-party monitoring to your public service, to catch possible problems between the outside world and your datacenter
  4. require each of your internal services to include a “dependency check and report”
  5. monitor your backends, checking each service on each backend from each gateway. Get notified of anomalies such as slow or missing responses

How we got here

Toggl.com is a web service providing time tracking functionality, with both a machine-accessible API and a human-oriented website. For the service to be available 24/7 to customers across the globe, it needs to be monitored closely. Any problems that appear should be detected and fixed before they cause any outages. Through our experience we realized that monitoring is a bigger challenge than most people think.

A lot of popular services (like Pingdom, New Relic and really lots of others) can ease the monitoring burden. People sign up for one of them and think they have their monitoring needs covered. Alas, we have yet to see a complete “drop-in” service that solves the problem entirely. Our solution was to combine some of the popular services that we could easily integrate with some in-house development. We feel that new start-ups could reuse some of our experience, and to make that happen we are open-sourcing a part of it.

First, your service needs to have at least some monitoring. This is more obvious today, but it still has to be said again and again. What to monitor, you ask? Well, basically everything.

Start with the servers

For each server we recommend monitoring CPU, disk IO and free space, memory usage and swap (the truth is that swapping out is not a problem; reading back from swap is very bad). This catches immediate problems. Often forgotten, but also needed, are the fork rate, the number of network connections and the process/thread counts; this information is needed to plan scaling. At Toggl we currently use self-hosted Munin. We picked Munin because it is easy to extend: we have written custom plugins to monitor our own applications and even business data over long periods of time.
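To give a rough idea of how little a custom Munin plugin needs: Munin calls the plugin with the argument "config" to ask for graph metadata, and with no argument to ask for current values. The sketch below is a hypothetical plugin in Go (the language we use for our other monitoring tooling) that graphs a business metric pulled from a made-up internal stats endpoint; the URL and the "signups" field are assumptions for illustration, not our real setup.

    // A hypothetical custom Munin plugin in Go. Munin runs the plugin with the
    // argument "config" to learn about the graph, and with no argument to fetch
    // the current value. The stats URL and the "signups" field are invented.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "os"
        "time"
    )

    func main() {
        if len(os.Args) > 1 && os.Args[1] == "config" {
            // Describe the graph; Munin asks for this separately from the values.
            fmt.Println("graph_title Signups per 5 minutes")
            fmt.Println("graph_category business")
            fmt.Println("graph_vlabel signups")
            fmt.Println("signups.label signups")
            return
        }

        // Fetch the current value from a hypothetical internal stats endpoint.
        client := &http.Client{Timeout: 5 * time.Second}
        resp, err := client.Get("http://localhost:8080/internal/stats")
        if err != nil {
            fmt.Println("signups.value U") // "U" tells Munin the value is unknown
            return
        }
        defer resp.Body.Close()

        var stats struct {
            Signups int `json:"signups"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
            fmt.Println("signups.value U")
            return
        }
        fmt.Printf("signups.value %d\n", stats.Signups)
    }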

Report errors from code

The next obvious thing to monitor is errors and exceptions in your own code. There are, again, a lot of good third-party solutions available as a service. All you need is to include a client in your application, and all errors will be sent to a central server. *Warning*: by default such clients often send sensitive data to a third-party server – for example user emails, passwords or authentication data such as cookies. You do not want that, and luckily this default can usually be customized. At Toggl we have taken care that sensitive data is not sent to any third party.

At some point we realized that for such a service to be useful it has to include only truly exceptional cases. A user entering an email with multiple @ signs – not exceptional. A user importing a time entry that ended two days before it started – not exceptional. A user getting an error while saving valid data – that is exceptional, and programmers must look into it, even if no one contacted support. The basic assumption: as long as you keep getting these exceptional reports, you can be sure that your service is working and being used. What does it mean if you do not get a single exception a day? Either your service is perfect or no one is using it.
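The exact filtering hook depends on the client library you use, but the idea is always the same: scrub sensitive keys before the report ever leaves your servers. Here is a minimal sketch in Go with a hypothetical Report type; the key list is illustrative.

    // A minimal sketch of scrubbing sensitive data before an error report leaves
    // your servers. The Report type and the key list are hypothetical; real
    // error-tracking clients usually offer a filter hook where code like this fits.
    package main

    import "fmt"

    type Report struct {
        Message string
        Context map[string]string // request params, headers, cookies, etc.
    }

    var sensitiveKeys = map[string]bool{
        "password":  true,
        "email":     true,
        "cookie":    true,
        "api_token": true,
    }

    // Scrub replaces sensitive values so they never reach a third-party server.
    func Scrub(r *Report) {
        for k := range r.Context {
            if sensitiveKeys[k] {
                r.Context[k] = "[FILTERED]"
            }
        }
    }

    func main() {
        r := &Report{
            Message: "failed to save time entry",
            Context: map[string]string{
                "user_id":  "42",
                "email":    "user@example.com",
                "password": "hunter2",
            },
        }
        Scrub(r)
        fmt.Println(r.Context) // map[email:[FILTERED] password:[FILTERED] user_id:42]
    }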

Data center internet connectivity

Your next best friend: services that let you enter any URL and get notified if that URL does not respond correctly. Good enough. Until you grow…

Growing means adding servers to a single service. There are many reasons why this happens. Machines are added to handle increased demand. Sometimes you have a new, highly experimental feature and you do not want possible bugs to ruin the existing service. Or you do not want server problems (they will happen, it is just a matter of time) to make the service unusable, so you duplicate each server in such a way that if one server dies, the service is not interrupted for end users. Now different parts of your service are handled by different servers, any of which may become unusable at any time.

The service you provide to your customers is now a combination of different internal services (though this is not visible to the customer). Each internal service has its own dependencies (for example, it may need access to a database and to a remote file system to store user avatars). Different services may use different databases, and the number of things you need to monitor grows. Verifying that a single URL returns success is not enough anymore. If a database used for some secondary activity goes down, it will be your customers who discover that your service is not working, and that will put your support team under heavy load.

You quickly add some more URLs to your monitoring solution. Eventually everything seems perfect. You go home knowing that your service is covered: everything (from DNS to the database) is duplicated and will keep working even if some servers die. Time passes, and one night you get alerted that the service is completely down. A few minutes later you already know the reason: one of the duplicated servers had not been working correctly for a few weeks, and just now the other one died. You built a system that handles single-server failures so well that you did not even notice some servers were not working.

Peek behind the firewall

A third-party monitoring solution that cannot see behind your firewall will not catch all errors. You will eventually forget to add some URLs to the third-party tool when you modify services. And with DNS balancing, your external monitor will always be luckier than your users. Still, you always need some kind of monitoring outside your data center, preferably spread across the globe; otherwise you end up in a situation where everything reports OK, but the datacenter's internet connection is actually down or is, for example, using cached DNS.

The system administrator is expected to take care of all of this, and of all the connections between machines. There are tools like Nagios that monitor everything and then some… Sadly, they take quite a lot of time to set up correctly, and really, do you need a 20+ function Swiss army knife every time you want to eat something with a fork?

We eased the burden on our system administrators by making each internal service monitor all of its dependencies and keep track of whether they go down. Each program providing part of an internal service, on each server, knows whether everything it needs is working. You might think that each program could simply call some reporting service and note that something is not working properly. That is almost a good solution, but it will not catch the case where the OOM killer has killed your whole application. So we decided to have a small utility sitting on each gateway server, checking whether all of our internal services are OK. All of this happens behind firewalls, so the information is not accessible to the outside world. And because it happens on the same server that multiplexes all our internal services into the single service visible to the outside world, the same utility also checks all firewall connections as a side effect.
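As a sketch of the “dependency check and report” idea from tip 4: each internal service can expose an endpoint that verifies its own dependencies and names the one that failed. The handler below is a minimal, hypothetical example in Go (the database address, the avatar directory and the /health path are invented for illustration); a real service would reuse its existing database connection instead of dialing it again.

    // A minimal, hypothetical "dependency check and report" endpoint.
    package main

    import (
        "encoding/json"
        "log"
        "net"
        "net/http"
        "os"
        "time"
    )

    // healthHandler verifies this service's own dependencies and reports each one
    // by name, so the gateway checker can see exactly what is broken.
    func healthHandler(w http.ResponseWriter, r *http.Request) {
        status := map[string]string{}
        ok := true

        // Dependency 1: the database must be reachable. A real service would
        // reuse its existing connection (e.g. sql.DB.Ping) instead of dialing.
        if conn, err := net.DialTimeout("tcp", "db.internal:5432", 2*time.Second); err != nil {
            status["database"] = err.Error()
            ok = false
        } else {
            conn.Close()
            status["database"] = "ok"
        }

        // Dependency 2: the avatar storage mount must be writable.
        if f, err := os.CreateTemp("/var/avatars", ".healthcheck"); err != nil {
            status["avatar_storage"] = err.Error()
            ok = false
        } else {
            f.Close()
            os.Remove(f.Name())
            status["avatar_storage"] = "ok"
        }

        if !ok {
            w.WriteHeader(http.StatusServiceUnavailable)
        }
        json.NewEncoder(w).Encode(status)
    }

    func main() {
        http.HandleFunc("/health", healthHandler)
        log.Fatal(http.ListenAndServe(":8081", nil))
    }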

Our solution for internal service monitoring

For the utility we decided to go with a client-server design. The client is very simple: it gets a list of services to check from the server, checks each one at a given interval and posts everything back to the same server. For the client we picked Google's Go language, as we did not want to install anything extra onto the gateways just for a supporting function; the result of that decision is a single binary with a resident memory footprint of about 8 MB. A simple upstart script ensures that it is actually running.
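To make the shape of the client concrete, here is a stripped-down sketch of that loop in Go: fetch the list of checks from the central server, probe each one, and post the results (including how long each check took) back. The URLs, JSON fields and the fixed 30-second interval are illustrative; the released client is more complete.

    // A stripped-down sketch of the gateway client: fetch the list of services to
    // check from the central server, probe each one, and post the results back.
    // URLs, JSON fields and the fixed interval are illustrative only.
    package main

    import (
        "bytes"
        "encoding/json"
        "net/http"
        "time"
    )

    type Check struct {
        Name string `json:"name"`
        URL  string `json:"url"`
    }

    type Result struct {
        Name       string `json:"name"`
        OK         bool   `json:"ok"`
        DurationMS int64  `json:"duration_ms"`
    }

    var httpClient = &http.Client{Timeout: 10 * time.Second}

    func fetchChecks(server string) ([]Check, error) {
        resp, err := httpClient.Get(server + "/checks")
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var checks []Check
        return checks, json.NewDecoder(resp.Body).Decode(&checks)
    }

    func runCheck(c Check) Result {
        start := time.Now()
        resp, err := httpClient.Get(c.URL)
        ok := err == nil && resp.StatusCode == http.StatusOK
        if resp != nil {
            resp.Body.Close()
        }
        return Result{Name: c.Name, OK: ok, DurationMS: time.Since(start).Milliseconds()}
    }

    func report(server string, results []Result) {
        body, _ := json.Marshal(results)
        if resp, err := httpClient.Post(server+"/results", "application/json", bytes.NewReader(body)); err == nil {
            resp.Body.Close()
        }
    }

    func main() {
        const server = "http://monitor.internal:9000" // hypothetical central server
        for {
            if checks, err := fetchChecks(server); err == nil {
                results := make([]Result, 0, len(checks))
                for _, c := range checks {
                    results = append(results, runCheck(c))
                }
                report(server, results)
            }
            time.Sleep(30 * time.Second)
        }
    }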

The server feels almost like a single point of failure, but it is a single installation that reports all of its dependencies to Pingdom. If this one server goes down, or has trouble with any of its dependencies, we will know within seconds. All decisions are made by the server: it keeps track of all outages and slowdowns (yes, our utility also measures the time each service takes to report its status over the internal network, so we see slowdowns before our users notice them) and notifies us based on urgency.
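The server-side logic does not have to be complicated. A rough sketch of the two alerts we care about, missing reports and unusually slow responses, could look like this; the thresholds and the notify stand-in are invented for the example.

    // A rough sketch of server-side alerting: flag a service when its checks stop
    // arriving, or when a response is much slower than its recent average.
    // Thresholds and the notify stand-in are invented for the example.
    package main

    import (
        "fmt"
        "time"
    )

    type ServiceState struct {
        LastSeen  time.Time
        Durations []int64 // recent check durations in milliseconds
    }

    const (
        maxSilence   = 2 * time.Minute // no report for this long => treat as down
        slowFactor   = 3               // this many times the average => slowdown
        historyLimit = 100
    )

    func notify(msg string) { fmt.Println("ALERT:", msg) } // stand-in for email/SMS

    func (s *ServiceState) Record(name string, durationMS int64, now time.Time) {
        s.LastSeen = now
        if n := len(s.Durations); n > 0 {
            var sum int64
            for _, d := range s.Durations {
                sum += d
            }
            if avg := sum / int64(n); avg > 0 && durationMS > avg*slowFactor {
                notify(fmt.Sprintf("%s is slow: %dms (recent average %dms)", name, durationMS, avg))
            }
        }
        s.Durations = append(s.Durations, durationMS)
        if len(s.Durations) > historyLimit {
            s.Durations = s.Durations[1:]
        }
    }

    func (s *ServiceState) CheckSilence(name string, now time.Time) {
        if now.Sub(s.LastSeen) > maxSilence {
            notify(fmt.Sprintf("%s has not reported for %s", name, now.Sub(s.LastSeen)))
        }
    }

    func main() {
        s := &ServiceState{LastSeen: time.Now()}
        s.Record("reports-api", 40, time.Now())
        s.Record("reports-api", 200, time.Now()) // 5x the average, triggers a slowdown alert
        s.CheckSilence("reports-api", time.Now().Add(5*time.Minute))
    }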

Our server implementation is currently tightly integrated into our back office, so we are not releasing it at this time; instead, a simple sample server is included with the client source code. By the way, we released the client source code under the BSD 3-clause license; you will find it in our GitHub repository. We are not releasing any binaries, because we strongly believe that you should never run a binary on your production server if you are not sure what it does, and only by inspecting the source code can you know that.

This is how we are currently solving our monitoring problems @toggl.
How is your server and service monitoring handled? Can you rely on your solution?
Leave a comment and let us know!