How 2wav Monitors Web Applications in the Cloud - Part 3
Our previous blog posts in this series examined two levels of monitoring: basic resource utilization and application stack monitoring.
Monitoring Your Infrastructure Provider
In addition to the resources that you provision, cloud-based applications rely on a number of other moving parts to keep running. The server that you’re running your application on is itself running within a set of highly specialized applications that are prone to faults of their own. While the applications that host your server aren’t under your control, issues within those environments can have a direct impact on your server and ultimately your application.
To save yourself some headache chasing down application outages that aren’t actually caused by your application or your server, it’s important to monitor the environment you’re hosting in. Most modern hosting providers offer status pages that report on the health of the various services that power end-user applications. Keeping an eye on these may mean the difference between wasting two hours chasing down an imaginary server issue and spending two minutes composing an email letting your users know about the outage and assuring them that it’s being worked on (albeit by your hosting provider).
At 2wav, we are heavily reliant on AWS for a number of our clients. We also use Slack a lot. AWS provides a status page with RSS feeds. Slack provides an app that can consume RSS feeds. Could there be a more perfect match? With Slack watching the AWS RSS feeds, we get notified in our #alerts Slack channel whenever anything is amiss at AWS.
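Setting this up takes about a minute per feed. Assuming the Slack RSS app is installed in your workspace, subscribing a channel to a feed is a single slash command. The feed URL below is illustrative; the AWS status page lists the exact RSS URL for each service and region:

```
/feed subscribe https://status.aws.amazon.com/rss/ec2-us-east-1.rss
```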
Watching the Watchman
When deciding what to monitor, there is a blind spot that can easily be missed: the monitoring system itself. If your monitoring system goes down, you may not notice, especially if things work well most of the time and you’re not used to seeing notifications in your inbox or Slack channel. This risk is mitigated if you use a distributed monitoring system, or a cloud monitoring service, which will no doubt have a distributed and redundant deployment. But what if you’re not a big shop, you roll your own monitoring system, and can’t justify the overhead of a distributed setup?
To resolve that dilemma, 2wav uses a cloud monitoring provider to monitor our monitoring system. Monitor.us, which we mentioned in the first blog post in this series, offers a free tier that is excellent for monitoring our Sensu server for failures. Using this service, we monitor the processes on the Sensu server that support monitoring (sensu-server, sensu-client, sensu-api and redis-server). We also monitor the Sensu dashboard using an HTTP endpoint monitor. While we have had no failure of Sensu to date, this extra level of monitoring gives us added peace of mind, allowing us to focus on serving our clients.
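If your external monitor can check arbitrary HTTP endpoints, the Sensu API itself is a useful target: in Sensu 1.x, a GET against /info (port 4567 by default) reports whether the server can reach Redis and its transport, which is a deeper health signal than the dashboard alone. A quick sanity check from any machine might look like this (hostname and version are illustrative, and the response is trimmed to the relevant fields):

```
$ curl -s http://sensu.example.com:4567/info
{"sensu":{"version":"1.2.1"},"transport":{"connected":true},"redis":{"connected":true}}
```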
Notifications
When things go wrong, you’ll want to ensure that the right people are notified. In larger teams, you’ll want to route specific kinds of alarms to specific teams or team leaders. In smaller teams, you may not have clearly defined roles, and may simply notify everyone.
At 2wav, we use a Slack channel to notify the team of failures. Many monitoring tools and frameworks offer integrations with collaboration tools such as Slack, and these can be used to notify the entire team (a general channel) or a specific team (a server channel, or a DB channel). If you use Slack and decide to give Sensu a try, check out the Sensu Slack chat handler.
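For the curious, wiring Sensu to Slack amounts to defining a pipe handler and giving the plugin a Slack incoming-webhook URL. Here is a minimal sketch, assuming the sensu-plugins-slack gem is installed; the webhook URL is a placeholder, and option names may vary between plugin versions, so check the plugin’s README:

```json
{
  "handlers": {
    "slack": {
      "type": "pipe",
      "command": "handler-slack.rb"
    }
  },
  "slack": {
    "webhook_url": "https://hooks.slack.com/services/XXX/YYY/ZZZ",
    "channel": "#alerts"
  }
}
```

In Sensu 1.x, this JSON can live in any file under /etc/sensu/conf.d/, since Sensu merges all configuration files in that directory.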
Another thing to bear in mind with notifications is filtering out false positives. If users are inundated with false warnings of site outages, chances are they will develop a habit of ignoring them, including that one time when the wolf is actually at the village gate.
The Sensu check attributes “interval”, “occurrences” and “refresh” help reduce the alert chatter in the 2wav #alerts Slack channel. “Interval” specifies how frequently a check runs, in seconds. Combined with “occurrences”, it defines a threshold for failure: an “occurrences” value of 3 means you will only be notified after the check has failed 3 consecutive times. With an “interval” of 30 seconds, that means that in the event of an actual issue you will be notified within a minute and a half. “Refresh” then controls how often the handler re-sends the notification while the check remains failed, so a prolonged outage produces a periodic reminder rather than a stream of duplicate alerts.
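Putting those attributes together, a Sensu 1.x check definition along these lines (the command and subscription names are illustrative; check-http.rb comes from the sensu-plugins-http gem) would alert after three consecutive failures and then re-alert every 30 minutes while the problem persists:

```json
{
  "checks": {
    "check_site_http": {
      "command": "check-http.rb -u https://example.com/",
      "subscribers": ["web"],
      "handlers": ["slack"],
      "interval": 30,
      "occurrences": 3,
      "refresh": 1800
    }
  }
}
```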
If you find that your configuration generates too many false positives, consider increasing one or both values, taking into account how sensitive your application is to downtime. While you want to know as soon as an issue develops, a one-minute difference in notification time may mean the difference between an inbox or Slack channel filled with false positives that you have to clear, and being notified only on true failures.