Stopping Armageddon: A Discussion on Monitoring

This isn’t a post about how to set up specific tools. It’s more of a discussion on the concepts and processes that you should think through during development and implementation. This may end up being a series where I dive deeper into each of these bits.

We don’t write bulletproof code. Sometimes that’s hard to acknowledge because pride gets in the way. Even if you think you have worked out all the kinks and all the bugs, there will be something, somewhere: some kind of edge case that will come back to haunt you. So, how do we prepare ourselves for that moment? How do we ensure that we have all of the pieces in place to stop armageddon before it starts? Logging and monitoring.

For a lot of us, logging and monitoring are something we think about at the end of a project. The focus is always on how to solve the problem in front of us, not on how we are going to monitor the solution going forward. We need to change this line of thinking. Thinking through the operations side of a feature at the beginning of the project allows us to weave these changes throughout rather than tacking them on at the end.

You may be asking, “Those are some nice words, Nathan, but what should we actually be doing?” The first thing we need to do is define our key metrics.

Define

The answer isn’t to monitor all the things, all the time. That’s nice to say, but near impossible to actually do. No, what we actually need to do is identify the things that make or break our application. What is the goal of the project and how can we quantify its success?

Monitoring exception rates, request durations, etc., are a good starting point. However, we need to also be thinking about the non-technical metrics. It may be that the request durations have slowed by 10%, but is that really impacting the grand scheme of things? Again, the project we’re working on is trying to solve some kind of problem. How do we measure that?

For example, if you’re building an ecommerce site, you may want to monitor customer sign-ups and orders. If it’s a loyalty program, maybe you should watch enrollments, activities recorded and rewards redeemed. If it’s a product reviews API, your metric is probably reviews written. Point is, each project you work on should have something measurable that shows whether or not it’s accomplishing its goal. That’s your key metric.

Log

We have our key metrics identified, now what?

It’s not enough to just be able to query the metric in some database to get the current value. We need to see that metric logged over time so that we can establish a baseline and see a trend. There are a plethora of tools that can help you do this: Grafana, Application Insights, and Geckoboard, to name a few. Or… it could just be as simple as creating a new table in that database to summarize the metric over time. Pick something and ensure it’s updated frequently.
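To make the "new table" option concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `metric_snapshots`, its columns, and the helpers `record_snapshot` and `baseline` are all made up for illustration; in practice a scheduled job would write one row per interval to whatever database you already run.

```python
import sqlite3
from datetime import datetime, timezone

def record_snapshot(conn, metric, value, at=None):
    """Append one timestamped reading for a metric (hypothetical helper)."""
    at = at or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO metric_snapshots (metric, recorded_at, value) VALUES (?, ?, ?)",
        (metric, at.isoformat(), value),
    )
    conn.commit()

def baseline(conn, metric):
    """Return (min, avg, max) across every stored reading of a metric."""
    return conn.execute(
        "SELECT MIN(value), AVG(value), MAX(value) "
        "FROM metric_snapshots WHERE metric = ?",
        (metric,),
    ).fetchone()

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metric_snapshots ("
    "metric TEXT NOT NULL, recorded_at TEXT NOT NULL, value REAL NOT NULL)"
)

# Pretend these are sign-up counts from three consecutive 5 minute windows.
for count in (30, 42, 48):
    record_snapshot(conn, "customer_signups", count)

low, avg, high = baseline(conn, "customer_signups")
```

With enough rows collected, the min, average and max give you a first rough baseline to compare new readings against.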

Good, now we can tell when all hell is breaking loose. However, it won’t do us any good if we can’t find out why.

While working through your development tasks, make sure you’re keeping those key metrics in mind. When catching an exception, log enough detail and context to help yourself out later. If the exception is related to a specific customer account, for example, make sure you’re including that account’s ID or some other identifier. Don’t be afraid to log more than just errors either. As long as you use your log levels appropriately, you can typically filter out the lower levels until needed. Storage is cheap and more information is better.
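Here is a small sketch of what that can look like with Python's standard logging module. `PaymentError`, `charge` and `place_order` are hypothetical stand-ins for your own code; the parts that matter are the account ID in every message and the use of different log levels.

```python
import logging

# INFO and above are shown; DEBUG lines stay hidden until the level is lowered.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")

class PaymentError(Exception):
    """Hypothetical error raised by the payment step."""

def charge(account_id, amount):
    # Stand-in for a real payment call; it always fails in this sketch.
    raise PaymentError("card declined")

def place_order(account_id, amount):
    # Low-level detail, normally filtered out, available when needed.
    logger.debug("placing order account_id=%s amount=%s", account_id, amount)
    try:
        charge(account_id, amount)
    except PaymentError:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("order failed account_id=%s amount=%s", account_id, amount)
        return False
    logger.info("order placed account_id=%s", account_id)
    return True

ok = place_order("acct-1042", 19.99)
```

Because the account ID is part of the message, searching the logs for one customer's history becomes a simple text filter.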

Monitor

At this point, we have our metrics defined and some supporting messages being logged. What’s next? All that’s left to do is designate someone to query this data continuously and let the team know if any issues arise.

No, what we need to do now is set up some alerts. This is arguably the most important piece of this process. Without decent alerting, you are still in a reactive state. You won’t know anything is happening until someone is busting down your door.

Remember those metric trends that you set up earlier? Use those to figure out the threshold that the metric should fall within. Over a 5 minute period, maybe you normally see 30 - 50 customer signups and 15 - 20 orders placed. If the number ever falls lower than that, you should be notified. Think of a threshold that you’re comfortable with and set up an alert to fire whenever it falls outside of that safe zone.
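As a rough sketch, a threshold check does not need to be complicated. The `Threshold` class and `check` function below are invented for illustration, and the ranges are the example numbers from above; a real setup would usually lean on the alerting rules of whatever monitoring tool you picked.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Threshold:
    metric: str
    low: int    # lowest count considered normal for the window
    high: int   # highest count considered normal for the window

def check(threshold: Threshold, observed: int) -> Optional[str]:
    """Return an alert message if the reading is outside the safe zone, else None."""
    if observed < threshold.low:
        return f"{threshold.metric} is {observed}, below the expected minimum of {threshold.low}"
    if observed > threshold.high:
        return f"{threshold.metric} is {observed}, above the expected maximum of {threshold.high}"
    return None

# Safe zones for a 5 minute window, taken from the example ranges.
signups = Threshold("customer_signups", low=30, high=50)
orders = Threshold("orders_placed", low=15, high=20)

quiet = check(signups, 41)   # inside the safe zone, so no alert
alert = check(orders, 3)     # well below normal, so an alert message comes back
```

Checking the upper bound as well is a design choice: a sudden spike in sign-ups can be just as suspicious as a drop.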

Communicating alerts can be a tricky thing as well. Most places default to sending everything to email. However, in a world where we are being spammed with email from everywhere, it’s easy for things to get lost. If email is all you have though, just ensure that you set up proper rules and filtering within your inbox to handle it. Email is definitely the easiest to get up and going.

On top of that, I would really encourage looking at the tools your team uses for day-to-day communication (please don’t say email). Maybe you use Slack or Microsoft Teams… Hipchat… Skype… AOL? Whatever it is, my recommendation would be setting up a place within that application for these alerts to land. You’re in that tool all day anyway, so it makes sense.

The other option (maybe in addition to the above) is SMS. If you go this route, I highly recommend only using it in urgent situations, when it is truly necessary. No one would like their phone blowing up constantly with alerts saying ‘site performance degradation by 5%.’
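One way to keep SMS reserved for emergencies is to route alerts by severity. The sketch below assumes two made-up severity levels and takes the actual senders as plain functions, so nothing here depends on a specific chat or SMS provider.

```python
from enum import IntEnum
from typing import Callable, Dict, List

class Severity(IntEnum):
    WARNING = 1   # worth a look, but nobody needs to be paged
    URGENT = 2    # someone needs to act right now

def channels_for(severity: Severity) -> List[str]:
    """Everything lands in the team chat; only urgent alerts also go to SMS."""
    channels = ["chat"]
    if severity >= Severity.URGENT:
        channels.append("sms")
    return channels

def dispatch(message: str, severity: Severity,
             senders: Dict[str, Callable[[str], None]]) -> List[str]:
    """Send the message through each selected channel and return the channels used."""
    used = []
    for name in channels_for(severity):
        senders[name](message)
        used.append(name)
    return used

# Fake senders that only record what would have been sent.
sent = []
senders = {
    "chat": lambda text: sent.append(("chat", text)),
    "sms": lambda text: sent.append(("sms", text)),
}

dispatch("site performance degraded by 5%", Severity.WARNING, senders)
dispatch("orders have dropped to zero", Severity.URGENT, senders)
```

In this setup the 5% degradation only shows up in chat, while the zero-orders alert reaches both chat and phones.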

Summary

There are going to be bugs, and there are going to be issues in production. That’s the nature of technology. However, with a few changes in habit, we can better prepare ourselves for those issues. By starting the discussion up front, we can ensure we’re thinking through the monitoring pieces with each new piece of functionality we develop.

2019.05.18