Last week in Portland, I attended Monitorama, a conference on open source monitoring. Speakers and attendees demoed fancy new software, shared personal experiences, and – in true cross-disciplinary learning fashion – showed off lots of cool math! Keep your eye on this Vimeo page for forthcoming videos from the conference. In the meantime, let’s run through some of the conference highlights.
The current state of monitoring
Unfortunately, the current state of monitoring systems isn’t much better than it was 10 years ago. We still monitor systems using the same software, running on a monolithic server, with the same set of probes that have little to no intelligence built into them.
We still dread being on call, and all of us suffer from the same on-call fatigue — missing sleep, missing time with friends and family, and having a Pavlovian fight-or-flight response to that familiar buzzing or ringing of the on-call phone.
We still tell new hires, “Look at these graphs and familiarize yourself with how things look.” The very same graphs that we stopped looking at after a few months on the job!
There’s hope for the future
It’s not all doom and gloom, mind you. We can use science! to predict failures and find anomalies in our systems. Talks from Toufic Boubez, Noah Kantrowitz, Dr. Neil J. Gunther, and Baron Schwartz focused on taking a more scientific approach to monitoring, but they also emphasized that there is no silver bullet. For example, I can’t check out some magical project from GitHub, install it, and watch it crunch all my metrics and spit out intelligent predictions and analysis.
However, there are steps we can take now to improve the current state of things:
- Build robust monitoring systems – Monitoring systems should be at least as robust as the things they monitor. Too often we run our monitoring system as a single point of failure. If a system failure is critical enough to wake us up at 2 am, then why don’t we architect monitoring systems that are themselves fault tolerant? Build in redundancy, regularly test the systems, and have multiple paths of communication in case of failure. Monitor the monitoring systems themselves for failure.
- Monitor whether work is happening – If the site is up, do you care at 2 am that the load average spiked? Use that information for reporting and trend analysis in the waking hours, but don’t wake anyone up to tell them everything is fine.
- Avoid alert fatigue – We want the phone going off to mean something, so ditch or turn down everything but the truly critical states. Work with your peers to create a system in which you can temporarily hand off responsibility for discrete periods of time (bathroom, shower, putting kids to bed, or you had a rough night and just need a nap). Make this system as simple and reliable as possible.
- Escalate quickly and be persistent – If something really is critical, then our systems should notify us loudly and often. They should overcommunicate – send a page, send an email, pipe something to a chat – so that other people can assist and gather data.
- Create hand-off reports – Communicate recent alerts to the next person or team on duty, so everyone knows the current state of things and what problems to expect. These reports can also help you track what problems are recurring week over week.
- Rely more heavily on an open chat stream – During an outage, integrate alerts into chat so that the channel builds a timeline as you work. The chat later serves as a log of what steps were taken, which you can use to review what happened and decide how things could have gone better.
- Create playbooks for outage response – A “playbook” helps with shared knowledge and documentation across teams, and it also helps your future 2 am self when your brain won’t be working as well. You’ll need all the help you can get.
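To make the alert fatigue and escalation points a little more concrete, here is a minimal sketch of severity-based routing. It is an illustration only, not a real paging system: the severity names, the channel names, and the `route_alert` function are all assumptions invented for this example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severities and channels, invented for this sketch.
CHANNELS_BY_SEVERITY = {
    "critical": ["page", "email", "chat"],  # wake someone up, and overcommunicate
    "warning": ["chat"],                    # visible in waking hours, no page
    "info": [],                             # kept for reporting and trend analysis only
}


@dataclass
class Alert:
    check: str
    severity: str
    message: str


def route_alert(alert: Alert) -> List[str]:
    """Return the channels an alert should be sent to.

    Unknown severities go to chat only, so a mislabeled alert is still
    visible but never pages anyone.
    """
    return CHANNELS_BY_SEVERITY.get(alert.severity, ["chat"])
```

With a table like this, a load average spike tagged `info` never reaches the phone, while a site-down alert tagged `critical` goes out on every channel at once.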
One persistent theme throughout the conference was a call to concentrate on data analysis instead of piling on more and more monitoring. This focus on analysis is a distinct shift from a past in which we wanted tools that would give us more data about our systems. Now we have the data (maybe too much), and we need tools to help us analyze, store, and process that data. Tools like Graphite, ElasticSearch, Kibana, Logstash, and Heka (shameless self-promotion) are the keys to making sense of all that our systems are trying to tell us.
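As a small taste of what analysis over existing data can look like, here is a sketch that flags points sitting far from the recent average of a metric. This is a generic textbook z-score check, not a method taken from any of the talks, and it is no silver bullet: it assumes the metric is reasonably stable, and the `window` and `threshold` defaults are arbitrary values chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=10, threshold=3.0):
    """Return indexes of points more than `threshold` standard deviations
    away from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up sample data: a steady metric with one obvious spike at index 10.
samples = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 50, 11]
print(find_anomalies(samples))
```

Note that the spike itself then inflates the trailing window, which hides smaller problems for a while afterward. That is one reason a simple check like this is a starting point rather than an answer.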
Another theme of the conference was the growing adoption of the Lua and Go programming languages. Lua is a favorite for its speed and ease of embedding into existing applications. Go is favored for its concurrency and speed. Go is also a compiled language, which is itself part of a trend away from interpreted languages. Go code compiles extremely quickly, and in our heterogeneous environment, a single compiled binary is a welcome change from the multitude of version and language dependencies we have to manage for other projects.
Some wicked cool software
There was a lot of interesting software presented at Monitorama. The most interesting to me, though, were these gems:
Flapjack – monitoring alert and routing system (http://flapjack.io/)
- Segregates responsibility for self-service monitoring
- Takes input from multiple monitoring sources, aggregates them into Flapjack, and then outputs them to multiple destinations
- Includes end-to-end self-testing and alerts if messages are not flowing through the system
That last bullet is really important to me. If you’re going to put all your critical alerting data into a system, it had better have a way of telling you when it fails.
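The general idea behind that kind of self-test can be sketched as a watchdog that expects a regular test message and complains when it stops arriving. This is not how Flapjack implements it; the `HeartbeatWatchdog` class and its method names are hypothetical, made up to show the pattern.

```python
import time


class HeartbeatWatchdog:
    """Tracks when a test message last made it through the pipeline."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_seen = clock()

    def record_heartbeat(self):
        """Call this whenever a synthetic test message arrives at the far end."""
        self._last_seen = self._clock()

    def is_stale(self):
        """True if no heartbeat has arrived within the allowed window."""
        return self._clock() - self._last_seen > self.max_silence_seconds
```

The important design choice is that the check for staleness runs somewhere other than the pipeline being tested, and alerts over a separate path. Otherwise the failure that stops the heartbeats also stops the warning.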
Dashing – easy dashboard generator (http://dashing.io/)
- Runs on a Raspberry Pi or Chromecast
- Uses server-sent events
- Is not a third party you send your dashboard data to (i.e., you own your own data)
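For anyone who hasn’t run into server-sent events, the wire format is simple enough to sketch. The function below shows how one message is encoded for a `text/event-stream` response. It is not Dashing’s code, and the `format_sse` name, the `load_avg` event name, and the payload are made up for illustration.

```python
import json


def format_sse(data, event=None):
    """Encode one message in the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one data field is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


print(format_sse({"value": 42}, event="load_avg"), end="")
```

Because the stream is plain text over a long-lived HTTP response, a browser-based dashboard can pick up updates with the built-in `EventSource` API instead of polling.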
Wiff – packet processing pipeline (https://github.com/wayfair/wiff)
- Provides network analysis
- Is basically a real-time pcap to JSON converter
- Allows you to use all the cool existing JSON import tools for analysis
Ideas I would like to pursue here at Safari
I’d like to see us start thinking of monitoring as a service. Can we create an alert and log message pipeline so that others in the company can subscribe to certain portions and build tools? Can we make tools that allow monitoring and message passing to be self-service, so that operations isn’t a bottleneck for setting up monitoring and logging and some of the responsibility can be shared with the appropriate people? I imagine a system where the data flows freely and really smart people (way smarter than I) make awesome internal tools.
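To show the shape of what I mean by subscribing to portions of a pipeline, here is a toy in-memory sketch. A real version would sit on top of a proper message queue; the `MessageBus` class, the dotted topic names, and the prefix-matching rule are all assumptions invented for this example.

```python
from collections import defaultdict


class MessageBus:
    """In-memory stand-in for an alert and log message pipeline."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic_prefix, callback):
        """Register `callback` for every topic starting with `topic_prefix`."""
        self._subscribers[topic_prefix].append(callback)

    def publish(self, topic, message):
        """Deliver `message` to all matching subscribers; return how many received it."""
        delivered = 0
        for prefix, callbacks in self._subscribers.items():
            if topic.startswith(prefix):
                for callback in callbacks:
                    callback(topic, message)
                    delivered += 1
        return delivered


bus = MessageBus()
bus.subscribe("alerts.web", lambda topic, msg: print(f"[web team] {topic}: {msg}"))
bus.publish("alerts.web.latency", "p99 latency is high")
bus.publish("logs.db", "slow query")  # nobody has subscribed to this yet
```

The point is that a team can attach its own tooling with a single `subscribe` call, without asking operations to configure anything for them.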
Effective Monitoring and Alerting describes a data-driven approach to optimal monitoring and alerting in distributed computer systems. It interprets monitoring as a continuous process aimed at extraction of meaning from system data. The resulting wisdom drives effective maintenance and fast recovery – the bread and butter of web operations.
In Advanced Mathematics for Applications, Andrea Prosperetti draws on many years’ research experience to produce a guide to a wide variety of methods, ranging from classical Fourier-type series through to the theory of distributions and basic functional analysis.
Lua offers a wide range of features that you can use to support and enhance your applications. With Beginning Lua Programming as your guide, you’ll gain a thorough understanding of all aspects of programming with this powerful language. The authors present the fundamentals of programming, explain standard Lua functions, and explain how to take advantage of free Lua community resources. Complete code samples are integrated throughout the chapters to clearly demonstrate how to apply the information so that you can quickly write your own programs.
With Programming in Go: Creating Applications for the 21st Century you’ll learn how today’s most exciting new programming language, Go, is designed from the ground up to help you easily leverage all the power of today’s multicore hardware. With this guide, pioneering Go programmer Mark Summerfield shows how to write code that takes full advantage of Go’s breakthrough features and idioms.