Why Cloud Monitoring Matters in the App Age

October 1, 2015

Featured article by Dave Josephsen, Developer Evangelist for SolarWinds


The vast majority of advances in IT and software engineering over the last ten years have been driven by scale. Ephemeral infrastructure and PaaS enable us to scale hardware, CI/CD enables us to scale the rate of production change, and microservices, eventually consistent data stores, and distributed systems engineering enable us to scale the applications themselves.

Scale is a strange and unforgiving property. It cannot be fooled, and is never fully satiated. It has a way of unceremoniously obliterating conventional thinking. It demands 350 new servers right now and laughs mockingly at your change management meeting and your tape backups.

But like anything insurmountable, scale brings out the absolute best of what’s inside us — of our humanity; it demands our creativity, improvisation, and truth, and punishes our sloth, assumption and ritual. It encourages us to explore and does not care whether we succeed.

Building applications that scale hyper-focuses our attention and resources on the things that actually matter (like automating the deployment pipeline) to the detriment of things that waste time and effort (like procuring hardware and maintaining our own telemetry systems). It forces us to spend our time on our core competency and waste as little time as possible on yak-shaving expeditions. The problem is, it’s not always obvious what yak shaving should be abandoned altogether, and what yak shaving should be scaled up or outsourced.

Classic change management procedures are a perfect example. They don’t scale, but correctly implemented, they perform the very important task of protecting an operational production environment from human error. Here is a good litmus test for detecting yak shaving that should be ported rather than abandoned: if it saves us from larger yak-shaving expeditions in the future, it should be ported. Thus, change management got ported by scale into test-driven development, and is now built into most deployment pipelines (either run in-house with tools like Jenkins, or outsourced to tools like Travis), where it continues to protect us from human error, but in a more scalable way.
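To make the ported version concrete, here is a deliberately small, hypothetical sketch (the function and file names are invented for illustration): a test that the pipeline runs on every proposed change, with the deploy blocked if it goes red, doing the job the change-review meeting used to do.

    # test_discount.py -- hypothetical sketch; the CI job (Jenkins, Travis, etc.)
    # runs pytest on every commit and refuses to deploy when a test fails,
    # replacing the approval meeting with an automated, repeatable check.

    def apply_discount(price: float, percent: float) -> float:
        """Toy production function under test (invented for this example)."""
        return max(price * (1 - percent / 100), 0.0)

    def test_discount_never_goes_negative():
        # The class of human error a manual change review used to catch by hand.
        assert apply_discount(price=10.0, percent=150) == 0.0

The point is not the test itself but that the protection now lives in code, so it runs on every change and scales with the rate of change.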

Monitoring is another thing that should be ported, though as a software engineer, you would be forgiven for wanting to bear witness to its demise. Monitoring has always been more or less a hardware-focused activity. It lurked in the depths of the Ops silo, and typically it required a great deal of static configuration to define things like individual servers, which are often short-lived and ephemeral at scale. At first glance it appears that scaling monitoring costs a monumental amount of time spent automating things like service discovery, in exchange for being kept apprised of CPU loads, and other metrics of questionable value.
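To see where that automation time goes, compare the static configuration that classic monitoring expects with discovery against whatever registry the platform already exposes. This is a purely illustrative sketch: the registry URL and JSON fields are assumptions, not any particular product’s API.

    # Hypothetical sketch: instead of a hand-edited list of servers to monitor,
    # poll a service registry so instances that appear or vanish are picked up
    # automatically. The endpoint and field names below are invented.
    import json
    import urllib.request

    STATIC_HOSTS = ["web-01.example.com", "web-02.example.com"]  # goes stale at scale

    def discover_hosts(registry_url: str = "http://registry.internal/v1/services/api") -> list:
        """Return the addresses currently registered for a service."""
        with urllib.request.urlopen(registry_url) as resp:
            instances = json.load(resp)
        return [i["address"] for i in instances]

    # The monitoring loop then targets whatever discover_hosts() returns right now,
    # rather than whatever someone last typed into a config file.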

I could wax poetic at this point about what monitoring at scale looks like: the invaluable insight that a correctly implemented system provides, and the priceless hours of yak shaving it saves you later on. But let’s go into a specific example:

A few engineers are troubleshooting a latency issue in an API this morning. The monitoring system alerts the engineers when a latency metric in the API trips a threshold and, after clarifying the route the data in question took, the engineers notice an overabundance of connections to a single memcache node.

Ben: Hey Jared, unicorns queued is not related to JD, correct?

Jared: Right. The metrics -> jackdaw route goes only through NGINX. Seems like a large memcache blip.

Distributed applications are balanced equations. Problems often manifest themselves as an imbalance among our services (in this case, an influx of requests). Troubleshooting in our world often consists of detecting imbalances like this one and tracking them back to their source before they have a chance to throw other services out of balance and do real damage.

As you can see, the telemetry data is invaluable in this respect, and engineers rely on it heavily. In fact, whenever they talk to each other about problems, or about behavior that puzzles them, more often than not data is part of that conversation:

Collin: Jared, did you twiddle something with the jackdaw elb timeout settings about a day ago? I seem to remember hearing something about that and was wondering if it could explain that graph.

[graph omitted]

Jared: We did increase JD capacity yesterday from 6 -> 9

In these scenarios, the telemetry data is heavily skewed toward metrics like latency, request cardinality and queue behavior – not the kind of data you typically get from the turnkey monitoring agents of those venerable monoliths fermenting at the bottom of the ops silo. That’s to be expected, because scale wouldn’t allow it. Rather, monitoring capabilities must be built into the application itself using a loose-knit collection of language bindings and aggregation tools for truly effective and efficient visibility. The data is then pushed to a scalable, centralized telemetry service where engineers can visualize and alert on it. You can measure exactly what you need, in the way that makes the most sense, and best of all, the monitoring is just code. Your organization can rely on it, because you know exactly how it works (having implemented it yourself).
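As a concrete, purely illustrative sketch of that kind of in-application instrumentation, here is what emitting latency and request counts might look like with the open-source statsd Python client pushing to a local aggregator; the metric names, prefix, and endpoint are assumptions for the example, not the article’s actual setup.

    # Minimal sketch of instrumentation living in the application itself.
    # Assumes a StatsD-compatible aggregator on localhost:8125 that forwards
    # metrics to a centralized telemetry service for graphing and alerting.
    import time
    from statsd import StatsClient

    metrics = StatsClient(host="localhost", port=8125, prefix="api")

    def handle_request(route, backend_call):
        """Serve one API request while emitting the metrics engineers alert on."""
        start = time.time()
        try:
            result = backend_call()
            metrics.incr(route + ".requests")               # request counts
            return result
        except Exception:
            metrics.incr(route + ".errors")                 # failures, too
            raise
        finally:
            elapsed_ms = (time.time() - start) * 1000.0
            metrics.timing(route + ".latency", elapsed_ms)  # latency in milliseconds

Because the instrumentation is just code, the alert that kicked off the memcache investigation above is nothing more than a threshold on the latency series in the centralized service.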

We aren’t alone. Large customer-facing distributed applications like Slack that are experiencing exponential growth feel acutely both the need for operational visibility, and the ever-present demands of scale. (See here how Slack uses Librato by SolarWinds Cloud to keep users chatting).

Like everything else, scale brings out the best in monitoring. Teams with the know-how to embrace metrics-driven development and scale their monitoring into their codebase will spend less time mired in yak fur, and more time building and running world-class systems that scale.

Dave Josephsen is the developer evangelist for SolarWinds, where he focuses on real-time cloud monitoring solution Librato, and hacks on tools and documentation, writes about statistics, systems monitoring, alerting, metrics collection and visualization, and generally does anything he can to help engineers and developers close the feedback loop in their systems. He’s written books for Prentice Hall and O’Reilly, speaks shell, Go, C, Python, Perl and a little bit of Spanish, and has never lost a game of Calvinball.
