
Best Practices for Efficiently Preventing and Managing Incidents

May 26, 2016

Featured article by Akhil Sahai, Ph.D, VP Product Management, Perspica

There are few things worse for an IT Operations executive than a mission-critical application outage. The operations team must address it immediately, and as the clock ticks, the bottom line shrinks. The pressure is on to reduce the mean time to repair (MTTR), but when the application is up again, it’s time to consider how this fire drill could have been avoided.

It’s not that the team received no alerts. In fact, just the opposite – the alarms came in like a flood. As one alarm went off, so did another, and another, until the actual cause of the incident became buried in a sea of red. The snowballing effect moved quickly from application performance degradation to a major outage. By the time the incident was brought under control, the company could have lost, on average, as much as $750,000 for a 90-minute outage, according to the Ponemon report, “Cost of Data Center Outages,” in addition to losing face and damaging brand value.

A post-mortem is difficult at this point. The status quo is for subject matter experts to gather in a war room and comb through multiple product consoles and logs to identify the cause of an incident, with all the finger-pointing and buck-passing inherent in the process. This method puts companies at a severe disadvantage. Below, we highlight how machine learning-based root-cause analytics and predictive analytics are helping organizations prevent such incidents, dramatically reduce mean time to repair and protect brand reputation.

Digital Growing Pains

A digital-first mentality is necessary for today’s complex and connected landscape, meaning that organizations must design, resource and deploy IT accordingly. This requires teams to manage unparalleled amounts of data while predicting and preventing outages in real time, all while delivering agile, reliable applications. The problem is that most organizations must tap several siloed vendor tools for the monitoring, identification, mitigation and remediation of incidents, and then hope those tools speak to each other – which traditionally hasn’t happened.

The transition to new architectures and from physical to hybrid and multi-cloud environments makes it nearly impossible for IT administrators to keep up with the multitude of objects, each with thousands of metrics generating data in near-real time.

New approaches must be employed to provide intelligence in order to ensure availability, reliability, performance and security of applications in today’s digital, virtualized and hybrid-cloud environments. Automated, self-learning solutions that analyze and provide insight into ever-changing applications and infrastructure topologies are essential in this transformation.

Defining Our Terms

Vendors are careful to add catchphrases like “machine learning” and “big data” to their marketing materials because these phrases can convince customers that a product tackles the complex and dynamic needs of application performance. However, what vendors say they have and what they actually mean by those phrases don’t always jibe. Let’s define our terms:

– Domain Knowledge: The domain knowledge in TechOps and DevOps helps answer questions like: What just happened? What caused it? How do we remediate it? How do we not let it happen again?

– Big Data Architecture: The ability to handle huge volumes of data, both structured and unstructured, in an automated, highly scalable way using open source technologies.

– Machine Learning: This technology is about much more than visualizing data. Vendors throw the term “machine learning” around, but it’s often a mischaracterization. Machine learning refers to self-learning algorithms, supervised or unsupervised, that can be based on neural networks, statistics or digital signal processing, among other techniques.
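To make the distinction concrete, here is a minimal, illustrative sketch of an unsupervised detector: it learns a baseline for “normal” from the metric stream itself, with no labeled examples and no hand-set per-metric thresholds. The class name, window size and threshold are hypothetical choices for illustration, not taken from any particular product.

```python
from collections import deque
from math import sqrt

class ZScoreDetector:
    """Flags metric samples that deviate sharply from a rolling baseline.

    A toy stand-in for the self-learning, unsupervised algorithms described
    above: 'normal' is learned from the data itself rather than configured.
    """

    def __init__(self, window=30, threshold=3.0):
        self.window = deque(maxlen=window)  # recent samples form the baseline
        self.threshold = threshold          # std-devs of deviation that count as anomalous

    def observe(self, value):
        """Return True if `value` is anomalous relative to the recent window."""
        anomalous = False
        if len(self.window) >= 10:  # need some history before judging
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        if not anomalous:
            self.window.append(value)  # only normal samples update the baseline
        return anomalous

detector = ZScoreDetector()
for latency_ms in [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99]:
    detector.observe(latency_ms)
print(detector.observe(450))  # a latency spike stands out against the baseline
```

Production systems replace the z-score with far richer models, but the shape is the same: the algorithm, not an administrator, decides what “normal” looks like for each of thousands of metrics.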

Five Features to Prevent and Manage Incidents

No company, no matter how big or small, can afford an application outage, but companies need to look past slick marketing when evaluating solutions. Before moving forward with a solution, keep a few points in mind:

Scalability: The solution needs to scale to handle millions of objects; legacy solutions are not adequate for today’s big data.

Automation: Administrators and IT support staff handle the day-to-day tasks. The solution needs to be automated in a way that quickly pinpoints the root cause of a problem and identifies how to fix it, rather than relying on expensive domain experts.

Clarity: IT workers can become fatigued and apathetic, so when the system is actually disrupted, no one is paying attention. Instead of just alarms, identify solutions that provide answers and help determine exactly what needs immediate attention.

Remediation: Pooling knowledge of in-house subject matter experts is difficult and time-consuming when incidents do happen. Having access to vendor knowledge bases, discussion forums and the latest state-of-the-art technologies is important. Look for solutions that can curate tribal knowledge for repeatability but also can integrate crowdsourced knowledge into the mix.

Preventing Outages: The key to preventing outages is to predict issues before they become problems rather than relying on traditional monitoring tools that trigger alerts only after a problem has already occurred. Look for a solution that can alert you to anomalous trends or potentially dangerous issues before they impact your application.
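As a sketch of the “predict issues before they become problems” idea, the snippet below fits a least-squares line to recent samples of a metric (say, disk usage) and estimates how many sampling intervals remain before the trend crosses a threshold. The function name and numbers are illustrative; real predictive-analytics products use far more sophisticated forecasting.

```python
def steps_until(samples, threshold):
    """Fit a line to recent metric samples and estimate how many future
    sampling intervals remain before the trend crosses `threshold`.
    Returns None if the trend is not rising toward the threshold.
    """
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var                              # least-squares slope
    last_fit = mean_y + slope * ((n - 1) - mean_x) # fitted value at the latest sample
    if slope <= 0 or last_fit >= threshold:
        return None
    return (threshold - last_fit) / slope

# Disk usage creeping up ~1% per interval; alert well before it hits 90%.
usage = [70, 71, 72, 73, 74, 75, 76, 77]
print(steps_until(usage, 90))  # -> 13.0 intervals of headroom remain
```

The contrast with traditional monitoring is the point: a threshold alert fires only after 90% is reached, whereas trend extrapolation gives the team thirteen intervals of warning to act.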

Overcoming Alert Fatigue

It’s absolutely necessary to receive alerts, but the alert overload most IT teams are experiencing ends up being counter-productive. Staff develop alert fatigue, so that when a true emergency hits, they aren’t paying attention – just like the story of the boy who cried wolf. Fortunately, today’s next-generation solutions provide operations monitoring and analytics that deliver real-time answers and recommendations for remediation. Finally, IT teams will be able to hear the real wolf howling amidst all the noise.
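One simple way such tools cut through the noise is by correlating related alarms into incidents. The toy sketch below (the function name and 60-second window are illustrative assumptions) groups alerts that arrive close together in time, so responders triage a handful of incidents instead of a sea of red.

```python
def group_alerts(alerts, window=60):
    """Collapse a flood of alerts into incidents.

    Alerts arriving within `window` seconds of the previous alert are
    treated as one cascading event. `alerts` is a time-sorted list of
    (timestamp_seconds, message) tuples.
    """
    incidents = []
    for ts, msg in alerts:
        if incidents and ts - incidents[-1][-1][0] <= window:
            incidents[-1].append((ts, msg))   # part of the ongoing cascade
        else:
            incidents.append([(ts, msg)])     # a new, distinct incident
    return incidents

flood = [(0, "app latency high"), (5, "db connections saturated"),
         (12, "disk I/O wait"), (500, "cert expiring soon")]
print(len(group_alerts(flood)))  # -> 2 incidents instead of 4 raw alerts
```

Real correlation engines also use topology and causal analysis, not just timing, but even this crude grouping shows how four alarms become two actionable incidents.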

About the Author:

Akhil Sahai is an accomplished management and technology leader with 20+ years of experience at large enterprises and startups. Akhil comes to Perspica from HP Enterprise, where, as Sr. Director of Product Management, he envisaged, planned and managed the Solutions Program. At Dell, as Director of Products, Akhil led product strategy and management for Dell’s Converged Infrastructure product line. He also led Gale Technologies, as VP of Products, to its successful acquisition by Dell. Prior to that, at Cisco, he handled business development for the VCE Coalition, and at VMware he managed global product strategy and management for vCloud software, with a focus on applications and the Virtual Appliances product line. He has published 80+ peer-reviewed articles, authored a book, edited another, and chaired multiple international IEEE/IFIP conferences. He has filed 20 technology patents (with 16 granted). He holds a Ph.D. from INRIA, France, and an MBA from the Wharton School.


