The Future of IT Troubleshooting

March 21, 2017

Featured article by Kong Yang, Head Geek™, SolarWinds

A common approach to addressing performance issues in IT environments is to throw a bunch of potential solutions at the wall and see what sticks. However, IT environments are becoming increasingly complex and varied, making this trial-and-error strategy even less effective than it already was. Here, I will cover what IT troubleshooting should look like in the hybrid IT world we now find ourselves in, and will continue to work in for the foreseeable future.

What is IT Troubleshooting?

Troubleshooting is a core IT skill. It is also a key element of what we at SolarWinds term monitoring as a discipline. The goal of troubleshooting is to drill down to the primary issue affecting the performance, delivery, and consumption of an application or service. Without a firm grasp of this skill, IT professionals are unable to gain a deep understanding of the underlying cause and effect of any incident. However, troubleshooting multi-layer IT issues often cuts across functional silos within the larger organization. Technologies like cloud, hybrid IT, virtualization, and hyperconvergence have fundamentally transformed IT, making troubleshooting more critical and, at the same time, more complex than ever.

IT Troubleshooting Basics

To understand what the future of troubleshooting looks like, and why it’s more important than ever, we should first cover the basic steps of troubleshooting. These eight fundamental steps are applicable to any IT professional, any organization, and any IT environment:

1. Define the problem
2. Gather and analyze relevant information
3. Construct a hypothesis or probable cause
4. Devise a plan to remediate
5. Implement the plan
6. Observe the results, and rerun the plan to reproduce or reverse-engineer them
7. Repeat steps 2-6 as necessary
8. Determine the root cause and document it
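
As a rough illustration, the steps above can be sketched as a loop in Python. This is a minimal sketch and not a SolarWinds tool or API: the names `troubleshoot`, `Hypothesis`, and `Incident`, the toy `state` dictionary, and every number in it are assumptions invented for the example. It also simplifies by treating the first hypothesis whose fix resolves the problem as the root cause, which real incidents do not always justify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Hypothesis:
    cause: str                     # the probable cause being tested
    remediate: Callable[[], None]  # the remediation plan, run when called


@dataclass
class Incident:
    problem: str                                  # the defined problem
    log: List[str] = field(default_factory=list)  # record kept for documentation
    root_cause: Optional[str] = None


def troubleshoot(
    problem: str,
    gather: Callable[[], Dict[str, float]],
    hypothesize: Callable[[Dict[str, float]], Hypothesis],
    is_resolved: Callable[[], bool],
    max_rounds: int = 5,
) -> Incident:
    """Run the gather / hypothesize / remediate / observe cycle until resolved."""
    incident = Incident(problem)
    for round_no in range(1, max_rounds + 1):
        evidence = gather()                 # gather and analyze information
        hypothesis = hypothesize(evidence)  # construct a probable cause
        hypothesis.remediate()              # devise and implement the plan
        resolved = is_resolved()            # observe the results
        incident.log.append(
            f"round {round_no}: tried '{hypothesis.cause}', resolved={resolved}"
        )
        if resolved:
            incident.root_cause = hypothesis.cause  # determine and document
            break
        # Not resolved: repeat with fresh evidence on the next pass.
    return incident


# Hypothetical environment: a slow application whose real problem is CPU.
state = {"response_ms": 400.0, "wan_latency_ms": 80.0, "cpu_pct": 95.0}
tried: List[str] = []


def hypothesize(evidence: Dict[str, float]) -> Hypothesis:
    if "network" not in tried:
        tried.append("network")
        # Fixing the network lowers latency but leaves response time unchanged.
        return Hypothesis("WAN latency", lambda: state.update(wan_latency_ms=20.0))
    tried.append("compute")
    return Hypothesis(
        "CPU saturation on the app tier",
        lambda: state.update(cpu_pct=40.0, response_ms=120.0),
    )


incident = troubleshoot(
    "application is slow",
    gather=lambda: dict(state),
    hypothesize=hypothesize,
    is_resolved=lambda: state["response_ms"] < 200.0,
)
for line in incident.log:
    print(line)
print("root cause:", incident.root_cause)
```

The `max_rounds` cap is a design choice for the sketch: it keeps the loop from running forever when no hypothesis pans out, in which case `root_cause` stays `None` and the log still records what was tried.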

Although simple, these steps remain consistent regardless of whether you’re dealing with traditional on-premises infrastructure, hybrid IT, or even a DevOps-centric scenario.

What has changed, however, is the volume and velocity of change in technology and services, which affects the rules of engagement for IT professionals. We are consistently short on time; there are never enough hours in the day. So, with the speed and amount of change in the technology we’re managing, monitoring, and remediating in what are often siloed roles, it’s important to take a fresh look at the tools of the trade when it comes to troubleshooting.

Troubleshooting in a Hybrid IT Environment

As discussed, troubleshooting involves roughly the same sequence of steps across almost any environment. However, because hybrid IT has become the norm, I’ll focus the remainder of my commentary on troubleshooting across hybrid IT environments. In fact, according to the SolarWinds IT Trends Report 2016, a mere 9 percent of organizations have not migrated any infrastructure or applications to the cloud, yet 60 percent stated they will likely never transition all services offsite.

Consider the following example:

Envision a tiered application with some compute and memory resources running as virtual machines on-premises, and others serving as web and application layers in the cloud, hosted by a provider such as Amazon® Web Services (AWS®).

When a ticket comes in saying the application is slow, the first administrator who sees the ticket is likely responsible for only one portion of the stack. So the ticket might be first serviced by the application team because it’s associated with their application. However, once the application administrator begins troubleshooting, they may realize it’s not an application issue based on the performance logs, response times, lack of event anomalies, etc., so the ticket gets passed to the network team.

The network team hopefully has the necessary tools to see network performance across all providers, from the internal data center to the cloud service provider—in this case, AWS. With this visibility, they can examine hops and may determine that there is some latency, but it might not be the cause of the original application degradation described in the ticket.

The ticket then gets passed to the infrastructure team, which doesn’t have time to troubleshoot the root cause, but can see and isolate the current symptoms, providing a temporary fix. Unfortunately, the root cause is neither fully identified nor resolved.

A key problem in this example, and one that applies to many of today’s IT departments, is that the collective IT organization doesn’t have the capability to cut through the layers of the application stack and quickly surface a single point of truth for applications.

Thus, as technology constructs become increasingly distributed, complex, and even unintentionally siloed, we face the challenge of ensuring application or service performance regardless of its architecture and delivery. Troubleshooting abilities must evolve to allow us to deliver a positive experience for end-users by getting to the root cause of issues quickly and understanding it better.

The future of IT troubleshooting is one in which monitoring as a discipline and troubleshooting tools encourage cross-team collaboration.

The future of IT troubleshooting requires an entirely new way of visualizing and correlating IT monitoring data to improve troubleshooting of performance issues across the IT environment, from infrastructure to networking to applications, and from on-premises to cloud service providers.

The future of IT troubleshooting is the ability to easily combine and correlate time-series metrics as well as historical performance metrics from multiple hybrid IT data sources, including applications, compute, network, storage, virtualization, web, and the cloud, into a single dashboard to visualize their relationships in ways never before possible. For example, you could chart network latency and bandwidth data from inside and outside an IT organization’s firewall together with compute metrics from cloud virtual machines, like Amazon EC2® instances, to troubleshoot application performance issues.
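
To make the idea of correlating metrics from different sources more concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the metric names (`app_response_ms`, `wan_latency_ms`, `vm_cpu_pct`), the timestamps, and all sample values are made up, and the helper functions `align` and `pearson` are hypothetical names written for this example, not part of any monitoring product. The sketch lines up samples from three sources on a shared timeline and computes how strongly each candidate metric moves with application response time.

```python
from math import sqrt
from typing import Dict, List, Tuple

Series = Dict[int, float]  # timestamp in seconds -> sampled value


def align(*series: Series) -> Tuple[List[int], List[List[float]]]:
    """Keep only the timestamps that every series has a sample for."""
    common = sorted(set.intersection(*(set(s) for s in series)))
    return common, [[s[t] for t in common] for s in series]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is flat."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


# Invented sample data from three imagined sources, one sample per minute.
app_response_ms: Series = {0: 120, 60: 130, 120: 400, 180: 420, 240: 125}
wan_latency_ms: Series = {0: 20, 60: 21, 120: 22, 180: 20, 240: 21}
vm_cpu_pct: Series = {0: 35, 60: 38, 120: 92, 180: 95, 240: 36, 300: 37}

timeline, (response, latency, cpu) = align(
    app_response_ms, wan_latency_ms, vm_cpu_pct
)

# Rank the candidate metrics by how closely they track response time.
scores = {
    "wan_latency_ms": pearson(response, latency),
    "vm_cpu_pct": pearson(response, cpu),
}
for name, score in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {score:.2f}")
```

In this invented data set, the cloud VM CPU metric tracks the response-time spike far more closely than WAN latency does, which is the kind of relationship a combined view is meant to surface. A strong correlation only points to where to look next; it does not by itself establish the root cause.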

This is the future of IT troubleshooting.

Conclusion

The troubleshooting process can be more convoluted than ever before, often requiring collaboration among many different functional silos within IT organizations and beyond, such as cloud service providers. While the basic principles of troubleshooting remain applicable in our brave new world, the tools we use to aid us in those steps must evolve to address the challenge.