Using Network Automation to Automate Incident Resolution

November 17, 2022

In enterprise networks, thousands of incident tickets are opened each month. These incidents and their associated tickets are often the result of point-in-time thresholds being exceeded or one-time events that aren’t correlated to a chronic problem. Network automation can save a great deal of time for a NOC that is often inundated with these requests. Gartner’s Chris Saunderson recently noted that even simple automations can deliver substantial rewards. However, choosing the right types of tickets as candidates for automation is essential. In this blog post, we’ll cover how to identify those tickets and provide some process guidelines to help ensure your success.

Starting the Journey

The first step is to identify incidents that are good candidates for automation, so that the project delivers a return on investment from its inception. Incident tickets are resolved with a combination of business and troubleshooting processes. We’ll want to avoid complexities on the business process side, like engaging customers via phone or logging into vendor portals. These tasks can be quite difficult to code and are subject to change. For example, a vendor may redesign their portal, rendering the automation obsolete. If the vendor offers an API for their portal, automating against it could make sense, but the goal is to be smart in the decision-making process early on. We’ll also want to avoid complexities in the troubleshooting process, such as logging into multiple devices or taking actions that have the potential to adversely impact the network.

The best candidates will have a well-defined, straightforward process that is unlikely to change. One example is a one-off interface bandwidth utilization incident that clears when the traffic flow ceases. In this case, an engineer typically logs into the device, ensures that the bandwidth utilization no longer exceeds the threshold, monitors the utilization for a short time, and then proceeds with ticket closure. Another example is an incident triggered by a device reboot. An engineer will typically log in, look up the reason for the reload, and enter it into the ticket.

These are both fairly simple processes. In summary, keep your first initiatives simple. The goal is to provide a quick return on investment to generate stakeholder support and learn new skills for future endeavors that are more complex.
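To make the bandwidth utilization example concrete, here is a minimal Python sketch of the check an engineer performs by hand. It is only an illustration: the function name, the "close" and "escalate" return values, and the idea of passing in a `read_utilization_pct` callable are all assumptions, and the callable stands in for whatever device query your environment actually uses.

```python
import time
from typing import Callable


def bandwidth_ticket_action(
    read_utilization_pct: Callable[[], float],
    threshold_pct: float,
    polls: int = 5,
    interval_s: float = 0.0,
) -> str:
    """Poll utilization a few times and suggest what to do with the ticket.

    Returns "close" only if every sample is at or below the threshold;
    otherwise returns "escalate" so an engineer takes over.
    """
    if polls < 1:
        raise ValueError("polls must be at least 1")
    for i in range(polls):
        if read_utilization_pct() > threshold_pct:
            return "escalate"  # still exceeding the threshold
        if interval_s > 0 and i < polls - 1:
            time.sleep(interval_s)  # wait between samples while monitoring
    return "close"


# Canned readings stand in for a real device query (hypothetical values).
readings = iter([42.0, 38.5, 40.1])
action = bandwidth_ticket_action(lambda: next(readings), threshold_pct=80.0, polls=3)
```

Keeping the device query behind a callable means the decision logic can be tested without touching the network, and the sketch never makes a change on the device.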

Gathering and Classifying Ticket Data

In order to identify the right candidates, we need to understand the types of tickets our organization resolves frequently. This process starts with gathering several months of ticketing data. Here are a few links to help with that task: ServiceNow List Export, BMC Export. I would recommend exporting 60 to 90 days’ worth of data.

Once exported, we’ll start by classifying tickets along four criteria:

Device Type – What type of device has experienced a failure? Examples include: Router, Switch, Firewall, Wireless LAN Controller, Access Point, SD-WAN Controller, etc.

Failure Type – What type of failure has the device experienced? Examples include: IGP Neighborship Down, Device Unreachable, Interface Bandwidth, WLC AP Failover, etc.

Cause Code – What caused the failure to occur? Examples include: Power Loss, Adjacent Device Reboot, System Crash, Carrier Issue, Physical Damage, etc.

Resolution Code – What was done to correct the failure? Examples include: Device RMA, Cleared before isolation, No trouble found, Interface bounce, On-site assistance, etc.

Ideally, you should be tracking this information prior to starting this endeavor. If you’re not, I would recommend implementing these classifications as part of your incident management process. Then revisit your incident automation efforts after a few months.

Here are a few examples of individual tickets being classified with the criteria mentioned above:

Example Incident Ticket 1:

Device Type: Switch
Failure Type: Interface Down
Cause Code: Neighbor Reboot
Resolution Code: No Trouble Found

Example Incident Ticket 2:

Device Type: WLC
Failure Type: AP Failover
Cause Code: Adjacent Device Reboot
Resolution Code: Interface Bounce

Example Incident Ticket 3:

Device Type: Router
Failure Type: Device Offline
Cause Code: Device Reboot
Resolution Code: Power Restored

These criteria should be considered the initial basis for classifying tickets before parsing and sorting, but don’t rule out other criteria that may be helpful as well, such as OS version, model, and location.
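If the export is a CSV file, the four classification criteria could be loaded into a small record type like the one sketched below. The column names (`device_type`, `failure_type`, `cause_code`, `resolution_code`) are assumptions chosen for illustration; a real ServiceNow or BMC export will use its own field names, so adjust the mapping accordingly. The sample rows are the three example tickets above.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class TicketClassification:
    device_type: str
    failure_type: str
    cause_code: str
    resolution_code: str


def load_classifications(csv_text: str) -> list[TicketClassification]:
    """Parse exported ticket rows into classification records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        TicketClassification(
            device_type=row["device_type"].strip(),
            failure_type=row["failure_type"].strip(),
            cause_code=row["cause_code"].strip(),
            resolution_code=row["resolution_code"].strip(),
        )
        for row in reader
    ]


# Hypothetical export containing the three example tickets.
EXPORT = """device_type,failure_type,cause_code,resolution_code
Switch,Interface Down,Neighbor Reboot,No Trouble Found
WLC,AP Failover,Adjacent Device Reboot,Interface Bounce
Router,Device Offline,Device Reboot,Power Restored
"""

tickets = load_classifications(EXPORT)
```

Additional criteria such as OS version, model, or location could be added as extra fields on the same record.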

Ticket Type Groups

With each ticket sufficiently populated with sorting criteria, we can begin the process of identifying the best candidates for automation. This is done by grouping tickets by troubleshooting processes into “Ticket Types”. Sometimes a ticket type will span multiple device types or failure types because the same troubleshooting processes were used to resolve the ticket.

Once our tickets are sufficiently grouped, we’ll get a count of each group. I would recommend familiarizing yourself with Excel Pivot Tables, as they make this task much easier.
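If you would rather script the count than build a pivot table, the same tally can be sketched in a few lines of Python. This assumes each ticket has already been labeled with a ticket type during the grouping step; the ticket IDs and type names below are made up for illustration.

```python
from collections import Counter

# Hypothetical tickets, each already labeled with a ticket type.
labeled_tickets = [
    {"id": "INC001", "ticket_type": "Bandwidth threshold cleared"},
    {"id": "INC002", "ticket_type": "Device reboot reason lookup"},
    {"id": "INC003", "ticket_type": "Bandwidth threshold cleared"},
    {"id": "INC004", "ticket_type": "Carrier circuit outage"},
    {"id": "INC005", "ticket_type": "Bandwidth threshold cleared"},
    {"id": "INC006", "ticket_type": "Device reboot reason lookup"},
]

# Count tickets per type, then list the types from most to least frequent.
type_counts = Counter(t["ticket_type"] for t in labeled_tickets)
ranked_types = type_counts.most_common()
```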

Next, you’ll want to sort the ticket types by count. Specifically, focus on the top ten or twenty. For each of those ticket types that you’ve isolated, assign a combined business process and technical complexity score. For now, a binary complexity score of “High” or “Low” is sufficient. Remove the “High” complexity tickets from consideration.

You should now have a handful of low complexity ticket types with a high enough volume to provide sufficient ROI. We’ll refer to these as our “Low Hanging Fruit”.
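The sort-and-filter step above can be sketched as a small function. The counts and complexity scores here are invented for illustration, and treating an unscored ticket type as "High" is an assumption made to keep the sketch conservative.

```python
from collections import Counter


def low_hanging_fruit(
    counts: Counter, complexity: dict[str, str], top_n: int = 10
) -> list[tuple[str, int]]:
    """Keep the low-complexity ticket types among the top_n by volume."""
    top = counts.most_common(top_n)
    # Ticket types without a score are treated as "High" and excluded.
    return [(name, n) for name, n in top if complexity.get(name, "High") == "Low"]


# Hypothetical volumes and complexity scores.
counts = Counter(
    {
        "Bandwidth threshold cleared": 120,
        "Carrier circuit outage": 95,
        "Device reboot reason lookup": 60,
        "AP failover": 4,
    }
)
complexity = {
    "Bandwidth threshold cleared": "Low",
    "Carrier circuit outage": "High",
    "Device reboot reason lookup": "Low",
}

candidates = low_hanging_fruit(counts, complexity, top_n=3)
```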

Reviewing the Troubleshooting Process

Once you’ve identified the first ticket type that you’d like to automate, you’ll want to carefully review internal documentation and a few dozen tickets related to this type of failure. This will help ensure that you have a full understanding of the troubleshooting process. As you’re performing this review, keep the following questions in mind:

Are engineers deviating from the established process in some tickets? If so, why?
What actions are they taking, and what order are they taking them in?
How is the state of the ticket transitioning as it is worked (on hold, in progress, resolved, closed, etc.)?
What types of cause and resolution codes are being used? Why are those codes being chosen?
What information are engineers including in the ticket notes? Are those notes internally or externally facing?
What types of diagnostic information are being added to the ticket?
Are impacting actions being taken in the troubleshooting process? How does the engineer know when it’s appropriate to take such a step? What are the risks of taking these actions via automation? Can those risks be mitigated?

When you have finished this review process, you should have a solid understanding of the tasks required to resolve this ticket type. I would advise against automating impactful actions like shutting down neighborships or bouncing interfaces. A seemingly insignificant oversight, amplified by the power of automation, has the potential to cause serious harm. Automate such an action only if you’re very confident that it will not harm the network.
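One way to enforce this caution in early automations is to vet every command before it is sent to a device. The sketch below is an assumption-laden illustration: it treats any command that begins with `show ` as read-only, which suits Cisco-style CLIs but would need adjusting for other platforms, and the function names are hypothetical.

```python
# Assumption: on the target platform, read-only commands begin with "show ".
READ_ONLY_PREFIXES = ("show ",)


def is_read_only(command: str) -> bool:
    """Return True if the command matches a known read-only prefix."""
    return command.strip().lower().startswith(READ_ONLY_PREFIXES)


def vet_commands(commands: list[str]) -> list[str]:
    """Return the commands unchanged, or raise if any could alter the device."""
    rejected = [c for c in commands if not is_read_only(c)]
    if rejected:
        raise PermissionError(f"refusing potentially impacting commands: {rejected}")
    return list(commands)


safe = vet_commands(["show version", "show interfaces"])
```

A list containing something like `shutdown` would raise `PermissionError` before anything reaches the network, leaving the decision with an engineer.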

Conclusion

In summary, identifying the best candidates for network incident automation will increase your project’s likelihood of success. Ensure that you’re working on fertile ground by performing a few months’ worth of ticket analysis. Then ensure that you’re automating low-complexity processes so as not to get bogged down in development. Additionally, take care not to take too many risks when you start out. Network to Code wishes you the best of luck; we look forward to seeing you develop the automations of tomorrow.

-Chris
