When It Comes to Automation, It’s (Still) About the Culture

A prior NTC blog post by Tim Fiola laid out the case that it is actually a company’s culture, not its technical prowess, that enables automation to take hold.

This blog will dig a bit deeper into that cultural component and discuss how the culture around automation helps a company derive real value. We will discuss two primary aspects:

  • How the Network Engineer’s role changes when adopting an automation-first approach
  • How an organization could be structured to derive the most value from automation

The Network Engineer’s Role Will Change, Not Disappear

One common misconception that network services organizations share is that Network Engineers are now obsolete and that automation will eliminate their jobs or require them to become developers. This is patently false.

There will always be a need for people who understand the IT infrastructure and network at a technical level.

The truth is that while the Network Engineer’s role will need to change to adapt to an automated environment, the Network Engineer’s knowledge and skills are still very much needed. This blog describes what that means in practice.

Network Engineers Do Not Need to Become Developers

Oftentimes network engineering leadership, and Network Engineers themselves, assume that Network Engineers need to be completely re-skilled. Repeat with us: Network Engineers do not NEED to become software developers. In reality, a firm’s automation evolution does not necessarily mean that its Network Engineers will have to become Developers, Network Reliability Engineers, or Site Reliability Engineers. Network and software engineering are their own skill sets and need to be respected as such.

While it is true that there are additional skill sets required to run an automated IT infrastructure, the nuance here comes from the company’s size, capabilities, and ultimate goals. For example, can the company create and nurture the required skill sets in-house, or will it need to take on partners to fulfill some roles required to run an automated infrastructure?

A Network Engineer’s responsibilities in an automated environment don’t require that they become a full-time developer. At a minimum, however, those responsibilities will likely require that the Network/IT Engineer learn design-oriented thinking: considering how to break larger workflows into smaller, more manageable, and repeatable parts. Oftentimes Network Engineers develop this type of thinking when they learn basic Python, another coding language, or a programmatic tool.

Growing your skill set incrementally is an important part of your career path: it is an opportunity to maintain your relevance.

It is important here to draw an explicit distinction between learning a coding language, such as Python, versus becoming a full-time Developer: learning new incremental skills is important in any existing career path. There is no hard requirement to become a Developer, which is an entirely different career path.
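As a small illustration of the design-oriented thinking mentioned above, here is a hypothetical Python sketch that breaks a single “turn up a switch port” task into small, reusable steps. All function names, interface names, and commands are invented for illustration; this is not any particular vendor’s workflow.

```python
# A hypothetical sketch of design-oriented thinking: one large task
# ("turn up a switch port") decomposed into small, reusable steps.
# All names and commands here are invented for illustration.

def validate_request(port: str, vlan: int) -> None:
    """Reject obviously bad inputs before touching anything."""
    if not 1 <= vlan <= 4094:
        raise ValueError(f"VLAN {vlan} is out of range")

def build_config(port: str, vlan: int) -> list[str]:
    """Produce the configuration lines; no device interaction here."""
    return [
        f"interface {port}",
        f" switchport access vlan {vlan}",
        " no shutdown",
    ]

def turn_up_port(port: str, vlan: int) -> list[str]:
    """The workflow is just the small steps composed together."""
    validate_request(port, vlan)
    return build_config(port, vlan)

print(turn_up_port("GigabitEthernet0/1", 100))
```

Each small step can now be tested, reused, and eventually handed to a developer for hardening, which is exactly the kind of decomposition that learning a little Python tends to teach.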

Traditional Network Engineering Disciplines WILL Need to Adapt to an Automated Environment

The traditional Network Engineer who enjoys the CLI or copy/pasting commands for config changes is going to be disappointed. Those tasks will not be required at any kind of scale in an automated environment.

Here are some reasonable day-to-day changes in duties that a Network or IT Engineer can expect on the firm’s automation journey:

  • Examine how to automate existing workflows with design-oriented thinking
  • Employ design-oriented thinking when designing new workflows
  • Become product owners or subject-matter experts (SMEs) for network products/services
  • Write prototype scripts and then work with a developer to
    • Harden them for production
    • Make them scalable
    • Design appropriate automated testing
  • Work with developers to translate configuration and functional device requirements into configuration templates
  • Work with the broader organization to integrate Network/IT Engineering’s automation into a broader automation infrastructure
  • Understand how to improve network and IT infrastructure health with automated solutions
  • Learn enough Python/Ansible/Git/etc. to understand and trust an automated infrastructure
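For instance, the “configuration templates” duty above might look like the following sketch, which uses Python’s standard-library string.Template as a lightweight stand-in for a fuller templating engine such as Jinja2; the template text and variable names are invented.

```python
from string import Template

# A stand-in for a real templating engine (e.g., Jinja2), using only the
# standard library. The template text and variable names are illustrative.
INTERFACE_TEMPLATE = Template(
    "interface $name\n"
    " description $description\n"
    " ip address $address $mask\n"
)

def render_interface(name: str, description: str, address: str, mask: str) -> str:
    """Translate functional interface requirements into concrete configuration."""
    return INTERFACE_TEMPLATE.substitute(
        name=name, description=description, address=address, mask=mask
    )

print(render_interface("Loopback0", "router-id", "10.0.0.1", "255.255.255.255"))
```

The Engineer supplies the network knowledge (what a correct interface stanza looks like); the developer helps make the templating robust and scalable.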

There will always be a need for the Network Engineer’s knowledge; it’s just that the knowledge will be expressed differently.

Notice that each of the tasks above requires an understanding of how a network operates and the mechanics that take place within a network. This brings us to a very important point: there will always be a need for the Network Engineer’s knowledge; it’s just that the knowledge will be expressed differently.

Cultural changes often meet resistance.

Take a look again at the above statement: “the knowledge will be expressed differently.” When you tell someone that the way they go about their job, which is an aspect of their way of life, is going to change, that is a cultural change. Cultural changes can be difficult and oftentimes meet resistance. This is why it’s important for a firm’s leadership to be aware of, foster, and encourage the changes.

How Should the Organization Be Structured and Managed?

The structure of an organization is perhaps the most important factor that will determine how successful an automation transformation will be.

There are several very important, interrelated factors that fold into this structure.

Trust . . .

. . . Across the Firm

Trust across the organization’s different components will ultimately dictate the scope of its automation.

Imagine a Network Engineer writing a Python script or Ansible playbook to add a VLAN to a network device and interface to make their life easier. Adding a VLAN is a task that the Engineer may do multiple times a day, and the script will allow the Engineer to spend perhaps one minute on that task that may have taken seven minutes prior.

Without trust, automation provides localized value in the silos, but it will fail to deliver strategic value back to the company.

The larger picture here deals with a workflow. It is important to focus on workflows because a workflow is how a firm transforms its resources into value. Workflows typically contain many tasks and localized bottlenecks along the way. It is the sum of these tasks and bottlenecks that determines the overall throughput that the workflow can handle.

In order to realize value from automating a given workflow, the company needs to reduce the time for each task and mitigate the bottlenecks.

In this example, the Engineer is adding a VLAN to a device and interface as part of an end-to-end service activation workflow that includes the internal end users, Network Engineering, Capacity Planning, and Procurement departments.

However, the larger company will not benefit from this localized/siloed automation because using a script to quickly add the VLAN to the device does not have a dramatic impact on the end-to-end workflow. Using the script makes the Engineer’s life easier, but doing so has a minimal impact on increasing the throughput of the end-to-end workflow.
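Some rough arithmetic makes the point concrete. The task durations below are invented for illustration, but they show why shaving minutes off one task barely moves an end-to-end workflow:

```python
# Rough, invented numbers for an end-to-end service activation workflow.
# Durations are in minutes.
workflow_tasks = {
    "end user submits request": 10,
    "capacity planning review": 240,
    "procurement": 2880,
    "network engineering: add VLAN": 7,
    "validation and handoff": 60,
}

total_before = sum(workflow_tasks.values())

# Automate only the VLAN task: 7 minutes -> 1 minute.
workflow_tasks["network engineering: add VLAN"] = 1
total_after = sum(workflow_tasks.values())

saved_pct = 100 * (total_before - total_after) / total_before
print(f"End-to-end time: {total_before} -> {total_after} minutes "
      f"({saved_pct:.2f}% saved)")
```

Saving six minutes in a roughly 3,200-minute workflow improves end-to-end throughput by well under one percent; only automating across the whole workflow changes the picture.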

The company at large won’t benefit from siloed automation.

The firm will not benefit unless there is trust and cooperation between groups to automate each task in the complete workflow, so that the time to execute the workflow drops and the number of times the workflow can be executed in a given period increases.

There must be trust and cooperation between groups.

. . . And in the Technology

Another aspect of trust is trust in the technology itself. If the people involved, including the Network/IT Engineers, do not trust the technology, they will not accept it. Understanding not just what the technology does, but how it operates under the hood goes a long way toward building trust in the technology. This article earlier stated that it is likely that a Network Engineer, for example, would need to learn basic Python: building trust in the technology is part of the why.

People won’t trust what they don’t understand.

Process

One of the benefits of starting the automation journey is that it requires in-depth formal process discovery.

Here is a very common example: someone at the company is tasked with automating a given workflow, and so starts to question all the groups involved. Along the way, the person discovers that in step 5 of the workflow, there is a need for a specific IP address to be assigned to a given interface and SOMEHOW the Engineer gets that IP address.

Shadow workflows often pop up when formal workflows are not well-defined; in instances like this, people will just take extra steps to get the work done. These extra steps are typically never documented, which means the company is blind to the full workflow process.

Where did this specific IP address come from? In our example scenario, the IP address was not an input into the original provisioning order. After some more digging, it turns out that the person tasked with provisioning the interface in step 5 goes to a spreadsheet that is maintained by some person in another department who happens to know what IP addresses belong on a given device. Until this examination of the workflow happened, that spreadsheet and the information in it were never part of the documented workflow: it was a part of a shadow workflow.

Shadow workflows often pop up when formal workflows are not well-defined. When this happens, people responsible for executing the workflow often take the initiative to find the required information themselves, but those extra steps often don’t get documented.

Automation is not strictly about automating: it is first about understanding, and then automating.

Since automating a workflow, by definition, requires understanding all the steps, it forces the company to get a full understanding of their workflows. Automation is not strictly about automating: it is first about understanding, and then automating.

Reuse

An automated organization needs to be set up to share and reuse code across silo boundaries. For example, a script to get a subnet in a workflow for a new router deployment can also be used in a workflow to deploy a new server farm. Coordinating automation that will deliver real value back to the company requires a central group to plan and coordinate workflow steps and the methodology to implement those steps in each workflow.
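The subnet example could be sketched as a single shared helper that both workflows import, rather than two silo-local copies. This is a hypothetical illustration using Python’s standard ipaddress module:

```python
import ipaddress

# A shared helper, maintained once and reused across silo boundaries.
def next_subnet(supernet: str, prefixlen: int, in_use: set[str]) -> str:
    """Return the first free subnet of the requested size within a supernet."""
    for candidate in ipaddress.ip_network(supernet).subnets(new_prefix=prefixlen):
        if str(candidate) not in in_use:
            return str(candidate)
    raise RuntimeError("supernet exhausted")

# Workflow A: new router deployment needs a point-to-point /30.
router_subnet = next_subnet("10.0.0.0/16", 30, in_use={"10.0.0.0/30"})

# Workflow B: new server farm reuses the same helper for a /24.
farm_subnet = next_subnet("10.0.0.0/16", 24, in_use=set())

print(router_subnet, farm_subnet)
```

One implementation, two workflows: this is the kind of reuse a central coordinating group makes possible.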

Change Control

Change control is perhaps one of the biggest considerations when transitioning to an automated infrastructure. The central question around automating network changes is this: does an automated change require the same process as a traditional change?

The short answer to that question is likely No. The longer answer goes back to culture:

  • What does change management really need to assess the risk of an automated change?
  • Does an automated change require the same process as a traditional (manual) change?
    • Can the company make a new process for automated changes?
  • Do the parties trust the technology and automated processes?
    • How can they build that trust? (testing, transparency, training, etc.)

Should change management look the same for both manual and automated changes?

All operations resulting in a change should be part of a larger change management strategy that considers:

  • The risk associated with a change
  • The expected impact of a successful or unsuccessful change
  • The tracking and/or auditing of changes
  • The communication of change
  • A rollout and rollback plan
  • The approval workflow and sign-off process
  • Scheduling of changes
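One way to make such a strategy concrete (purely illustrative; the field and class names are invented) is a change record that every change, manual or automated, must populate before scheduling:

```python
from dataclasses import dataclass, field

# Invented sketch: a record every change (manual or automated) must carry,
# mirroring the change-management considerations listed above.
@dataclass
class ChangeRequest:
    summary: str
    risk: str                      # e.g., "low", "medium", "high"
    expected_impact: str
    rollout_plan: str
    rollback_plan: str
    approvers: list[str] = field(default_factory=list)
    automated: bool = False

    def ready_for_scheduling(self) -> bool:
        """Illustrative policy: automated low-risk changes need fewer sign-offs."""
        required = 1 if (self.automated and self.risk == "low") else 2
        return len(self.approvers) >= required

change = ChangeRequest(
    summary="Add VLAN 100 to access switches",
    risk="low",
    expected_impact="No user-visible impact expected",
    rollout_plan="Apply via playbook, one site at a time",
    rollback_plan="Re-run playbook with previous VLAN list",
    approvers=["network-oncall"],
    automated=True,
)
print(change.ready_for_scheduling())
```

The point is not this particular policy but that the process for automated changes can be deliberately different, while still tracking the same risk, impact, and rollback information.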

The prior NTC blog post on culture also covers some other considerations for Human Resources and management.

Wrapping Up

Specific technology references have been conspicuously absent from this blog post. This is because automation is ultimately sustained via a cultural transformation; specific technologies are a secondary consideration. In some cases, it’s about changing what a firm does; in other cases, it’s about changing how the firm goes about doing existing things, like executing workflows. These are cultural changes that need to be fostered and encouraged across the firm, because the changes need to take place across many organizations and levels in the firm.

It is ultimately the cultural realignment within the firm, including the points discussed above, that will determine whether the automation transformation will be successful in the long run.

Arguably, we can consider a large portion of the technical part of the automation journey solved: today, firms can select the right technical components from a wide variety of options to fit into their automation architecture and needs, with more options and improvements being added as time goes on. The technology is sound. It is ultimately the cultural realignment within the firm, including the points discussed above, that will determine whether the automation transformation will be successful in the long run.

Thank you, and have a great day!

-Tim Fiola




Introducing Nautobot v2

Nautobot v2.0 was recently released, and we’re excited to share the new features and important changes it brings to the network automation community! We’re currently hard at work on the next release (v2.1), and by the end of 2023 we will provide some insight into what it and the rest of the v2 release train will bring.

IPAM Enhancements

  • Namespaces: Namespaces have been introduced to provide unambiguous uniqueness boundaries to the IPAM data model. Prefixes and VRFs are now assigned to Namespaces, which allows for a variety of data tracking use cases, but primarily targets overlapping or duplicate address space needs.
  • Prefix and IP Address Relationships: In Nautobot v2, the Prefix and IP Address hierarchy now relies on concrete parent/child relationships. In Nautobot v1, these relationships were calculated dynamically and often led to inconsistent or confusing hierarchies, especially with overlapping address space. This change ensures an always consistent data set and offers several performance improvements in the UI and API.
  • IP Address Interface Assignments: Stemming from the other data model changes, IP Addresses can now be assigned to multiple interfaces to more easily track topologies where this is required. In the past, there were special concessions for Anycast needs; but in v2, you can now intuitively deal with other situations like duplicate loopback addresses or cookie-cutter network segment deployments.

Unified Role Model

The new release consolidates existing role models into a singular, streamlined approach, akin to the handling of Statuses. This change simplifies the management of user-defined roles across DCIM, IPAM, and other areas of the network data model. Like Statuses, users now define the roles they want and which models those roles apply to, in one central location.

Location Model Consolidation

Nautobot v2 phases out the Site and Region models, integrating their functionalities into the Location model. This consolidation streamlines data management and reduces complexity. The Location model allows users to define a hierarchy of Location Types that is specific to their organization. Location Types also define what types of objects can be assigned to those parts of the hierarchy, such as Devices or Racks. The consolidated Location model allows for modeling physical, logical, or a mix of both types of entities. Examples might be tracking countries with assets, or defining logical areas in a data center DMZ.

CSV Import/Export

Updates to CSV functionality include consistent headers across different modules and more relevant data for managing relationships, making data import/export tasks more intuitive and efficient. Nautobot v2.1 will move export operations (CSV and Export Templates) to a system-provided background Job, which will mean users can export large data sets without worry that the operation might timeout.

REST API Improvements

  • Depth Control: This provides enhanced control over query depth in the REST API, which allows API consumers to specify the amount of data and context they need in a given request. This replaces the ?brief query parameter in the API.
  • Version Defaults: New Nautobot v2 installs will now default to the latest version of the REST API, which means consumers can always take advantage of new features by default. Administrators retain the ability to specify a specific version, where required.
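As a small sketch of the depth control described above, the snippet below only builds a request URL with the ?depth query parameter; the hostname is hypothetical and no request is actually sent:

```python
from urllib.parse import urlencode

# Build a Nautobot v2 REST API request URL using the ?depth parameter
# (which replaces ?brief). The hostname is hypothetical; no request is sent.
BASE = "https://nautobot.example.com/api"

def device_list_url(depth: int = 0, **filters: str) -> str:
    """depth=0 returns URLs for related objects; higher values nest them inline."""
    params = {"depth": depth, **filters}
    return f"{BASE}/dcim/devices/?{urlencode(params)}"

print(device_list_url(depth=1, location="ams01"))
```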

Application Development APIs

  • Code Namespace Consolidation: The apps code namespace has been reorganized for better clarity and manageability. Most app development dependencies can now be imported from the nautobot.apps module.
  • Code Reorganization: As part of cleaning up the apps namespace, many items related to apps and in the core project have been relocated, but most things app developers need can be found in the nautobot.apps module.
  • Developer Documentation: We have made several improvements to the overall structure of the developer documentation and will continue to put significant effort into this area throughout the v2 release train and beyond.

Jobs Updates

  • Logging Control: Logging statements in Jobs have changed to offer authors better flexibility and control. Most notably, logging is achieved with the Python standard logging facilities, with special arguments to specify whether the log message should be saved to the JobResult (displayed in the UI) or simply logged to the console.
  • Atomic Transaction Changes: In Nautobot v2, Jobs are no longer run inside an atomic transaction context manager. This means authors now have the choice to make their Job atomic or not, by implementing the context manager themselves. A common dry-run interface is provided, but it is up to the author to implement support, much like Ansible modules.
  • State Management: Similar to the atomic transaction changes, Job authors now have full control over the state of job executions. This means authors are now responsible for explicitly failing a Job, based on their desired logic.
  • File Output and Downloads: Nautobot v2.1 will introduce the capability to generate and output files from Jobs and allow users to download those files in the JobResult’s UI. This capability, built to support the export functionality explained earlier, will be offered to Job authors as an official API.
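The atomic-transaction change can be illustrated in plain Python. Note this is a simplified stand-in: a real Nautobot Job would wrap its work in Django’s transaction.atomic() rather than the toy context manager and fake session below.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("example_job")

# Illustrative stand-in for a database session; a real Job would rely on
# Django's transaction machinery instead of this toy class.
class FakeSession:
    def __init__(self):
        self.committed = []
        self.pending = []

    def add(self, obj):
        self.pending.append(obj)

@contextmanager
def atomic(session: FakeSession):
    """Commit pending work on success, discard it on failure."""
    try:
        yield session
    except Exception:
        session.pending.clear()          # "rollback"
        logger.error("Job failed; changes rolled back")
        raise
    else:
        session.committed.extend(session.pending)
        session.pending.clear()

session = FakeSession()
try:
    with atomic(session):
        session.add("vlan 100")
        raise RuntimeError("device unreachable")
except RuntimeError:
    pass

print(session.committed)  # nothing was committed after the failure
```

Because the Job author opts into the context manager, they (not the framework) now decide whether a failure rolls everything back or leaves partial progress in place.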

Revamped User Interface

Nautobot v2.1 will see a facelift of the Nautobot web UI to align with a more modern look and feel. We also hope you will enjoy the navigation moving to a sidebar.

Looking Beyond 2.0

While we have touched on a few important features in the upcoming v2.1 release, the entire v2 release train will remain focused on several network data model enhancements and exciting new automation features. Some of the things we have planned include:

  • More device metadata like software and hardware family
  • Cloud Networking models
  • Device modules
  • Breakout cables
  • External Integrations configuration management
  • Jobs workflows

Conclusion

We hope you are as excited as we are about the future of Nautobot and invite you to try it out in our demo environments. demo.nautobot.com is the current stable release (v2.0, as of this publication) and next.demo.nautobot.com is the next release we are working on (v2.1, as of this publication).

-John Anderson (@lampwins)




Network Automation Architecture – Part 04

Over the last two years, in our Telemetry blog post series, we discussed many telemetry and observability concepts, showing the characteristics of modern network telemetry. The telemetry stack and its architectural components (collector, database, visualization, and alerting) make network telemetry and observability the real evolution of network monitoring. You have probably also already heard from us about Telegraf, Prometheus, Data Enrichment, and Data Normalization. Each of these functions has already been introduced in our blog series.

Introduction to Architecture of the Network Telemetry and Observability

In this blog post, we will focus on the architecture of telemetry and observability. Over the last few years at Network to Code, we developed the Network Automation Framework, which also includes a Telemetry & Observability element. The Network Telemetry and Observability stack is a critical piece of any network automation strategy and is a prerequisite to building advanced workflows and enabling event-based network automation. While I mentioned a few of the tools above, it is important to note that not every telemetry stack is the same; the elements are composable. Thanks to rapid development in this space, many interesting and valuable tools have become available in recent years.

We introduced the architecture elements (collector, database, visualization) earlier in the series; please refer to Nikos’ blog post for the details. In this particular blog, let’s discuss what we take into consideration while architecting a telemetry and observability solution.

The process of architecting a telemetry system starts with the analysis of requirements. The most common challenges with respect to telemetry systems are as follows:

  • Heterogeneous data – data coming from different sources, in different formats (CLI, SNMP, gNMI, other)
  • Quality of the data within telemetry system (e.g., decommissioned devices, lack of normalization and enrichment)
  • Quality of the exposed data (i.e., lack of meaningful dashboards)
  • Lack of correlation between events
  • Number of tools involved (including legacy tools that have not yet been deprecated)
  • System configuration overhead (i.e., missing devices)

As you might notice, most of these challenges stem from data quality or complexity, not necessarily from the tools or software used. These challenges are often the trigger for a telemetry system overhaul or even a complete replacement.
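As a tiny example of the normalization challenge, different collectors often report the same interface under different names. The abbreviation table below is invented for illustration:

```python
import re

# Invented mapping: different data sources report interface names differently
# (e.g., CLI scraping yields "Gi0/1" while gNMI yields "GigabitEthernet0/1").
ABBREVIATIONS = {
    "Gi": "GigabitEthernet",
    "Te": "TenGigabitEthernet",
    "Eth": "Ethernet",
}

def normalize_interface(name: str) -> str:
    """Expand common short forms so 'Gi0/1' and 'GigabitEthernet0/1' match."""
    match = re.match(r"([A-Za-z]+)(.*)", name)
    if not match:
        return name
    prefix, rest = match.groups()
    return ABBREVIATIONS.get(prefix, prefix) + rest

print(normalize_interface("Gi0/1"))
```

Without a normalization step like this, queries and dashboards silently treat the same interface as two different series.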

Architecting the Telemetry System

Telemetry Stack Components

During the architecture process, we follow the stack architecture presented below. We consider the stack to be composed of the following elements: collector, database, visualization, and alerting. For detailed information about each of these, please refer to our previous blog posts.

Understanding Requirements

To start the architecture process, we have to define and understand constraints, dependencies, and requirements. Not every system is the same; each one has unique needs and serves a unique purpose.

Dividing requirements by specific component allows viewing the system as a set of functions, each serving a different purpose. Below, I present a set of example requirements; while the list is not exhaustive, it may give you an idea of how many architectures we could design, with different components fitting the use cases. Telemetry stacks are customizable; each of the functions can be implemented in a number of ways, including the integrations between components.

General Requirements – Examples

  • What is the data to be collected? (Logs? Flows? Metrics?)
  • What is the extensibility of the designed system?
  • What is the scalability of the designed system? Is horizontal scalability needed?
  • What is the expected access? (API? UI? CLI?)
  • Who will use the system, and how will they use it? (Capacity Planning Team? NOC? Ad hoc users?)
  • How will the system’s configuration be generated? (Collectors?)
  • How will the system’s load be distributed? (Regional pods?)
  • How does the organization deploy new applications?
  • How are users trained to use new applications?

Collector

  • What is the expected data resolution?
  • What is the expected data collection method? (gNMI? SNMP?)
  • What is the expected data? (BGP? System metrics?)
  • What is the deployment model? (Container on the network device? Stand-alone?)
  • Are synthetic metrics needed?

Data Distribution and Processing

  • Which data will be enriched and normalized?
  • What are the needed methods to perform data manipulations? (Regex? Enum?)
  • How will the data flow between systems? (Kafka?)
  • How will the data be validated?
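Data enrichment at this stage can be sketched as tagging each raw metric with context from a Source of Truth before it is stored; the dictionary below stands in for a real SoT query, and all names are invented:

```python
# Stand-in for a Source of Truth lookup keyed by device name.
SOT = {
    "edge-router-1": {"site": "ams01", "role": "edge"},
}

def enrich(metric: dict) -> dict:
    """Attach SoT context (site, role) to a raw metric before storage."""
    context = SOT.get(metric["device"], {})
    return {**metric, **context}

raw = {"device": "edge-router-1", "name": "ifHCInOctets", "value": 123456}
print(enrich(raw))
```

With site and role attached at ingest time, dashboards and alerts can later be filtered per site or per role without re-querying the SoT.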

Database

  • What is the preferred query language? (Influx? PromQL?)
  • What are the backfilling requirements?
  • What are the storage requirements? (Retention period?)
  • What is the preferred database type? (Relational? TSDB?)
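Retention and storage requirements translate into back-of-the-envelope arithmetic. Every number below (device counts, bytes per sample) is invented for illustration:

```python
# Invented planning numbers for a rough storage estimate.
devices = 500
metrics_per_device = 200
scrape_interval_s = 30
bytes_per_sample = 2            # ballpark for a compressed TSDB sample
retention_days = 90

samples_per_day = (86400 / scrape_interval_s) * devices * metrics_per_device
total_bytes = samples_per_day * bytes_per_sample * retention_days
print(f"~{total_bytes / 1e9:.1f} GB for {retention_days} days of retention")
```

Even crude estimates like this surface whether the retention requirement implies gigabytes or terabytes, which in turn shapes the database choice.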

Visualization

  • Can we correlate events displayed?
  • Can we create meaningful, role-based, useful dashboards?
  • Can we automatically generate dashboards? (IaC?)
  • Can we use source-of-truth data (e.g., site names) in the dashboards?

Alerting

  • What are the available integrations? (Automation Orchestrator? Email? Slack?)
  • How will the alerts be managed?
  • Can we use source-of-truth data (e.g., interface descriptions, SLAs) with the alerts?

Designing the System

The process of designing a telemetry system is preceded by understanding and collecting specific requirements, preparing the proof-of-concept (PoC) plan, and delivering the PoC itself. The PoC phase allows for verifying the requirements, testing the integrations, and visually presenting the planned solution. The PoC is aligned with the design documentation, where we document all the necessary details of the architected telemetry and observability system and find answers for, and justify, all the requirements: constraints, needs, and dependencies.

Implementing the System

Implementing a telemetry system requires us to collaborate with various teams. As we introduce the new application, we typically have to communicate with:

  • Network Engineering (system users)
  • Security (access requirements)
  • Platform (system deployment and operations)
  • Monitoring (system users)

Telemetry and observability systems are critical to every company, so we must ensure the implemented system meets all the organization’s requirements. Not only do we have to map existing functionality into the new system (e.g., existing alerts), we also have to ensure all the integrations work as expected.

Telemetry and observability implementation involves application deployment and configuration management. To achieve the best user experience, we can leverage Source of Truth (SoT) systems while managing the configurations; a modern telemetry and observability solution has the Source of Truth at its center. The configuration files are generated programmatically, using data fetched from the SoT system, which ensures that only information within the scope of the SoT is used to enrich or normalize the telemetry and observability system.

Using the System

Once the system is implemented, we work on ensuring it is used properly. There are several use cases for telemetry and observability; some usage examples involve:

  • Collecting from a new data source or new data (metric)
  • Scaling the collector system for a new planned capacity
  • Presenting new data on a dashboard or building a new dashboard
  • Adding a new alert or modifying an existing one
  • Receiving and handling (silencing, aggregating) an alert

Conclusion

As we recognize the potential challenges of any new system being introduced, we ensure the system’s functions are well known to its users. This is critical for telemetry and observability systems, as they typically introduce a set of protocols, standards, and solutions that might be new in a given environment.

-Marek


