Network Automation Architecture – Part 04


Over the last two years, in our telemetry blog post series we have discussed many telemetry and observability concepts, showing the characteristics of modern network telemetry. The telemetry stack and its architectural components (collector, database, visualization, and alerting) make network telemetry and observability the real evolution of network monitoring. You have probably also heard from us about Telegraf, Prometheus, Data Enrichment, and Data Normalization. Each of these functions has already been introduced in our blog series.

Introduction to Architecture of the Network Telemetry and Observability

In this blog post, we will focus on the Architecture of Telemetry and Observability. Over the last few years at Network to Code, we developed the Network Automation Framework, which also includes the Telemetry & Observability element. The network telemetry and observability stack is a critical piece of any network automation strategy and a prerequisite for building advanced workflows and enabling event-based network automation. While I mentioned a few of the tools above, it is important to note that not every telemetry stack is the same; the elements are composable. Thanks to rapid development in this space, many interesting and valuable tools have become available in recent years.

We introduced the architecture elements (collector, database, visualization) earlier; please refer to Nikos’ blog post for the details. In this particular blog, let’s discuss what we take into consideration while architecting a telemetry and observability solution.

The process of architecting a telemetry system starts with an analysis of requirements. The most common challenges with respect to telemetry systems are as follows:

  • Heterogeneous data – data coming from different sources, in different formats (CLI, SNMP, gNMI, other)
  • Quality of the data within telemetry system (e.g., decommissioned devices, lack of normalization and enrichment)
  • Quality of the exposed data (i.e., lack of meaningful dashboards)
  • Lack of correlation between events
  • Number of tools involved (including legacy tools that are not yet deprecated)
  • System configuration overhead (i.e., missing devices)

As you might notice, most of these challenges stem from data quality or complexity, not necessarily from the tools or software used. These challenges are often the trigger for a telemetry system overhaul or even a complete replacement.

Architecting the Telemetry System

Telemetry Stack Components

During the architecture process, we follow the stack architecture presented below. We consider the stack to be composed of the following elements: collector, database, visualization, and alerting. For detailed information about each of these, please refer to our previous blog posts.

Understanding Requirements

To start the architecture process, we have to define and understand constraints, dependencies, and requirements. Not every system is the same; each one has unique needs and serves a unique purpose.

Dividing requirements by specific component allows viewing the system as a set of functions, each serving a different purpose. Below, I present a set of example requirements; while the list is not exhaustive, it might give you an idea of how many architectures we could design with different components fitting the use cases. Telemetry stacks are customizable: each of the functions can be implemented in a number of ways, including the integrations between components.

General Requirements – Examples

  • What is the data to be collected? (Logs? Flows? Metrics?)
  • What is the extensibility of the designed system?
  • What is the scalability of the designed system? Is horizontal scalability needed?
  • What is the expected access? (API? UI? CLI?)
  • Who will use the system, and how will they use it? (Capacity Planning Team? NOC? Ad hoc users?)
  • How will the system’s configuration be generated? (Collectors?)
  • How will the system’s load be distributed? (Regional pods?)
  • How does the organization deploy new applications?
  • How are users trained to use new applications?

Collector

  • What is the expected data resolution?
  • What is the expected data collection method? (gNMI? SNMP?)
  • What is the expected data? (BGP? System metrics?)
  • What is the deployment model? (Container on the network device? Stand-alone?)
  • Are synthetic metrics needed?
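On the synthetic-metrics question above: a synthetic (derived) metric is one computed from raw collected values rather than collected directly. A minimal Python sketch, with made-up counter values, deriving interface utilization from two octet counter samples:

```python
def utilization_pct(octets_t0: int, octets_t1: int, interval_s: float, speed_bps: int) -> float:
    """Derive a synthetic utilization metric from two raw octet counter samples."""
    bits = (octets_t1 - octets_t0) * 8  # octets transferred -> bits transferred
    return round(100 * bits / (interval_s * speed_bps), 2)

# Two SNMP-style counter samples taken 30 seconds apart on a 1 Gbps link.
print(utilization_pct(1_000_000, 2_125_000, 30, 1_000_000_000))  # → 0.03
```

A collector that supports processing (e.g., via a processor plugin or recording rule) could compute this at ingest time instead of at query time.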

Data Distribution and Processing

  • Which data will be enriched and normalized?
  • What are the needed methods to perform data manipulations? (Regex? Enum?)
  • How will the data flow between systems? (Kafka?)
  • How will the data be validated?
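Several of the questions above (enrichment, normalization, regex-based manipulation) can be illustrated with a small sketch. The SoT lookup table and metric fields below are hypothetical stand-ins for a real Source of Truth query:

```python
import re

# Hypothetical SoT lookup: device name -> site (would normally be an API call).
SOT_SITES = {"edge-rtr-01": "nyc", "edge-rtr-02": "lon"}

def normalize_interface(name: str) -> str:
    """Normalize vendor short forms (Gi0/1 vs. GigabitEthernet0/1) to one format."""
    return re.sub(r"^Gi(?:gabitEthernet)?", "GigabitEthernet", name)

def enrich(metric: dict) -> dict:
    """Attach a site tag from the SoT and normalize the interface label."""
    return {
        **metric,
        "interface": normalize_interface(metric["interface"]),
        "site": SOT_SITES.get(metric["device"], "unknown"),
    }

print(enrich({"device": "edge-rtr-01", "interface": "Gi0/1", "value": 42}))
# → {'device': 'edge-rtr-01', 'interface': 'GigabitEthernet0/1', 'value': 42, 'site': 'nyc'}
```

In practice this logic would live in a processing layer (e.g., a Telegraf processor or a stream processor on Kafka) rather than in ad hoc scripts.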

Database

  • What is the preferred query language? (InfluxQL? PromQL?)
  • What are the backfilling requirements?
  • What are the storage requirements? (Retention period?)
  • What is the preferred database type? (Relational? TSDB?)

Visualization

  • Can we correlate events displayed?
  • Can we create meaningful, role-based, useful dashboards?
  • Can we automatically generate dashboards? (IaC?)
  • Can we use source-of-truth data (e.g., site names) in the dashboards?
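On generating dashboards automatically: dashboards-as-code typically means rendering the dashboard definition (for instance, Grafana’s JSON model) from source-of-truth data. A deliberately simplified sketch, with a hypothetical site list and a minimal panel structure:

```python
import json

# Hypothetical site list fetched from the Source of Truth.
sites = ["nyc", "lon"]

# A stripped-down, Grafana-like dashboard model: one time-series panel per site,
# each querying a (hypothetical) interface_utilization metric filtered by site.
dashboard = {
    "title": "Interface Utilization by Site",
    "panels": [
        {
            "title": f"Utilization - {site}",
            "type": "timeseries",
            "targets": [{"expr": f'interface_utilization{{site="{site}"}}'}],
        }
        for site in sites
    ],
}

print(json.dumps(dashboard, indent=2))
```

When a site is added to the SoT, regenerating and re-importing the JSON keeps dashboards in sync without manual edits.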

Alerting

  • What are the available integrations? (Automation Orchestrator? Email? Slack?)
  • How will the alerts be managed?
  • Can we use source-of-truth data (e.g., interface descriptions, SLAs) with the alerts?

Designing the System

The process of designing a telemetry system is preceded by understanding and collecting specific requirements, preparing the proof-of-concept (PoC) plan, and delivering the PoC itself. The PoC phase allows us to verify the requirements, test the integrations, and visually present the planned solution. The PoC is aligned with the design documentation, where we document all the necessary details of the architected telemetry and observability system and find answers for, and justify, all the requirements: constraints, needs, and dependencies.

Implementing the System

Implementing a telemetry system requires us to collaborate with various teams. As we introduce the new application, we typically have to communicate with:

  • Network Engineering (system users)
  • Security (access requirements)
  • Platform (system deployment and operations)
  • Monitoring (system users)

Telemetry and observability systems are critical to every company. We must ensure the implemented system meets all the organization’s requirements. Not only do we have to map existing functionalities into the new system (e.g., existing Alerts), we have to ensure all the integrations work as expected.

Telemetry and observability implementation involves application deployment and configuration management. To achieve the best user experience, we can leverage Source of Truth (SoT) systems while managing the configurations; this means a modern telemetry and observability solution has the Source of Truth at its center. The configuration files are generated programmatically, using data fetched from the SoT system, which ensures that only information within the scope of the SoT is used to enrich or normalize the telemetry and observability system.
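As a sketch of that idea, the snippet below renders a simplified Telegraf SNMP input stanza per device from hypothetical SoT records. The device records are made up, and the stanza is trimmed to a couple of options rather than a complete Telegraf configuration:

```python
# Hypothetical device records fetched from the SoT (e.g., via its REST API).
devices = [
    {"name": "edge-rtr-01", "mgmt_ip": "10.0.0.1"},
    {"name": "edge-rtr-02", "mgmt_ip": "10.0.0.2"},
]

# One SNMP input stanza per device, tagged with the device name from the SoT.
STANZA = """[[inputs.snmp]]
  agents = ["udp://{mgmt_ip}:161"]
  [inputs.snmp.tags]
    device = "{name}"
"""

config = "\n".join(STANZA.format(**d) for d in devices)
print(config)
```

Because the SoT is the only input, a decommissioned device disappears from the generated configuration on the next render, which addresses the “stale device” data-quality challenge mentioned earlier.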

Using the System

Once the system is implemented, we work on ensuring it is used properly. There are several use cases for telemetry and observability; some usage examples involve:

  • Collecting from a new data source or new data (metric)
  • Scaling the collector system for a new planned capacity
  • Presenting new data on a dashboard or building a new dashboard
  • Adding a new alert or modifying an existing one
  • Receiving and handling (silencing, aggregating) an alert

Conclusion

As we recognize the potential challenges of any new system being introduced, we ensure the system’s functions are well known to its users. This is critical for telemetry and observability systems, as they typically introduce a set of protocols, standards, and solutions that might be new in a certain environment.

-Marek




Network Programmability & Automation, 2nd Edition, Is Out There!


The second edition of Network Programmability and Automation is already out there!

As Jason and I announced more than one year ago in this blog, I had the honor to join the original authors (Scott Lowe and Matt Oswalt) to work on this new edition.

The goal of the book remains the same—to help network engineers who want to explore network automation and transform themselves with the skills that modern network engineering demands. Because of the broad concepts and technologies involved, this is not a simple goal. We did our best revising the first edition of the book by extending existing topics (for instance, covering classes, exceptions, and multi-threading in the Python chapter), and adding new ones, such as:

  • Cloud: Cloud Networking, Containers, Kubernetes
  • Network Development Environments: Text editors, development tools, and emulation tools (e.g., VirtualBox, Vagrant, Containerlab)
  • Go programming language
  • RESTCONF and gRPC/gNMI: new API interfaces with examples in Python and Go
  • Nornir: a Python framework to orchestrate network operations, with examples using the NAPALM plugin
  • Terraform: provisioning cloud networking resources as code
  • Network Automation Architecture: a structured approach to building network automation solutions integrating complementary solutions

We also wanted to facilitate the reproducibility of the numerous code examples, so we have published a GitHub repository with the examples referenced in the book. And, due to book length constraints, we also had to relocate some content from the first edition into an extras website.

Personally, it has been an amazing opportunity to improve how I communicate technical concepts and to help all the network engineers who, like me, are looking forward to learning and getting better. We hope this book helps you get started on your network automation journey. Enjoy it!

-Christian

PS: You can find it at Amazon.com in paperback and Kindle format.




Network Automation Architecture – Automation Engine


In the previous blogs of this series about Network Automation Architecture (1, 2, and 3), we presented the key architectural components and their respective functions. This blog expands on the role of the Automation Engine: the component that contains all the tasks that interact with the network to change its state via configuration management processes.

Introduction

The automation engine is flexible and can be built on top of many different languages and frameworks. A few examples are Python, Golang, and Rust; these can be further broken down into specific frameworks, such as Ansible, Salt, Nornir, NAPALM, and Netmiko in the Python ecosystem, Terraform for cloud, Scrapligo in the case of Golang, or even validation-focused tools like Batfish. The automation engine handles tasks such as configuration backups, rendering, compliance, and provisioning, Zero Touch Provisioning (ZTP), as well as NetDevOps practices such as Continuous Integration/Continuous Delivery (CI/CD).

The automation engine is the component that manages the network state and performs network tasks. Because this component actively makes changes to the network state, connectivity between the automation engine and in-scope devices must be permitted by security policy.

Automation Engine

There are some challenges the automation engine has to solve; interacting with network devices is complicated. Command-line interfaces (CLIs) have been the main interface for network engineers to modify and manage network equipment. More recent trends involve an API to interact with specific devices, e.g., Arista eAPI. Alternatively, some vendors are moving toward element managers that offer APIs and handle device connections within their own frameworks. Finally, YANG via NETCONF/RESTCONF/gNMI was developed to enable vendor-independent automation but is still working toward mass adoption.

CLIs were not built for automation, but over the years many projects have been built and open sourced to help solve these problems. Some of these were mentioned in the introduction; for the sake of clarity, Nornir, Scrapli(go), NAPALM, and Netmiko are all examples of frameworks that interact with CLIs and automate these tasks.

These projects generally require a few pieces of metadata:

  • Device platform – which is used to map the platform (or OS) to the network driver for the given framework in use.
  • Device credentials – how the automation engine authenticates to the network device.
  • Management IP address – IP address/FQDN that the automation engine can use to reach a network device.

Note: These are the bare minimum attributes, and they should be stored within the Source of Truth (SoT) component. The automation engine should have a method to query for the information.
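The three attributes above can be sketched as a minimal inventory, with the platform mapped to a connection driver name for whichever framework is in use. All names, hostnames, and the platform-to-driver table below are illustrative, not a real framework’s API:

```python
# Hypothetical inventory queried from the SoT; attribute names are illustrative.
inventory = {
    "edge-rtr-01": {
        "platform": "cisco_ios",                 # device platform / OS
        "username": "automation",                # in practice, pulled from a secrets manager
        "password": "REDACTED",
        "host": "edge-rtr-01.example.com",       # management IP address / FQDN
    },
}

# Map the platform (or OS) to the network driver of the framework in use.
PLATFORM_TO_DRIVER = {"cisco_ios": "ios", "arista_eos": "eos", "juniper_junos": "junos"}

def driver_for(device: str) -> str:
    """Resolve the connection driver for a device from its SoT platform field."""
    return PLATFORM_TO_DRIVER[inventory[device]["platform"]]

print(driver_for("edge-rtr-01"))  # → ios
```

Frameworks like Nornir formalize exactly this pattern: an inventory of hosts with platform and credential attributes, resolved to per-platform connection plugins.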

While APIs have helped aid the adoption of automation and made interaction with these devices simpler, each vendor’s API is implemented differently. The automation engine must provide a flexible interface capable of manipulating parameters and reading multiple returned data formats (e.g., XML, JSON).
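One way to provide that flexibility is to normalize every response into a common Python structure regardless of the wire format. A minimal standard-library sketch (the payloads are made up):

```python
import json
import xml.etree.ElementTree as ET

def parse_payload(payload: str, fmt: str) -> dict:
    """Normalize a vendor API response (JSON or flat XML) into one Python dict."""
    if fmt == "json":
        return json.loads(payload)
    if fmt == "xml":
        root = ET.fromstring(payload)
        return {child.tag: child.text for child in root}  # flat XML only
    raise ValueError(f"unsupported format: {fmt}")

print(parse_payload('{"hostname": "edge-rtr-01"}', "json"))
print(parse_payload("<device><hostname>edge-rtr-01</hostname></device>", "xml"))
```

Both calls yield the same dict, so downstream tasks can be written once against the normalized structure instead of once per vendor format.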

Main Challenges

While considering configuration management and the automation engine in general, some of the key challenges are listed below. This is not an exhaustive list.

  • Configuration Management:
    • Configuration Rendering: A few topics to consider: full configuration rendering, partial configuration rendering, secrets interpolation.
      • Secrets Management: How to pull secrets from an external secrets management system, Ansible Vault, or other?
    • Configuration Remediation: It’s one thing to do a diff and understand what is extra and what is missing. (As an example, this is solved in Nautobot Golden Config App.) It’s a completely different challenge to remediate those configurations.
    • Configuration Deployment: The process of deploying a rendered configuration onto an element.
  • Configuration Provisioning: Creating objects, such as creating an EC2 instance, Network Functions Virtualization (NFV) Appliance, or network service (such as an AWS IGW).
  • System Load Distribution:
    • What security posture do we need to adhere to?
    • Only certain subnets can speak to management networks?
    • Only certain communication protocols are allowed?
  • Operational Actions:
    • Rebooting a device.
    • Reset IPSEC tunnel.
    • Bounce an interface.
    • Bounce a BGP neighbor.
  • Operational Compliance and Checks: What operational data should be collected, how should the data be transformed?

For some of the more advanced topics mentioned above, the next section provides additional details and considerations.

Challenges Clarified

Let’s dive deeper into the nuances of some of these topics.

  • Full vs. partial configuration deployments: This challenge may seem simple, but it’s actually quite complex. Before you can push a configuration, you must be able to render it; before you can render it, you must have the source of truth data. This is truly a crawl, walk, run situation. What are some things you need to consider?
    • Merge vs. Replace
      • Replace at what level? A full configuration replace is generally easier than a partial configuration merge; Junos allows stanza-level replacements, but most OSes do not.
    • How to push a subset of the configuration. Identify configuration snippets that are least impactful, but provide a great Return on Investment (ROI).
    • How to validate a configuration deployment via a CI/CD pipeline (Fail Fast).
      • This is also an iterative approach. Start simple and grow the complexity.
      • Check out Batfish.
  • Secrets interpolation: Most vendors have configuration lines that require credential/secret values to be populated. The rendering of the configuration by the automation engine must be flexible and secure enough to do this without exposing the secrets to unintended audiences.
  • Remediating a configuration: Remediation of a configuration based on a diff of actual and intended state comes with business requirements around the organization’s confidence level, e.g., remediating the configuration completely (including removing “extras”) vs. just adding the “missing” configuration elements.
    • An engine like hier_config can provide a remediation plan.
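The missing/extra split mentioned above can be sketched with a naive, flat line diff. A real remediation engine such as hier_config understands configuration hierarchy (parent/child stanzas), which this sketch deliberately ignores:

```python
def config_diff(intended: str, actual: str) -> dict:
    """Flat line-based diff: which intended lines are missing, which actual lines are extra."""
    intended_lines = set(intended.splitlines())
    actual_lines = set(actual.splitlines())
    return {
        "missing": sorted(intended_lines - actual_lines),
        "extra": sorted(actual_lines - intended_lines),
    }

intended = "ntp server 10.0.0.1\nsnmp-server community secret RO"
actual = "ntp server 10.0.0.9\nsnmp-server community secret RO"
print(config_diff(intended, actual))
# → {'missing': ['ntp server 10.0.0.1'], 'extra': ['ntp server 10.0.0.9']}
```

Pushing only the “missing” list is the low-confidence option; also removing the “extra” list (with the correct `no`/delete syntax per platform) is the full-remediation option discussed above.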

As you can see from the challenges above, there are many questions you must answer. Once these questions are answered, it becomes much easier to try to choose an automation engine that fits your organization’s goals.

Choosing an Automation Engine

One of the biggest challenges with the Automation Engine component of this architecture is picking the right tool(s) for the job. There is no shortage of open source tools that fit this component of the architecture; furthermore, there is an ever-expanding catalog of closed source / vendor specific tools that aim to accomplish the tasks.

This is an interesting topic. Throughout the years, NTC has engaged with many customers. Even customers at the most basic entry point in their network automation journey are already using the automation engine element; a simple one-off script that collects data off a device fits this component. Since this component is, in most cases, one of the first to be selected, it’s not always easy to convince a client that other options exist.

For these and many other reasons, we’ve found that most of the automation engine options available can achieve great results if you have the rest of the automation architecture in place. Selecting the right engine for your business comes down to skill set, previous adoption, willingness to learn, and in some cases having product support, which many large enterprises rely on today.

Regardless of the application/framework in use, the automation engine communicates with network devices. And as mentioned in Network Automation Architecture – The Components, it’s important to understand the automation engine not as an isolated component, but as the final executor of the outcome of the other components.

Furthermore, there will be situations where a single automation engine does not meet the business requirements; in these circumstances, multiple automation engines can be used. However, effort should be made to keep the number of different automation engines to a minimum; otherwise, the learning curve and skill set needed to operate and maintain this component become too complex and slow adoption.

Some of the characteristics to consider are mentioned below:

  • Does the tool have an API?
    • Most automation engines have an API, but is it robust? Is it RESTful, and does it support all the CRUD operations? Are there other types of APIs, like GraphQL?
  • Does the tool integrate with the SoT?
  • Does the tool have a User Interface (UI)?
  • Is the tool flexible enough to meet the customer’s RBAC requirements?
  • Credential Management
  • The ability to create rich and complex Forms
  • Job Isolation
  • Network Device Support
  • Secrets Integration
  • Scheduler
  • Traceability / Logging

Advanced Concepts

One of the biggest challenges related to the automation engine is the connectivity conundrum that exists in enterprises. The security of networks continues to grow in complexity, and the management control plane of network devices is no different. In many cases, centralized applications aren’t allowed to connect to network devices. Whether that is due to DMZ design, geolocation issues, or mergers and acquisitions, the automation engine must be flexible enough to run inside those pods.

Here are some of the existing solutions to this problem.

| Automation Engine | Solution                       |
| ----------------- | ------------------------------ |
| Ansible           | Execution Environments         |
| Python            | Celery, Redis (RQ), Taskmaster |
| Saltstack         | Master/Minion                  |
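As a rough illustration of the dispatcher/worker pattern these solutions share, here is a standard-library sketch: a central dispatcher enqueues device tasks, and a worker local to a region (a stand-in for an execution pod) executes them. Region and device names are made up, and the “work” is just a string for brevity:

```python
import queue
import threading

tasks = queue.Queue()   # central dispatcher's task queue
results = []
lock = threading.Lock()

def worker(region: str):
    """A worker 'pod' with local connectivity to its region's devices."""
    while True:
        device = tasks.get()
        if device is None:          # sentinel: shut the worker down
            tasks.task_done()
            return
        with lock:                  # stand-in for the real per-device job
            results.append(f"{region} backed up {device}")
        tasks.task_done()

t = threading.Thread(target=worker, args=("emea",))
t.start()
for device in ["edge-rtr-01", "edge-rtr-02"]:
    tasks.put(device)
tasks.put(None)                     # signal shutdown after the work
tasks.join()
t.join()
print(results)
```

Ansible Execution Environments, Celery workers, and Salt minions all elaborate on this shape: the controller never needs direct reachability to the devices, only to the workers.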

Closing

To close out this blog, I want to show what a release process with validation steps might look like in a high-level diagram. This diagram came directly from one of the webinars Ken Celenza and I did: Community Webinar: Using Batfish for Network & Routing Verification.


Conclusion

Keep an eye out for the remaining parts of this series!

Cheers, -Jeff


