Introduction to a Telemetry Stack – Part 4


Welcome back for Part 4 of the Telemetry Stack! series. The action is steadily ramping up, and, sticking with The Fast and the Furious analogy, we actually have two guest stars (read: services) featured in this blog. However, I don’t want to spoil the surprise, so you’ll just have to read on!

In this post we will focus on advanced alerting techniques, such as the deadman switch and standard deviation-based anomaly detection. Then we will see how we can use a few Prometheus / Alertmanager integrations for alert and incident management.

Prerequisites

As this blog is part of a series, it builds on what we have explored in the previous posts. Knowledge of the telemetry stack TPG (Telegraf, Prometheus, Grafana) and the basics of metrics gathering and alerting is advised. These topics can all be explored or refreshed at the following links:

Advanced Alerting Overview

As perfectly stated in Xenia’s previous post, an alert is “an alarm or other signal of danger” and must be a “meaningful signal of urgency and not constant white noise that is often ignored”. There are many philosophies to alerting, but we tend to take a page from the Google SRE Book, specifically Ch. 6 – Monitoring Distributed Systems, as a guiding principle.

The power of metrics, and subsequently alerts generated from those metrics, can often encourage an “alert on all the things” behavior. And while it looks great on a coverage spreadsheet, I have found that ultimately it leads to alert oversaturation and on-call exhaustion. As an observability team, we must find a way to design meaningful alert and response contracts with our stakeholders. And as painful as it might sound, not every alert is critical. An overuse of critical or emergency will only serve to create the Cry-Wolf Phenomenon. In other words, assign severity with an overabundance of caution.

We will be exploring just a few concepts here that can turbocharge your alerting, keep your team sane, and perhaps work towards that ever lofty goal of simplicity over complexity.

Deadman Switch

Ahh, the infamous deadman switch. A powerful technique with a grotesque name that you might have interacted with at some point in your daily life! If you have ever operated a lawn mower, ridden a jet ski, or taken the subway, you have interacted with a deadman switch. It’s essentially a safety feature to disable the machine if the human operator becomes disabled for whatever reason.

The deadman switch is typically used in monitoring systems to indicate that something went wrong in your observability pipeline. It could be Telegraf failing to gather, Prometheus failing to store, or Alertmanager going offline. It can be a form of self-monitoring or watching the watcher.

The concept is actually quite simple: send an alert when a metric we expect to be there isn’t! To be clear, we are not interested in the value of the metric, but rather in whether it ceases to exist.

Let’s look at some examples.

Here I will stop my Telegraf monitoring container, thus eliminating the gathering of interface metrics for device ceos-01. We will take advantage of the absent() function, which returns an empty vector if the metric exists, or a one-element vector with the value 1 if it does not. The screenshot shows the times and graph of our now missing metrics.

Deadman Graph

This could indicate that the device stopped responding to polling for numerous reasons. Correlated with up{job="telegraf"} != 1, it could also tell us that the Telegraf poller itself stopped, which is exactly what happened here.
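As an aside, a companion rule on the up metric can page when the poller itself disappears. Here is a minimal sketch, assuming a Prometheus scrape job named telegraf:

  - name: Telegraf Down
    rules:
      - alert: TelegrafDown
        expr: up{job="telegraf"} != 1
        for: 2m
        labels:
          severity: high
          source: prometheus
          environment: Production
        annotations:
          summary: "Telegraf scrape target is down"
          description: "Prometheus can no longer scrape {{ $labels.instance }}. Check the Telegraf container."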

Here is what an example alerting rule in Prometheus might look like utilizing absent().

  - name: Missing Device Metrics
    rules:
      - alert: MissingDeviceMetrics
        expr: absent(interface_admin_status)
        for: 2m
        labels:
          severity: high
          source: telegraf
          environment: Production
        annotations:
          summary: "Device metrics not being collected"
          description: "Metrics for {{ $labels.device }} are missing. Check device or collector"

Now that you have seen an example of a deadman alert, can you think of other ways you would use this in a metrics pipeline? Remember, alert when something is missing!
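For instance, the same pattern works for routing-protocol telemetry. Here is a hypothetical sketch that fires when the BGP session metrics gathered by our execd collector stop arriving for a device:

  - name: Missing BGP Metrics
    rules:
      - alert: MissingBgpMetrics
        expr: absent(bgp_sessions_prefixes_received{device="ceos-01"})
        for: 5m
        labels:
          severity: medium
          source: telegraf
          environment: Production
        annotations:
          summary: "BGP session metrics not being collected"
          description: "BGP session metrics for ceos-01 are missing. Check the device or the execd collector."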

Recording Rules

“Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. Querying the precomputed result will then often be much faster than executing the original expression every time it is needed.” – Prometheus Docs

A great example of recording rules would be pre-calculating the rates of interface traffic over a period of time and then storing the result as a separate metric for quick querying when alerting or graphing. In this example, I will set up a recording rule to gather the inbound interface traffic, but from six hours ago! This will allow us to graph historical traffic on top of current traffic, which could be done easily enough in this example with an ad hoc query. However, imagine a recording rule that lets you compare traffic week over week or over the last month! This opens the door to seasonality in your alerts.

Here is our recording rule. We take the rate of interface_in_octets, offset by 6h, and multiply by 8 to convert the unit back to bits per second (bps).

---
groups:
  - name: recording
    rules:
      - record: interface:interface_in_octets:rate6h
        expr: rate(interface_in_octets[5m] offset 6h) * 8

It is difficult to see, but the dotted blue line is the mgmt0 traffic from six hours ago. 

Recording Rule Historical Traffic
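To overlay the recorded series on top of the live traffic in Grafana, you might plot both of the following queries on the same panel. This is a sketch, assuming the recording rule above and the device label used elsewhere in this series:

rate(interface_in_octets{device="ceos-01"}[5m]) * 8
interface:interface_in_octets:rate6h{device="ceos-01"}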

Standard Deviation and Anomaly Detection

Let’s pretend that we have been tasked with creating a rule to alert on network device CPU usage for multiple device vendors in our environment. One manufacturer might set the “normal” CPU load at anything less than 80%, while another might consider anything higher than 60% to be a problem. How can we solve this without creating tens if not hundreds of threshold rules and variations of these rules? Answer: Standard deviation.

“In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.” – Wikipedia

Standard deviation is a fantastic method for finding potentially anomalous behavior, and it further frees us from rule and alert overload. Instead of very tedious and specific threshold alerts, we can rely on basic statistics that are more generic, and therefore cover more use cases.

Another thing to note here is that threshold alerting is a fantastic (sarcasm) way to create false positives that will ultimately have your team ignoring these alerts and suffering mightily during an on-call rotation. In other words, use them with caution!

What does this mean? If we record (remember those recording rules?) the long-running average of what we are interested in, we can determine whether the current (now) value sits outside of that average (mean) by some number of standard deviations. That number is called the z-score.
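Written as a single PromQL expression for our cpu_used example (the recording rules below simply split this into reusable pieces), the z-score of the recent value is:

(avg_over_time(cpu_used[5m]) - avg_over_time(cpu_used[1d])) / stddev_over_time(cpu_used[1d])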

Here is an example of Percentage CPU used on a device (green line) and the plot of the standard deviation for a small average window (blue line). In this example, it is quite easy to see the CPU usage anomalies when the blue line exceeds ~3 on the right axis.

 Stddev Graph

For z-score parameters, “Based on the statistical principles of normal distributions, we can assume that any value that falls outside of the range of roughly +3 to -3 is an anomaly.” – GitLab Anomaly Detection Using Prometheus

Therefore, we can create rules to record and alert when we have a z-score outside of +3/-3.

  - name: CPU Recording Rules
    rules:
      - record: cpu:cpu_used:avg1d
        expr: avg_over_time(cpu_used[1d])

      - record: cpu_stddev:cpu_used:stddev1d
        expr: stddev_over_time(cpu_used[1d])

  - name: CPU Used Anomaly
    rules:
      - alert: CpuUsedAnomaly
        expr: abs((avg_over_time(cpu_used[5m]) - cpu:cpu_used:avg1d) / cpu_stddev:cpu_used:stddev1d) >= 3
        for: 1m
        labels:
          severity: medium
          source: telegraf
          environment: Production
        annotations:
          summary: "Potential CPU Usage Anomaly Detected"
          description: "A CPU usage anomaly possibly detected for {{ $labels.device }} on {{ $labels.name }}"

Prometheus and Alertmanager Integrations

Prometheus and its alerting component, Alertmanager, benefit from a great number of popular integrations that organizations can leverage. What exactly are integrations? Let’s use the example of the popular team messaging application, Slack. Alertmanager has an integration to send messages to a Slack workspace and channel with a highly customizable message format.
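For example, a Slack receiver in Alertmanager might look like the following minimal sketch (the webhook URL and channel here are placeholders):

receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/<your_slack_webhook>
        channel: "#network-alerts"
        send_resolved: true
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'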

Here is a short list of Alertmanager integrations:

  • email
  • opsgenie
  • pagerduty
  • slack
  • VictorOps
  • webhook

Many more alerting integrations are available via the Alertmanager webhook receiver; these can be explored in the Prometheus documentation.

Integrations can also work with Alertmanager by consuming the Alertmanager API directly. One very useful tool for visualizing alerts comes to mind here: Karma. Karma is designed to visualize alerts using a very modern and unique method of grouping. You can take some actions against the alerts, but it is probably best used as a visualization dashboard.

That brings us to our first guest star. Let’s dive into Alerta, shall we?

Alerta

Alerta Dashboard 

Alerta is a fully integrated alerting dashboard that allows NOC/SRE/NRE users to perform actions against alerts, create notes, create blackout windows, and generate reports. It supports a multitude of authentication and authorization mechanisms, alert grouping and correlation, and a rich API.

Configuring Alerta is as simple as defining a webhook receiver for it in Alertmanager. For example:

  receivers:
  - name: alerta
    webhook_configs:
    - send_resolved: true
      url: http://alerta-01:8080/api/webhooks/prometheus

So, let’s see it in action with our own alerts!

Here, I trigger a CPU usage anomaly, using a slightly lower z-score threshold so that an alert is actually generated. First, the alert is detected by Prometheus and sent to Alertmanager.

Prometheus CPU Anomaly

Then, Alertmanager routes the alert to Alerta based on the labels in the alert. Here we see the alert list, and you can spot the CPU Anomaly alert.

Clicking the alert will display it in detail along with all of the label sets and any other associated data. 

Alerta CPU Alert List Detail

PagerDuty

PagerDuty is an industry-leading incident management system with over 650 integrations! It handles incident response, runbook automation, on-call scheduling, and business operations, all from a single SaaS platform.

In this section, I will demonstrate just how easy it is to integrate with PagerDuty via its Events API v2 endpoint, which is fully supported by Alertmanager. We will configure our example to only send to PagerDuty for alerts labeled with a critical severity. It is crucial to think of your on-call staff, SLAs, and alert exhaustion (not to mention on-call PTSD). I always try to approach severity classification with the following mantra: “Is this serious enough to wake someone up at 3am to respond?” Again, I fall back to the Google SRE Book, specifically Ch. 4 – Service Level Objectives.

Alertmanager Config

---
global:
  resolve_timeout: 30m

route:
  # Let's set a default route, as required
  receiver: alerta
  routes:
    - group_by:
        - alertname
      match:
        source: stack
      receiver: alerta
    - group_by:
        - alertname
      match:
        source: stack
        severity: critical
      receiver: pagerduty

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <your_pager_duty_eventsv2_routing_key>

  - name: alerta
    webhook_configs:
      - url: http://alerta-01:8080/api/webhooks/prometheus
        send_resolved: true

Job Down Alert

PagerDuty Incident

Conclusion

Phew! Now that was a lot to cover in just one blog! As we could easily go down the rabbit hole on each of these topics, I will be providing a list of links for follow-up reading, especially around the anomaly detection, as it is an entire blog unto itself.

Together, we explored some advanced alerting concepts, such as the deadman, where we learned that we could alert on missing metrics. Then came recording rules and their power to store pre-computed metrics that would otherwise be computationally expensive to query. Those same recording rules enabled our next topic, standard deviation, which showed how to get out of the threshold alert rule game and move toward deviation-based alerts that can flag anomalous behavior.

Finally, our guest stars of the hour: we took a look at two of our favorite Prometheus alerting integrations here at NTC (with an honorable mention of Karma), Alerta and PagerDuty. We saw how to leverage Alerta for alert management with RBAC and hand-off, and how to page our on-call staff when things are really critical.

I hope you enjoyed this blog post. Stay tuned for the next installment in our telemetry series! Rumor has it, a wild antlered animal is the main star!

-David Richey

Links for further reading:




Introduction to a Telemetry Stack – Part 2


Just like The Fast and the Furious movies, we are going to be churning out sequels like no other! Welcome to Part 2 of the Telemetry Stack! series, where we walk you through the different stages of bringing insight into your infrastructure. Although there won’t be a special appearance from Ludacris in this sequel, you are in for a heck of a ride!

In this post we will focus on the concept of normalizing data between multiple systems and adding value with enrichment. To help follow along with some of the keywords used in this post, I recommend checking out Part 1 written by Nikos Kallergis for a refresher.

Normalization and Enrichment

During Part 1 we discussed the TPG stack, its different layers, and how to get started with Telegraf. Now it’s time to talk about processing those metrics into something more useful!


Have you ever run into the issue where different versions of software return different metric names like bgp_neighbor versus bgp-neighbor? What about metrics that don’t quite have all the data you’d like? This is where processing can help solve a lot of headaches by allowing you to normalize and enrich the metrics before passing them into your database.

Normalizing Data

One of the toughest situations to work with in telemetry is that almost every vendor is different. This means that sometimes your BGP metrics can come in with different labels or fields, which can introduce all kinds of trouble when trying to sort them in graphs or alerting. Normalizing the data allows you to adjust different fields and labels to either tune them to your environment, or to enforce naming standards.

Enriching Data

Enriching data can be very powerful and can take your metrics to a whole new level. Sure, some vendors do an amazing job at returning all the data you need, but what about the data that they can’t provide? With data enrichment you can add labels or fields to your metrics to track things like site location, rack location, customer IDs, and even SLA information for tenants.

NOTE:
Prometheus uses labels to determine the uniqueness of a metric. If you change the label of an existing metric, you may lose graph history in Grafana. You would need to update your query to pull for both the old and new labels so that they are combined.
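For example, if an interface label were renamed from intf to interface (hypothetical label names), a PromQL query like the following keeps both generations of the series on one graph:

interface_in_octets{interface="Ethernet1"} or interface_in_octets{intf="Ethernet1"}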

Normalizing Data Using Telegraf

Using our scenario from above, let’s normalize some BGP data and modify a few metric fields to make sure they match and are standard across the board.

[[processors.rename]]

  # ---------------------------------------------------
  # Normalize BGP Data
  # ---------------------------------------------------
  [[processors.rename]]
    order = 1
    namepass = ["bgp*"]

    [[processors.rename.replace]]
      field = "installed"
      dest = "prefixes_installed"

    [[processors.rename.replace]]
      field = "sent"
      dest = "prefixes_sent"

    [[processors.rename.replace]]
      field = "received"
      dest = "prefixes_received"

It looks like a bit of a mess at first, but if you look closely, it’s pretty straightforward. Two plugin-level options of [[processors.rename]] are worth calling out:

  • order allows us to set the order in which processors are executed. It’s not required, but if you don’t specify it, the order will be random.
  • namepass is an array of glob pattern strings. Only measurements whose names match one of these patterns are handled by the processor.

With a simple processor like this, we are able to catch any BGP fields that come in as installed and transform them into prefixes_installed to ensure they match our metrics pulled from other agents.

- bgp_neighbor{installed="100", sent="100", received="150", neighbor="10.17.17.1"} 1
+ bgp_neighbor{prefixes_installed="100", prefixes_sent="100", prefixes_received="150", neighbor="10.17.17.1"} 1

[[processors.enum]]

Another powerful processor in Telegraf is enum. The enum processor allows the configuration of value mappings for field or tag values. Its main use is to translate between numeric codes and human-readable strings.

  # ---------------------------------------------------
  # Normalize status codes
  # ---------------------------------------------------
  [[processors.enum]]
    order = 3
    namepass = ["storage"]

    [[processors.enum.mapping]]
      tag = "status"
      [processors.enum.mapping.value_mappings]
        1 = "READ_ONLY"
        2 = "RUN_FROM_FLASH"
        3 = "READ_WRITE"

With this enum config, all storage metrics will have their status tag updated so that the end result is no longer a number and is easier to read.

- storage{type="scsi", status="1", host="server01"} 1500000
+ storage{type="scsi", status="READ_ONLY", host="server01"} 1500000
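The mapping works in the other direction too. Here is a hypothetical sketch that gives the string-valued session_state tag on our BGP metrics a numeric companion tag, which is handy for graphing and alerting on state transitions:

  [[processors.enum]]
    order = 4
    namepass = ["bgp_sessions"]

    [[processors.enum.mapping]]
      tag = "session_state"
      dest = "session_state_code"
      [processors.enum.mapping.value_mappings]
        idle = 1
        connect = 2
        active = 3
        opensent = 4
        openconfirm = 5
        established = 6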

Sometimes even simple normalizations can save you from some of those dreaded late-night calls from your NOC. Turning a raw value into a more user-friendly one will prevent a lot of headaches during outages as well.

Enriching Data Using Telegraf

When it comes to enrichment, you can perform either what we call static enrichment or dynamic enrichment. Static enrichment is based on the Telegraf configuration file, which means it is only valid for the lifecycle of that configuration. Sometimes we want flexibility without a dependency on configuration changes or Telegraf redeployments, which is where dynamic enrichment comes in.

Static Enrichment

Telegraf has a lot of processors that can be used for enrichment, but we will focus on the regex plugin. This plugin allows you to match a particular pattern and create static labels and values from it.

  [[processors.regex]]
    order = 3
    namepass = ["interface_admin_status"]

    [processors.regex.tagpass]
      device = ["ceos-01"]

    [[processors.regex.tags]]
        key = "interface"
        pattern = "^Management1$"
        replacement = "mgmt"
        result_key = "intf_role"
- interface_admin_status{device="ceos-01", interface="Management1"} 1
+ interface_admin_status{device="ceos-01", interface="Management1", intf_role="mgmt"} 1

This is great, but wouldn’t it be better if this label could be updated with a change inside Nautobot? Well, this is where dynamic enrichment comes in.

Dynamic Enrichment

With dynamic enrichment we can take it a step further by pulling values from a single source of truth like Nautobot. In the next example, I will give you a glimpse into an upcoming project that is still in progress but will hopefully be released soon, so keep a lookout for the blog post!

Let me give you a sneak peek into network-agent. The network-agent project is built as a ‘batteries included’ Telegraf/Python-based container targeted for network metrics consumption and processing. The network-agent container comes with a lot of features, but for now we will only focus on the Nautobot processor.

Key features of this processor:

  • GraphQL-based queries to Nautobot for simplicity and speed.
  • JMESPath query for easy data extraction.
  • LRU caching to reduce API calls for metric enrichment.

NOTE:
The default cache TTL is set to 120 seconds. This means that the cache will remain valid until this timer has passed. After that, another GraphQL query to Nautobot is sent to check for new interfaces and roles.

This is what the configuration can look like:

[nautobot]
# Nautobot URL and Token specified using environment variables
graphql_query = """
query ($device: [String!]) {
  devices(name: $device) {
    name
    interfaces(tag: "intf_pri__tier1") {
      name
      cf_role
      tags {
        name
      }
    }
  }
}
"""

[enrich.interface.tag_pass]
  device = 'ceos-*'
  name = "interface*"

[enrich.interface.extract]  # JMESPATH
  interface_role = "devices[0].interfaces[?name==''].cf_role | [0]"

With this processor, we are able to query Nautobot for devices and filter the results to only interfaces with an intf_pri__tier1 tag. The information is then cached and can be used during the enrichment process.

[enrich.interface.tag_pass]

With the device and name options, we are able to control which specific metrics will get enriched with our new label.

[enrich.interface.extract]

This is where we define our new label that will get added to the metrics and the JMESPath query to grab our value. In this case, we will be taking the custom field called role out of Nautobot and adding it to all our interface metrics for our ceos devices.

- interface_admin_status{device="ceos-01", interface="Ethernet1"} 1
+ interface_admin_status{device="ceos-01", interface="Ethernet1", interface_role="border"} 1

Conclusion

Metric labels can be extremely powerful for both troubleshooting global infrastructure and capacity planning for companies. Whether you are using enrichment to add customer_id to BGP metrics or using normalization to remove those pesky special characters from your interface descriptions, telemetry can do it all.

-Donnie




Introduction to a Telemetry Stack – Part 1


You know that “bookending” technique in movies like Fight Club and Pulp Fiction where they open up with the chronologically last scene? Well, this is more or less the same, but instead of a movie it’s the Telemetry Stack! NTC blog post series. And this is what your network monitoring solution could look like. Well, maybe not exactly like this but you get the idea, right?

Telemetry Stack

This series of blog posts, which will be released over the coming weeks, dives into detail on how to build a contemporary network monitoring system using widely adopted open-source tools like Telegraf, Prometheus, Grafana, and Nautobot. Also, lots of “thanks” go to Josh VanDeraa and his very detailed blog posts that have served as a huge inspiration for this series!

In part 1 of the series, we’ll focus on two aspects:

  • Introduction to the Telemetry Stack!
  • Capturing metrics and data using Telegraf

Introduction to the Telemetry Stack!

During the blog post series, we’ll explore the components that comprise the TPG (Telegraf, Prometheus, Grafana) Telemetry Stack! 

Telemetry Stack Architecture

Telegraf

In short, Telegraf is a metrics and events collector written in Go. Its simple architecture and plethora of built-in modules allow for easily capturing metrics (input modules), processing them sequentially (processor modules), and storing them in a multitude of back-end systems (output modules). For the enrichment part, Nautobot will be leveraged as the Source of Truth (SoT) that holds the additional information.

Prometheus

As already mentioned, Prometheus will be the TSDB of choice for storing our processed metrics and data.
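A minimal sketch of the matching Prometheus scrape configuration, assuming a Telegraf host named telegraf-01 exposing metrics on the prometheus_client output’s default port:

scrape_configs:
  - job_name: telegraf
    scrape_interval: 60s
    static_configs:
      - targets: ["telegraf-01:9273"]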

Grafana

In order to present the data in a human-friendly form, we’ll be visualizing them using Grafana. Dashboard design is an art of its own, but we’ll attempt to present a basic set of dashboards that could serve as inspiration for creating your own.

Alertmanager

Using Grafana dashboards sprinkled with sane thresholds, we’re able to create alerting mechanisms triggered whenever a threshold is crossed.
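To make that concrete, here is a minimal sketch of what a threshold-style alerting rule looks like in Prometheus form, assuming a cpu_used percentage metric (Part 4 of this series discusses when thresholds are, and are not, the right tool):

groups:
  - name: cpu_threshold
    rules:
      - alert: HighCpuUsage
        expr: cpu_used > 80
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "CPU usage above 80% on {{ $labels.device }}"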

Capturing Metrics and Data Using Telegraf

The rest of this post is dedicated to using Telegraf to get data from network devices. Now, one (or probably more) may wonder “why Telegraf?”, and that’s indeed a question we also asked ourselves when designing this series. We decided to go with Telegraf based on the following factors:

  • it works and we like it
  • its configuration is comparatively easy to generate using templates
  • it’s a very flexible solution, allowing us to do SNMP, gNMI, and also execute Python scripts
  • it uses a very simple flow between its components

Telegraf Base Architecture

Inputs: Telegraf provides a ton of input methods out of the box, like snmp, gnmi, and execd. These are the components that enable us to capture metrics and data from our targets. Incidentally, this is also the main topic in the second half of this post, so more details may be found there.

Processors: As with inputs, Telegraf also comes loaded with a bunch of processor modules that allow us to manipulate our collected data. In this blog post series, our focus will be normalization and enrichment of the captured metrics. This will be the main topic in the second post of the series.

Outputs: Last, once we’re happy with the processors’ results, we use outputs to store the transformed data. Three of the most common Time Series Databases (TSDBs) used for storing metrics are InfluxDB (from the same vendor as Telegraf), Elasticsearch, and of course Prometheus, which will be the output used throughout the series.
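Putting the three together, a Telegraf configuration skeleton has roughly the following shape. This is a sketch only; the device name is a placeholder, and complete input and processor blocks appear later in this post and in Part 2:

[agent]
  interval = "60s"

# Input: where metrics come from (SNMP, gNMI, execd, ...)
[[inputs.snmp]]
  agents = ["ceos-01"]
  version = 2
  community = "public"

# Processor: normalize/enrich the metrics in flight
[[processors.rename]]
  order = 1
  [[processors.rename.replace]]
    field = "installed"
    dest = "prefixes_installed"

# Output: where the metrics land; here, an endpoint that Prometheus scrapes
[[outputs.prometheus_client]]
  listen = ":9273"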

For the purposes of this blog post, we’ll be using two Arista cEOS machines. So, with all that out of our way, let’s dive into our various input methods!

[[inputs.snmp]]

Like the old-time gray-beards that we are, we’ll begin our journey using SNMP to perform a simple metrics capture.

# ------------------------------------------------
# Input - SNMP
# ------------------------------------------------
[[inputs.snmp]]
  agents = ["ceos-02"]
  version = 2
  community = "${SNMPv2_COMMUNITY}"
  interval = "60s"
  timeout = "10s"
  retries = 3

  [inputs.snmp.tags]
    collection_method = "snmp"
    device = "ceos-02"
    device_role = "router"
    device_platform = "arista"
    site = "lab-site-01"
    region = "lab"
    net_os = "eos"
  # ------------------------------------------------
  # Device Uptime (SNMP)
  # ------------------------------------------------
  [[inputs.snmp.field]]
    name = "uptime"
    oid = "RFC1213-MIB::sysUpTime.0"

  # ----------------------------------------------
  # Device Storage Partition Table polling (SNMP)
  # ----------------------------------------------
  [[inputs.snmp.table]]
    name = "storage"

    # Partition name
    [[inputs.snmp.table.field]]
      name = "name"
      oid = "HOST-RESOURCES-MIB::hrStorageDescr"
      is_tag = true

    # Size in bytes of the data objects allocated to the partition
    [[inputs.snmp.table.field]]
      name = "allocation_units"
      oid = "HOST-RESOURCES-MIB::hrStorageAllocationUnits"

    # Size of the partition storage represented by the allocation units
    [[inputs.snmp.table.field]]
      name = "size_allocation_units"
      oid = "HOST-RESOURCES-MIB::hrStorageSize"

    # Amount of space used by the partition represented by the allocation units
    [[inputs.snmp.table.field]]
      name = "used_allocation_units"
      oid = "HOST-RESOURCES-MIB::hrStorageUsed"

Note: A detailed line-by-line explanation of the Telegraf configuration may be found in the excellent Network Telemetry for SNMP Devices blog post by Josh.

[[inputs.gnmi]]

Now that we’ve covered the capabilities of old-school MIB/OID-based NMS, let’s jump to what all the cool kids are playing with these days: gNMI, the gRPC Network Management Interface. Bit of a mouthful! The main benefits of using gNMI for telemetry are its speed and efficiency. Thanks to the magic of Telegraf, we are able to capture data with just a few more lines of configuration.

# ------------------------------------------------
# Input - gNMI
# ------------------------------------------------
[[inputs.gnmi]]
  addresses = ["ceos-02:50051"]
  username = "${NETWORK_AGENT_USER}"
  password = "${NETWORK_AGENT_PASSWORD}"
  redial = "20s"
  tagexclude = [
      "identifier",
      "network_instances_network_instance_protocols_protocol_name",
      "afi_safi_name",
      "path",
      "source"
  ]

  [inputs.gnmi.tags]
    collection_method = "gnmi"
    device = "ceos-02"
    device_role = "router"
    device_platform = "arista"
    site = "lab-site-01"
    region = "lab"
    net_os = "eos"

  # ---------------------------------------------------
  # Device Interface Counters (gNMI)
  # ---------------------------------------------------
  [[inputs.gnmi.subscription]]
    name = "interface"
    path = "/interfaces/interface/state/counters"
    subscription_mode = "sample"
    sample_interval = "10s"

  [[inputs.gnmi.subscription]]
    name = "interface"
    path = "/interfaces/interface/state/admin-status"
    subscription_mode = "sample"
    sample_interval = "10s"

  [[inputs.gnmi.subscription]]
    name = "interface"
    path = "/interfaces/interface/state/oper-status"
    subscription_mode = "sample"
    sample_interval = "10s"

  # ---------------------------------------------------
  # Device Interface Ethernet Counters (gNMI)
  # ---------------------------------------------------
  [[inputs.gnmi.subscription]]
    name = "interface"
    path = "/interfaces/interface/ethernet/state/counters"
    subscription_mode = "sample"
    sample_interval = "10s"

  # ----------------------------------------------
  # Device CPU polling (gNMI)
  # ----------------------------------------------
  [[inputs.gnmi.subscription]]
    name = "cpu"
    path = "/components/component/cpu/utilization/state/instant"
    subscription_mode = "sample"
    sample_interval = "10s"

  # ----------------------------------------------
  # Device Memory polling (gNMI)
  # ----------------------------------------------
  [[inputs.gnmi.subscription]]
    name = "memory"
    path = "/components/component/state/memory"
    subscription_mode = "sample"
    sample_interval = "10s"

Note: Another great blog from Josh touches on the same subjects: Monitor Your Network With gNMI, SNMP, and Grafana.

[[inputs.execd]]

The third and last input that we’ll examine in this blog is execd; even cooler than the gNMI cool kids! Practically, it’s a way for Telegraf to run external programs and capture the data they emit in a metrics format. In our case, the program lives in a container image that makes it easy to collect metrics using a multitude of different methods. In the following simple example, it is used to collect BGP information over the EOS RESTful API.

# ------------------------------------------------
# Input - Execd command
# ------------------------------------------------
[[inputs.execd]]
  interval = "60s"
  signal = "SIGHUP"
  restart_delay = "10s"
  data_format = "influx"
  command = [
    '/usr/local/bin/network_agent',
    '-h',
    'ceos-02',
    '-d',
    'arista_eos',
    '-c',
    'bgp_sessions::http',
  ]
  [inputs.execd.tags]
    collection_method = "execd"
    device = "ceos-02"
    device_role = "router"
    device_platform = "arista"
    site = "lab-site-01"
    region = "lab"
    net_os = "eos"

Collected Data

At this point, we’ve managed to collect various metrics and data from our target devices. We still have to process them, store them, and visualize them but these are the topics of the blog posts that will follow. For now, we may take a peek into the collected data to verify that they’ve been captured successfully, before “passing” them to the normalization/enrichment step of the process.

[[inputs.snmp]]

# HELP snmp_uptime Telegraf collected metric
# TYPE snmp_uptime untyped
snmp_uptime{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",net_os="eos",region="lab",site="lab-site-01"} 14082

# HELP storage_allocation_units Telegraf collected metric
# TYPE storage_allocation_units untyped
storage_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Core",net_os="eos",region="lab",site="lab-site-01"} 4096
storage_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Flash",net_os="eos",region="lab",site="lab-site-01"} 4096
storage_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Log",net_os="eos",region="lab",site="lab-site-01"} 4096
storage_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM",net_os="eos",region="lab",site="lab-site-01"} 1024
storage_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM (Buffers)",net_os="eos",region="lab",site="lab-site-01"} 1024
storage_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM (Cache)",net_os="eos",region="lab",site="lab-site-01"} 1024
storage_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM (Unavailable)",net_os="eos",region="lab",site="lab-site-01"} 1024
storage_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Root",net_os="eos",region="lab",site="lab-site-01"} 4096
storage_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Tmp",net_os="eos",region="lab",site="lab-site-01"} 4096
# HELP storage_size_allocation_units Telegraf collected metric
# TYPE storage_size_allocation_units untyped
storage_size_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Core",net_os="eos",region="lab",site="lab-site-01"} 400150
storage_size_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Flash",net_os="eos",region="lab",site="lab-site-01"} 2.5585863e+07
storage_size_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Log",net_os="eos",region="lab",site="lab-site-01"} 400150
storage_size_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM",net_os="eos",region="lab",site="lab-site-01"} 1.6005976e+07
storage_size_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM (Buffers)",net_os="eos",region="lab",site="lab-site-01"} 1.6005976e+07
storage_size_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM (Cache)",net_os="eos",region="lab",site="lab-site-01"} 1.6005976e+07
storage_size_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM (Unavailable)",net_os="eos",region="lab",site="lab-site-01"} 1.6005976e+07
storage_size_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Root",net_os="eos",region="lab",site="lab-site-01"} 2.5585863e+07
storage_size_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Tmp",net_os="eos",region="lab",site="lab-site-01"} 16384
# HELP storage_used_allocation_units Telegraf collected metric
# TYPE storage_used_allocation_units untyped
storage_used_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Core",net_os="eos",region="lab",site="lab-site-01"} 0
storage_used_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Flash",net_os="eos",region="lab",site="lab-site-01"} 7.955702e+06
storage_used_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Log",net_os="eos",region="lab",site="lab-site-01"} 14989
storage_used_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM",net_os="eos",region="lab",site="lab-site-01"} 7.242028e+06
storage_used_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM (Buffers)",net_os="eos",region="lab",site="lab-site-01"} 953836
storage_used_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM (Cache)",net_os="eos",region="lab",site="lab-site-01"} 2.134188e+06
storage_used_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="RAM (Unavailable)",net_os="eos",region="lab",site="lab-site-01"} 4.148248e+06
storage_used_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Root",net_os="eos",region="lab",site="lab-site-01"} 7.955702e+06
storage_used_allocation_units{collection_method="snmp",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Tmp",net_os="eos",region="lab",site="lab-site-01"} 0

[[inputs.gnmi]]

# HELP cpu_instant Telegraf collected metric
# TYPE cpu_instant untyped
cpu_instant{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="CPU0",net_os="eos",region="lab",site="lab-site-01"} 3
cpu_instant{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="CPU1",net_os="eos",region="lab",site="lab-site-01"} 5
cpu_instant{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="CPU2",net_os="eos",region="lab",site="lab-site-01"} 3
cpu_instant{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="CPU3",net_os="eos",region="lab",site="lab-site-01"} 3

# HELP interface_in_broadcast_pkts Telegraf collected metric
# TYPE interface_in_broadcast_pkts untyped
interface_in_broadcast_pkts{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_crc_errors Telegraf collected metric
# TYPE interface_in_crc_errors untyped
interface_in_crc_errors{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_discards Telegraf collected metric
# TYPE interface_in_discards untyped
interface_in_discards{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_errors Telegraf collected metric
# TYPE interface_in_errors untyped
interface_in_errors{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_fcs_errors Telegraf collected metric
# TYPE interface_in_fcs_errors untyped
interface_in_fcs_errors{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_fragment_frames Telegraf collected metric
# TYPE interface_in_fragment_frames untyped
interface_in_fragment_frames{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_jabber_frames Telegraf collected metric
# TYPE interface_in_jabber_frames untyped
interface_in_jabber_frames{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_mac_control_frames Telegraf collected metric
# TYPE interface_in_mac_control_frames untyped
interface_in_mac_control_frames{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_mac_pause_frames Telegraf collected metric
# TYPE interface_in_mac_pause_frames untyped
interface_in_mac_pause_frames{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_multicast_pkts Telegraf collected metric
# TYPE interface_in_multicast_pkts untyped
interface_in_multicast_pkts{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 25
# HELP interface_in_octets Telegraf collected metric
# TYPE interface_in_octets untyped
interface_in_octets{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 4349
# HELP interface_in_oversize_frames Telegraf collected metric
# TYPE interface_in_oversize_frames untyped
interface_in_oversize_frames{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_in_unicast_pkts Telegraf collected metric
# TYPE interface_in_unicast_pkts untyped
interface_in_unicast_pkts{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 21
# HELP interface_out_broadcast_pkts Telegraf collected metric
# TYPE interface_out_broadcast_pkts untyped
interface_out_broadcast_pkts{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_out_discards Telegraf collected metric
# TYPE interface_out_discards untyped
interface_out_discards{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_out_errors Telegraf collected metric
# TYPE interface_out_errors untyped
interface_out_errors{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_out_mac_control_frames Telegraf collected metric
# TYPE interface_out_mac_control_frames untyped
interface_out_mac_control_frames{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_out_mac_pause_frames Telegraf collected metric
# TYPE interface_out_mac_pause_frames untyped
interface_out_mac_pause_frames{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_out_multicast_pkts Telegraf collected metric
# TYPE interface_out_multicast_pkts untyped
interface_out_multicast_pkts{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_out_octets Telegraf collected metric
# TYPE interface_out_octets untyped
interface_out_octets{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0
# HELP interface_out_unicast_pkts Telegraf collected metric
# TYPE interface_out_unicast_pkts untyped
interface_out_unicast_pkts{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Ethernet1",net_os="eos",region="lab",site="lab-site-01"} 0

# HELP memory_available Telegraf collected metric
# TYPE memory_available untyped
memory_available{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Chassis",net_os="eos",region="lab",site="lab-site-01"} 1.6390119424e+10
# HELP memory_utilized Telegraf collected metric
# TYPE memory_utilized untyped
memory_utilized{collection_method="gnmi",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",name="Chassis",net_os="eos",region="lab",site="lab-site-01"} 7.414636544e+09

[[inputs.execd]]

# HELP bgp_sessions_prefixes_received Telegraf collected metric
# TYPE bgp_sessions_prefixes_received untyped
bgp_sessions_prefixes_received{collection_method="execd",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",local_as="65111",neighbor_address="10.1.2.2",net_os="eos",peer_as="65222",peer_router_id="10.17.17.2",peer_type="external",region="lab",router_id="10.17.17.1",routing_instance="default",session_state="established",site="lab-site-01"} 1
bgp_sessions_prefixes_received{collection_method="execd",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",local_as="65111",neighbor_address="10.1.7.2",net_os="eos",peer_as="65222",peer_router_id="0.0.0.0",peer_type="external",region="lab",router_id="10.17.17.1",routing_instance="default",session_state="active",site="lab-site-01"} 0
# HELP bgp_sessions_prefixes_sent Telegraf collected metric
# TYPE bgp_sessions_prefixes_sent untyped
bgp_sessions_prefixes_sent{collection_method="execd",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",local_as="65111",neighbor_address="10.1.2.2",net_os="eos",peer_as="65222",peer_router_id="10.17.17.2",peer_type="external",region="lab",router_id="10.17.17.1",routing_instance="default",session_state="established",site="lab-site-01"} 1
bgp_sessions_prefixes_sent{collection_method="execd",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",local_as="65111",neighbor_address="10.1.7.2",net_os="eos",peer_as="65222",peer_router_id="0.0.0.0",peer_type="external",region="lab",router_id="10.17.17.1",routing_instance="default",session_state="active",site="lab-site-01"} 0
# HELP bgp_sessions_session_state_code Telegraf collected metric
# TYPE bgp_sessions_session_state_code untyped
bgp_sessions_session_state_code{collection_method="execd",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",local_as="65111",neighbor_address="10.1.2.2",net_os="eos",peer_as="65222",peer_router_id="10.17.17.2",peer_type="external",region="lab",router_id="10.17.17.1",routing_instance="default",session_state="established",site="lab-site-01"} 6
bgp_sessions_session_state_code{collection_method="execd",device="ceos-01",device_platform="arista",device_role="router",environment="dev",host="telegraf-01",local_as="65111",neighbor_address="10.1.7.2",net_os="eos",peer_as="65222",peer_router_id="0.0.0.0",peer_type="external",region="lab",router_id="10.17.17.1",routing_instance="default",session_state="active",site="lab-site-01"} 3

Conclusion

The best parts of the Telemetry Stack! series are yet to come, so stay tuned! For any question you may have, feel free to join our Slack community.

-Nikos


