Alerting with Prometheus


Over the past several posts, I have discussed how to gather metrics about your infrastructure and web applications. Now, the natural progression is to move into alerting with Prometheus. This post builds on the previous post on gathering website and DNS responses. I will take you through how to set up a rule that fires whenever a website returns something other than a 200 OK response. To accomplish this we will look at the metric http_response_http_response_code gathered via Telegraf.

Prometheus Setup

You configure rules in files and reference those file names within the Prometheus configuration. A common practice is to name the file alert.rules within the /etc/prometheus/ directory.

The following outlines what the file will contain. The alert rules are defined in a YAML file that specifies the alert name (alert), the expression (expr) to evaluate in Prometheus, and the duration (for) that the expression must be true before the alert fires. Additional keys are available as well, such as labels and annotations, as demonstrated below:

groups:
- name: websites
  rules:
  - alert: WebsiteDown
    expr: http_response_http_response_code != 200
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} is not responding with 200 OK."

This is what the configuration will look like in the prometheus.yml file. The rules file created above is added to the array under the rule_files key, which allows multiple rule files to be processed by Prometheus.

global:
  scrape_interval: 15s

rule_files:
  - alert.rules

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'telegraf website'
    scrape_interval: 10s
    static_configs:
      - targets:
        - "localhost:9012"

Once the rules are loaded, you can verify them by going to the Prometheus URL – http://<hostname_or_ip>:9090/rules. You will see which rules are loaded:

prometheus_rules
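
You can also validate the rules file from the command line before loading it into Prometheus. A quick sketch using promtool, which ships with Prometheus (the file path assumes the /etc/prometheus/ location mentioned above):

promtool check rules /etc/prometheus/alert.rules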

Prometheus AlertManager

Now you have a configuration for the alerts, but how do you actually manage them? You’ll need to add another application to the environment: Prometheus AlertManager. AlertManager is where you handle the silencing, deduplicating, grouping, and routing of alerts to the appropriate outputs. These destinations include, but are not limited to, Slack, email, or webhooks. The AlertManager configuration page has the details on how to configure each of these receivers:

  • Email
  • HipChat
  • PagerDuty
  • Pushover
  • Slack
  • OpsGenie
  • Webhook
  • VictorOps
  • WeChat

AlertManager Installation

Installation can be done in several ways: there are binaries available for many common platforms, Docker containers, and installation from source. In this demo, I will simply install the binary release, using wget to download the archive.
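
As a sketch, downloading the 0.20.0 Linux binary might look like the following; the URL is an assumption based on the standard GitHub releases naming for the archive extracted below:

wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz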

Once the file is downloaded, we will expand it within the directory:

tar -xzf alertmanager-0.20.0.linux-amd64.tar.gz

AlertManager Configuration

The AlertManager configuration is handled in the alertmanager.yml file. An example may look like:

route:
  group_by: [alertname]
  # Send all notifications to me.
  receiver: email-me

receivers:
- name: email-me
  email_configs:
  - to: $GMAIL_ACCOUNT
    from: $GMAIL_ACCOUNT
    smarthost: smtp.gmail.com:587
    auth_username: "$GMAIL_ACCOUNT"
    auth_identity: "$GMAIL_ACCOUNT"
    auth_password: "$GMAIL_AUTH_TOKEN"
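
Before starting AlertManager, you can optionally sanity-check the file with amtool, which is included in the same archive (a quick sketch; adjust the path to wherever you extracted the binaries):

./amtool check-config alertmanager.yml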

AlertManager Execution

To start this test instance of AlertManager, execute ./alertmanager --config.file="alertmanager.yml":

$ ./alertmanager --config.file="alertmanager.yml"
level=info ts=2020-05-21T15:14:56.850Z caller=main.go:231 msg="Starting Alertmanager" version="(version=0.20.0, branch=HEAD, revision=f74be0400a6243d10bb53812d6fa408ad71ff32d)"
level=info ts=2020-05-21T15:14:56.850Z caller=main.go:232 build_context="(go=go1.13.5, user=root@00c3106655f8, date=20191211-14:13:14)"
level=info ts=2020-05-21T15:14:56.859Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=10.250.0.83 port=9094
level=info ts=2020-05-21T15:14:56.868Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2020-05-21T15:14:56.883Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=alertmanager.yml
level=info ts=2020-05-21T15:14:56.883Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=alertmanager.yml
level=info ts=2020-05-21T15:14:56.885Z caller=main.go:497 msg=Listening address=:9093

You can see that the application starts up and then displays the listening address, indicating that in this instance AlertManager is listening on port 9093.
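
If you want to confirm from the command line that the instance is up, a quick check against the health endpoint might look like this (a sketch):

curl http://localhost:9093/-/healthy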

Prometheus Alerts in Action

Now that the configuration has been covered, let’s take a look at how it all comes together.

To see the status of the alerts within the Prometheus environment, navigate to the Alerts menu item, or to the URL http://<hostname_or_ip>:9090/alerts. Once there, you will see the status of each rule defined in the rule files:

alert_list

At this point there are no websites down. To confirm this, you can search for ALERTS within the Prometheus graph page. If nothing is alerting, you should get the message No datapoints found. This helps you determine whether an alert is firing but being suppressed somewhere, or whether something else is wrong with the configuration.
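
For reference, here are a couple of illustrative queries against the built-in ALERTS series, which carries an alertstate label:

ALERTS{alertstate="pending"}
ALERTS{alertstate="firing"}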

At this point I am going to have my DNS server deny access to the ServiceNow website. This will simulate the service being unavailable.

Prometheus Alerts Pending

After some time the website becomes unresponsive. On the Alerts management page you can see that Prometheus has detected the website is down, but the condition has not yet been true for the configured duration (1 minute), so the rule is shown as 1 pending.

prometheus_alert_start
prometheus_alert_start2

Prometheus Alerts Firing

Once the defined duration has elapsed, the alert moves from a Pending state to a Firing state. At this point Prometheus has sent the alert to AlertManager to handle its processing.

First, let’s take a look at the Prometheus Alerts page. It shows that the alert has moved into the Firing phase, with the same information you saw in the Pending state but now shown in red.

prometheus_firing1

Now, moving on to the Graph section of Prometheus and searching for ALERTS, you can see the state of the alert over time.

In the first graph, the mouse cursor hovers over the section where the alert was in a Pending state; the second graph shows hovering over the Firing state. Each gives you additional information to help debug if alerts are not reaching their destination.

fired_graph_1
fired_graph_2

Prometheus AlertManager Firing

The last image is the view from the AlertManager perspective. It shows which alerts have been triggered and the labels attached to each alert.

alert_mgr_fired

Summary

This wraps up (for now) this series of posts focused on leveraging Telegraf, Prometheus, and Grafana to monitor your environment. Take a look at the post list below for the others in the series and jump on into the Network to Code Slack Telemetry channel to start a conversation on what you are doing, what you want to do, or just to talk network telemetry!

Hope this has been helpful!

-Josh (@vanderaaj)




Monitoring Websites with Telegraf and Prometheus


In network service delivery, the network exists so that applications can ride on it. Yes, even voice is considered an application when it rides over the top of the network. In previous posts we explored how to get telemetry data from your network devices to understand how they are performing from a device perspective. Now, in this post, I will move on to monitoring web applications and DNS using Telegraf, Prometheus, and Grafana. Often your operations teams will receive reports of a website not working for a user, or you may simply want more visibility into your own web services. The following method can be used to get more insight into those applications and the name resolution they require.

There are also several other Telegraf inputs available including ping (ICMP) and TCP tests. As of this post in May 2020 there are 181 different input plugins available to choose from. Take a look at the Telegraf plugins for more details and explore what other plugins you may be able to use to monitor your environment.

I will not be going into the setup of these tools, as this is already covered in the previous posts in the series. Those posts can help you get up and running when it comes to monitoring your network devices via CLI, SNMP, and gNMI.

Blackbox exporter from Prometheus is also a valid choice for this process, and I encourage you to try both the Telegraf and Blackbox exporters in your environment.

Sequence Diagram

sequence

Telegraf Setup – HTTP Response

Telegraf has the HTTP Response plugin that does exactly what we need for gathering metrics about an HTTP response. It lets you define the list of websites you wish to monitor and set options for proxy, response timeout, method, any data you want to include in the body, and the expected responses. Take a look at the plugin documentation for more details. Here is the configuration used for this demonstration:

#####################################################
#
# Check on status of URLs
#
#####################################################
[[inputs.http_response]]
  urls = ["https://www.networktocode.com", "https://blog.networktocode.com", "https://www.service-now.com"]
  method = "GET"
  follow_redirects = true

#####################################################
#
# Export Information to Prometheus
#
#####################################################
[[outputs.prometheus_client]]
  listen = ":9012"
  metric_version = 2

Upon executing this, here are the relevant Prometheus metrics that we are gathering:

# HELP http_response_content_length Telegraf collected metric
# TYPE http_response_content_length untyped
http_response_content_length{method="GET",result="success",result_type="success",server="https://blog.networktocode.com",status_code="200"} 1.791348e+06
http_response_content_length{method="GET",result="success",result_type="success",server="https://www.networktocode.com",status_code="200"} 123667
http_response_content_length{method="GET",result="success",result_type="success",server="https://www.service-now.com",status_code="200"} 478636
# HELP http_response_http_response_code Telegraf collected metric
# TYPE http_response_http_response_code untyped
http_response_http_response_code{method="GET",result="success",result_type="success",server="https://blog.networktocode.com",status_code="200"} 200
http_response_http_response_code{method="GET",result="success",result_type="success",server="https://www.networktocode.com",status_code="200"} 200
http_response_http_response_code{method="GET",result="success",result_type="success",server="https://www.service-now.com",status_code="200"} 200
# HELP http_response_response_time Telegraf collected metric
# TYPE http_response_response_time untyped
http_response_response_time{method="GET",result="success",result_type="success",server="https://blog.networktocode.com",status_code="200"} 0.371015121
http_response_response_time{method="GET",result="success",result_type="success",server="https://www.networktocode.com",status_code="200"} 0.186775794
http_response_response_time{method="GET",result="success",result_type="success",server="https://www.service-now.com",status_code="200"} 0.658694795
# HELP http_response_result_code Telegraf collected metric
# TYPE http_response_result_code untyped
http_response_result_code{method="GET",result="success",result_type="success",server="https://blog.networktocode.com",status_code="200"} 0
http_response_result_code{method="GET",result="success",result_type="success",server="https://www.networktocode.com",status_code="200"} 0
http_response_result_code{method="GET",result="success",result_type="success",server="https://www.service-now.com",status_code="200"} 0

You have several pieces that come back right away including:

  • content_length: The length of the response content
  • response_code: The HTTP response code
  • response_time: How long the request took to complete
  • result_code: A Telegraf-defined code that maps a successful (OK) result to 0 (see the sample queries below)
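
As illustrative examples, these metrics can be queried directly in Prometheus; the first shows average response time per site, and the second surfaces any site not returning a 200:

avg by (server) (http_response_response_time)
http_response_http_response_code != 200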

Telegraf – DNS Check

dns_sequence

On top of this, I want to show how to add a second input. We will add a DNS query to verify that name resolution for the sites is working as expected. This could also be extended to test and verify DNS from a user perspective within your environment.

#####################################################
#
# Check on status of URLs
#
#####################################################
[[inputs.http_response]]
  urls = ["https://www.networktocode.com", "https://blog.networktocode.com", "https://www.service-now.com"]
  method = "GET"
  follow_redirects = true

[[inputs.dns_query]]
  servers = ["8.8.8.8"]
  domains = ["blog.networktocode.com", "www.networktocode.com", "www.servicenow.com"]

#####################################################
#
# Export Information to Prometheus
#
#####################################################
[[outputs.prometheus_client]]
  listen = ":9012"
  metric_version = 2

The new section is:

[[inputs.dns_query]]
  servers = ["8.8.8.8"]
  domains = ["blog.networktocode.com", "www.networktocode.com", "www.servicenow.com"]

Based on the plugin definition, we configure the Google public DNS resolver (8.8.8.8). The domains we are going to verify are blog.networktocode.com, www.networktocode.com, and the popular ITSM tool ServiceNow.

Here is what gets added to the Prometheus Client output:

# HELP dns_query_query_time_ms Telegraf collected metric
# TYPE dns_query_query_time_ms untyped
dns_query_query_time_ms{domain="blog.networktocode.com",rcode="NOERROR",record_type="NS",result="success",server="8.8.8.8"} 70.950858
dns_query_query_time_ms{domain="www.networktocode.com",rcode="NOERROR",record_type="NS",result="success",server="8.8.8.8"} 48.118903
dns_query_query_time_ms{domain="www.servicenow.com",rcode="NOERROR",record_type="NS",result="success",server="8.8.8.8"} 48.552328
# HELP dns_query_rcode_value Telegraf collected metric
# TYPE dns_query_rcode_value untyped
dns_query_rcode_value{domain="blog.networktocode.com",rcode="NOERROR",record_type="NS",result="success",server="8.8.8.8"} 0
dns_query_rcode_value{domain="www.networktocode.com",rcode="NOERROR",record_type="NS",result="success",server="8.8.8.8"} 0
dns_query_rcode_value{domain="www.servicenow.com",rcode="NOERROR",record_type="NS",result="success",server="8.8.8.8"} 0
# HELP dns_query_result_code Telegraf collected metric
# TYPE dns_query_result_code untyped
dns_query_result_code{domain="blog.networktocode.com",rcode="NOERROR",record_type="NS",result="success",server="8.8.8.8"} 0
dns_query_result_code{domain="www.networktocode.com",rcode="NOERROR",record_type="NS",result="success",server="8.8.8.8"} 0
dns_query_result_code{domain="www.servicenow.com",rcode="NOERROR",record_type="NS",result="success",server="8.8.8.8"} 0

The corresponding values gathered from the dns_query input are:

  • dns_query_query_time_ms: Amount of time it took for the query to respond
  • dns_query_rcode_value: Return code value for a DNS entry
  • dns_query_result_code: Code defined by Telegraf for the response
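
A couple of illustrative queries over these metrics: average lookup time per domain, and any lookup whose return code is not NOERROR (0):

avg by (domain) (dns_query_query_time_ms)
dns_query_rcode_value != 0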

Prometheus

The configuration for Prometheus at this point has a single addition to gather the statistics for each of the websites:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'telegraf website'
    scrape_interval: 10s
    static_configs:
      - targets:
        - "localhost:9012"

When you navigate to the Prometheus graph page to check how the data polling is going, you can build a quick graph. Here you can see that all three sites appear on the graph with respect to response time:

prometheus

Grafana

What does it look like to get this information into a graph on Grafana?

Grafana – Websites

grafana_graph

Building this chart takes only a small configuration. In the Metrics section I used just the query http_response_response_time, and I set the legend to {{ server }} so the website address is used as the series label.

In the Visualization section, the only thing that needs to be done is to set the Left Y axis Unit to seconds (s) to give the Y axis the proper unit.
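
Summarized, the panel settings described above are roughly:

Metrics   http_response_response_time
Legend    {{ server }}
Unit      seconds (s)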

Grafana – DNS

dns_response_time

This is another small panel configuration, similar to the previous one. In the Metrics section the corresponding query to get response time is dns_query_query_time_ms. Set the legend to {{ domain }} to match the label in the query output shown above.

In the Visualization section, use the Unit of milliseconds (ms). If you copied this panel from the website panel, don’t forget to change this; the unit of measure is different, and the time scale would otherwise be off.


Conclusion

Hopefully this post helps you gain some insight into your environment. We have already been using this process internally at Network to Code, keeping an eye on the key services we rely on so we can tell whether there is an issue on our side or with the service itself. Let us know your thoughts and comments! To continue the conversation, check out the #Telemetry channel inside the Network to Code Slack. Sign up at slack.networktocode.com.

-Josh




Monitor Your Network With gNMI, SNMP, and Grafana


This post, the second in a series focused on using Telegraf, Prometheus, and Grafana for Network Telemetry, will focus on transforming data and making additional graphs within Grafana. This post will cover the following topics:

  • Telegraf
    • Gathering streaming data with gNMI, as an alternative to SNMP
    • Changing data with Enum and Replacement
    • Tagging Data
  • Prometheus
    • Prometheus Query Language (PromQL)
  • Advancing Your Grafana Capabilities
    • Variables
    • Tables (BGP Table)
    • Device Dashboards vs Environment Dashboards

Here is where you can find the first post in the series on how to gather data from SNMP based devices.

Purpose

The intent of this post is to demonstrate how to bring multiple telemetry gathering methods into one place. In our experience, a successful telemetry & analytics stack should be able to collect data transparently from SNMP, streaming telemetry (gNMI), and CLI/API. We covered SNMP and CLI gathering in previous posts; this post focuses on gathering telemetry data with gNMI. Beyond the collection of data, when we are collecting the same type of data from multiple sources it’s important to ensure that the data ends up in a consistent format in the database. In this post, we’ll look at how Telegraf can help normalize and decorate the data before sending it to the database.

Network Topology

topology

In the topology there is a mix of devices per the table below:

Device Name    Device Type     Telemetry Source
houston        Cisco IOS-XE    SNMP
amarillo       Cisco NXOS      SNMP
austin         Cisco IOS-XR    gNMI
el-paso        Cisco IOS-XR    gNMI
san-antonio    Cisco IOS-XR    gNMI
dallas         Cisco IOS-XR    gNMI

This blog post was created based on a Cisco-only environment, but if you’re interested in a multi-vendor approach check out @damgarros’s NANOG 77 presentation on YouTube, which shows how to use only gNMI to collect data from Arista, Juniper, and Cisco devices in a single place. The topology used here is meant to show collection from multiple sources (SNMP + gNMI) in one stack.

Application Installs Note

Software installation was covered in the previous post in this series. I recommend taking a look at that post for the installation instructions, or heading over to the product pages referenced in the introduction.

Overview

Here is the sequence of events addressed in this post. Telegraf gathers gNMI data from the network devices and processes it into Prometheus metrics, which are then scraped by a Prometheus server. Grafana then generates graphs from the data that has been gathered and processed.

sequence_diagram

gNMI Introduction

gNMI stands for gRPC (Remote Procedure Calls) Network Management Interface. gRPC is a standard developed by Google that leverages HTTP/2 for transport using Protocol Buffers. gNMI is a gRPC-based protocol for getting configuration and telemetry from a network device. All messages are defined as protocol buffers, which keeps the data on the wire as small and efficient as possible; the device serializes the data into the proper format and sends it off to be read by the receiver. You can take a look at the gNMI reference for more detailed information.

gNMI can handle not only the telemetry data that this post is about, but is also intended to transport device configuration.

So why use gNMI? gRPC is incredibly fast and efficient at transmitting data, and by extension gNMI is also fast and efficient.

gNMI Cisco Configuration

gNMI is supported by many of today’s leading network vendors. As an example, here are the configuration lines needed to enable gNMI on a Cisco IOS-XR device in this demo environment:

grpc
 port 50000
 no-tls

Pretty straight to the point. If you wish to create a subscription model within Cisco IOS-XR there are more detailed configuration options available; take a look at Cisco’s Guide to Configure Model-driven Telemetry.

Telegraf

Gathering Streaming Data With gNMI

The first step I will walk through is setting up Telegraf to subscribe to gNMI data, specifically to collect telemetry data from the IOS-XR devices in this lab scenario. With gNMI, as with other streaming telemetry subscriptions, you tell the network device that you want to subscribe to receive the data, and the device then sends periodic telemetry updates to the receiver. The subscriber sends a periodic keep-alive message to keep the subscription active.

gnmi

gNMI Telegraf Configuration

Telegraf has a plugin that will take care of the subscription and the input section looks like the code below. Note that the subscription port is defined within the addresses section.

[[inputs.cisco_telemetry_gnmi]]
    addresses = ["dallas.create2020.ntc.cloud.tesuto.com:50000"]
    username = <redacted>
    password = <redacted>

    ## redial in case of failures after
    redial = "10s"
    tagexclude = ["openconfig-network-instance:/network-instances/network-instance/protocols/protocol/name"]

    [[inputs.cisco_telemetry_gnmi.subscription]]
        origin = "openconfig-interfaces"
        path = "/interfaces/interface"

        subscription_mode = "sample"
        sample_interval = "10s"

    [[inputs.cisco_telemetry_gnmi.subscription]]
        name = "bgp_neighbor"
        origin = "openconfig-network-instance"
        path = "/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state"

        subscription_mode = "sample"
        sample_interval = "10s"

[[outputs.prometheus_client]]
  listen = ":9011"

The configuration shows that you define the address, username, and password. It also shows a redial setting used in case of failure, and a tag that is excluded from the resulting metrics via tagexclude.

There are two subscriptions that we are subscribing to in this instance:

  • openconfig-interfaces
  • openconfig-network-instance (To collect BGP neighbor state)

In each case the sample interval is 10 seconds for this demo, which means the device sends the statistics every 10 seconds and new metrics are available for Prometheus to scrape at the same cadence. The sample interval and the Prometheus scrape interval should match.
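
For completeness, the corresponding Prometheus scrape job for this Telegraf listener might look like the following sketch; the job name is an assumption, and the interval matches the 10-second sample interval above:

scrape_configs:
  - job_name: 'telegraf gnmi'
    scrape_interval: 10s
    static_configs:
      - targets:
        - "localhost:9011"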

To collect the telemetry for this demo we are once again using the Prometheus client output from Telegraf. Telegraf will collect, process, and format the data that will then be scraped by a Prometheus server. Let’s take a look at what that output looks like next.

gNMI Output – BGP

I’m only going to look at a few of the items in the output here; there are too many to list without taking up too much screen real estate.

# HELP bgp_neighbor_messages_received_UPDATE Telegraf collected metric
# TYPE bgp_neighbor_messages_received_UPDATE untyped
bgp_neighbor_messages_received_UPDATE{device="dallas",identifier="BGP",name="default",neighbor_address="10.0.0.1",peer_type="EXTERNAL",role="leaf"} 9
bgp_neighbor_messages_received_UPDATE{device="dallas",identifier="BGP",name="default",neighbor_address="10.0.0.17",peer_type="EXTERNAL",role="leaf"} 0
bgp_neighbor_messages_received_UPDATE{device="dallas",identifier="BGP",name="default",neighbor_address="10.0.0.25",peer_type="EXTERNAL",role="leaf"} 9
bgp_neighbor_messages_received_UPDATE{device="dallas",identifier="BGP",name="default",neighbor_address="10.0.0.9",peer_type="EXTERNAL",role="leaf"} 9

Some items were removed to assist readability.

The output is what you would expect: a list of the neighbors identified by the neighbor_address label in the tags. With the BGP subscription you get:

  • bgp_neighbor_established_transitions
  • bgp_neighbor_last_established
  • bgp_neighbor_messages_received_NOTIFICATION
  • bgp_neighbor_messages_received_UPDATE
  • bgp_neighbor_messages_sent_NOTIFICATION
  • bgp_neighbor_messages_sent_UPDATE
  • bgp_neighbor_peer_as
  • bgp_neighbor_queues_output
  • bgp_neighbor_session_state

gNMI Output – Interfaces

There are a lot of statistics sent back with the interface subscription. We’ll look at just one of them here, interface_state_counters_in_octets, which shows each interface and its associated counter.

interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/0",role="leaf"} 3.2022595e+07
interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/1",role="leaf"} 3.077077e+06
interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/2",role="leaf"} 1.5683204947e+10
interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/3",role="leaf"} 1.627459e+06
interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/4",role="leaf"} 1.523158e+06
interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/5",role="leaf"} 35606
interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/6",role="leaf"} 35318
interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/7",role="leaf"} 35550
interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/8",role="leaf"} 35878
interface_state_counters_in_octets{device="dallas",name="GigabitEthernet0/0/0/9",role="leaf"} 36684
interface_state_counters_in_octets{device="dallas",name="MgmtEth0/RP0/CPU0/0",role="leaf"} 2.2033861e+07
interface_state_counters_in_octets{device="dallas",name="Null0",role="leaf"} 0
interface_state_counters_in_octets{device="dallas",name="SINT0/0/0",role="leaf"} 0

This is great information, and we have seen something similar with SNMP. Now to the transformations that Telegraf offers.

Changing data with Enum and Replacement

Telegraf has a couple of different processors available to process the data and get it into a format that is appropriate and consistent for your environment. Let’s take a look at a couple of them and how they are used in the use case here.

Telegraf – Enum

The first processor is used within the BGP data collection. When the session state comes back from the subscription, it comes back as a string value. That is great for reading the current state, but not very helpful for a Time Series Database (TSDB), which expects the data to be represented as a number of some sort, either an integer or a float. The whole point is to measure information at a point in time.

The Telegraf process then looks like this:

telegraf_process

To accommodate this, the use of the enum processor is put into action. The following is added to the configuration:

[[processors.enum]]
  [[processors.enum.mapping]]
    ## Name of the field to map
    field = "session_state"

    [processors.enum.mapping.value_mappings]
      IDLE = 1
      CONNECT = 2
      ACTIVE = 3
      OPENSENT = 4
      OPENCONFIRM = 5
      ESTABLISHED = 6

Within session_state, any instance of the string IDLE is replaced with the integer 1, which can then be stored long term in the TSDB. The same applies to the rest of the states, with ESTABLISHED stored as the integer 6. Later, in Grafana, the numbers will be mapped back to the state names for display on a graph.
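
After the mapping, an established neighbor shows up in the Prometheus client output along these lines (an illustrative sample using the same label set as the earlier BGP output):

bgp_neighbor_session_state{device="dallas",identifier="BGP",name="default",neighbor_address="10.0.0.1",peer_type="EXTERNAL",role="leaf"} 6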

Telegraf – Rename

The second processor used in this demo is the rename processor, which replaces field names. Below is the configuration used to rename the SNMP counters collected from the SNMP devices so they match the field names produced by gNMI.

[[processors.rename]]
  [[processors.rename.replace]]
    field = "ifHCInOctets"
    dest = "state_counters_in_octets"

  [[processors.rename.replace]]
    field = "ifHCOutOctets"
    dest = "state_counters_out_octets"

This states that the field ifHCInOctets is renamed to state_counters_in_octets, and likewise ifHCOutOctets is renamed to state_counters_out_octets. Once Telegraf has renamed those fields, you can use the data gathered via SNMP and via gNMI in the same queries! A sketch of such a query follows.
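
As a sketch of what that unlocks, and assuming the SNMP input's measurement is also named interface so the full Prometheus metric names line up, a single query can chart throughput for an SNMP-polled device (houston) and a gNMI-streamed device (dallas) side by side:

rate(interface_state_counters_in_octets{device=~"houston|dallas"}[2m]) * 8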

Tagging Data

Tagging data is one of the biggest favors you can do for yourself. Tags give flexibility for future analysis, comparison, and graphing of data points. For instance, if you tag your BGP neighbors with the upstream peer provider, you will be able to easily identify the interfaces that belong to that particular peer. If you have four geographically diverse interfaces, tags let you quickly identify them rather than sorting it out manually later when graphing or alerting.

This brings us to the third Telegraf processor in this post, the regex processor. This processor takes a regex search pattern and performs a replacement. Something new here: if you use the result_key option, a new tag is created rather than replacing the existing value. The regex replacement below adds a new tag, intf_role, with the value server.

  [[processors.regex.tags]]
    key = "name"
    pattern = "^GigabitEthernet0\/0\/0\/2$"
    replacement = "server"
    result_key = "intf_role"

Looking at just this particular replacement in the output, there are now additional tags for graphing, alerting, and general data analysis.

interface_state_admin_status{device="dallas",intf_role="server",name="GigabitEthernet0/0/0/2",role="leaf"} 1
interface_state_counters_in_broadcast_pkts{device="dallas",intf_role="server",name="GigabitEthernet0/0/0/2",role="leaf"} 8

Prometheus

Prometheus Query Language

Throughout the upcoming Grafana section you will see a number of PromQL (Prometheus Query Language) queries. Take a look at the Prometheus.io basics page for full documentation of the queries that are available. These are the queries that Grafana executes to populate the data in the graphs.
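
A couple of illustrative examples built on the metrics gathered above: interface throughput in bits per second, and a count of BGP sessions currently in the ESTABLISHED state (6, per the enum mapping):

rate(interface_state_counters_in_octets{device="dallas"}[5m]) * 8
count(bgp_neighbor_session_state == 6)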

Grafana

Through the next several sections you will see how to build a per-device dashboard using PromQL and variable substitution, among other techniques. The two device dashboards below look different in the number of interfaces and neighbors displayed, but they are born out of the same dashboard configuration.

grafana_device_01
grafana_device_02

Variables

First, you’ll need to set up the device variable seen in the upper left hand corner of the dashboard. When I first started building dashboards, I found this to be one of the most important skills for leveling up your Grafana dashboards, as it allows you to get a significant amount of value while reducing the rework of adding additional devices to a panel.

Variables – Adding to Your Dashboard

To add a dashboard-wide variable, follow these steps:

  • Navigate into your dashboard
  • Click on the gear icon in the upper right hand navigation section
  • Click on Variables
  • Click the green New button on the right hand side
grafana_variables

This image already has a variable added: device.

Once in the new variable screen you will see the following:

grafana_add_variable

Here you will define a PromQL query to build out your device list. In the bottom section of the screen, under the heading Preview of values, you can see a sample of what the query returns for your variable.

The fields that you need to fill in include:

Field          Information Needed
Name           Name of the variable you wish to use
Type           Query
Data source    Prometheus
Refresh        When you would like the variable values to refresh; use the dropdown to select what fits your org best
Query          PromQL to get the data points
Regex          Regex search to reduce the search results

You can experiment with the rest of the fields as you see fit to get your variables defined properly.

Once you have your search pattern set, make sure to click Save on the left hand side of Grafana.

To reference a variable once it is created, put a dollar sign ($) in front of its name and Grafana will expand it within the query. Within the Legend field, Jinja-like double curly braces identify a label to display.
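
As a sketch, a device variable for this environment could be populated with the label_values helper from the Prometheus data source and then referenced in a panel query and legend:

Variable query: label_values(interface_state_counters_in_octets, device)
Panel query:    bgp_neighbor_session_state{device="$device"}
Legend:         {{ device }} {{ neighbor_address }}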

Grafana Plugins

Grafana is extensible via plugins. There are quite a few plugins available, and I encourage you to take a look at the plugin page to search for what you may want to use in your own environment. There are three types of plugins: Panel, Data Source, and App, each extending Grafana’s capabilities.

Grafana Discrete Plugin (BGP State Over Time)

The next panel uses a Grafana plugin, and I’ll look at how we were able to build out the graph. A panel like this can help you identify issues in your environment quickly, just by looking at the dashboard. Take a look at the screenshot below, where a BGP neighbor is down: it is quickly identifiable on the dashboard that action is needed.

grafana_bgp_down

The two panels in the top row use a Grafana plugin called Discrete. It displays data values over time in the colors defined in the panel configuration, and lets you hover over the timeline to see the changes over time. You install the plugin with the grafana-cli command:

grafana-cli plugins install natel-discrete-panel
sudo systemctl restart grafana-server

Once installed, you can set up a new panel with the panel type of Discrete.

The panel is created with the following parameters:

BGP Session Status – Discrete Panel 1

Query: Prometheus data source

Key          Value
Metrics      bgp_neighbor_session_state{device="$device"}
Legend       {{ device }} {{ neighbor_address }}
Min step     (blank)
Resolution   1/1
Format       Time series
Instant      Unchecked

Note that in the Metrics query the variable reference is $device, denoted by the dollar sign in front of the device name. The Legend includes two label references, device and neighbor_address; this is what gets displayed in the discrete panel for each line.

grafana_discrete_page1
grafana_discrete_color_selection
grafana_discrete_value_mappings
Critical Interface State – Discrete Panel 2

Now, because the interfaces have been assigned a label, a discrete panel can be generated to show the interface state along with the role. For this demonstration the panel is named Critical Interfaces; interfaces toward servers or uplinks to other network devices have been labeled server or uplink accordingly. By querying for any role, we can get this information into the panel. The legend value {{device}} {{name}} > {{intf_role}} > {{neighbor}} provides the appropriate mappings to be shown. This is the resulting panel:

grafana_critical_intf_state

The following discrete panel settings produce this panel. The panel build is a little smaller, but it packs a lot of information into a single panel!

grafana_intf_state_pg1
grafana_inft_state_txt_color1
grafana_intf_state_mappings

Device Dashboards vs Environment Dashboards

This is not a pick one over the other segment, rather this is saying that both should be present in your Grafana Dashboard setup.

In this post I have shown a lot of device-specific panels. The value here is that you can get to a device-specific view very quickly, without having to create a separate page for each and every device in your environment, because the panels are expanded through the use of variables to identify individual devices.

You should also look at using environment dashboards, where you pull together the specific pieces of information that match your need. Need to know what application performance looks like across Network, Server, Storage, and Application layers? You can build out these dashboards by hand, but that takes longer. As you leverage tags when gathering telemetry into your TSDB, you will be on your way to building dashboards in an automated fashion and getting the big picture very quickly.


Conclusion

Hopefully this has been helpful. Again, check out the first post in the series if you need more information on these tools generally. In the next post, I will cover how to advance your Prometheus environment by monitoring remote sites, and I’ll discuss a couple of methodologies to enable alerting within the environment.

The next post will include how to alert using this technology stack.

-Josh


