Introduction to a Telemetry Stack – Part 3


This is the third part of the telemetry stack introduction. In the first part, we discussed the stack big picture and how we collect data using Telegraf and network plugins such as SNMP and gNMI. In the second part, we addressed data normalization and enrichment. In this third part, we will get into alerting and observing the network.

Alerting is an art and a science. It is a science because it can be deterministic, based on profiling data, and subjected to strong statistical analysis. It is an art because it needs to be based on strong context, subject matter expertise, and sometimes, intuition. Alerting shows up in almost every area of computing, such as information security, performance engineering, and of course, networking. There is a multiplicity of tools for generating alerts based on AI, machine learning, and other hot technologies. But what makes a good alert? The answer: triggering on symptoms, not causes; simplicity; visualizations that point toward the root cause; and actionability.

In this blog, we analyze the architecture of alerting systems, focus on how to generate meaningful alerts with Alertmanager, and show how to create clean visualizations that help us spot problems before alerts even fire. We start with basic definitions and move to the details of implementing alerts using the Telegraf, Prometheus, Grafana, Alertmanager (TPGA) stack.

Prerequisites

This blog is part of a series. You can read this independently of the series if you are familiar with the Telemetry stack TPG (Telegraf, Prometheus, Grafana) and the basics of collecting telemetry with modern techniques, such as streaming. However, you can start your journey from the beginning with Introduction to a Telemetry Stack – Part 1 and then Introduction to a Telemetry Stack – Part 2, which covers normalization and enrichment.

What Is an Alert?

An alert according to Merriam-Webster dictionary is: “an alarm or other signal of danger” or “an urgent notice.” That is exactly why an alert for a computing system has to be a meaningful signal of urgency and not constant white noise that is often ignored.

In computing, alerts are used to offer awareness of issues in a timely manner. Alerts may notify about the interruption of a service, an intrusion, or a violated baseline performance threshold. They are usually part of a monitoring system and can be paired with an automated action to reduce or eliminate the event that caused the alert.

Types of Alerts

There are two types of alerts:

  • Scheduled: Scheduled alerts occur at specific time periods. An example may be an alert for weekly maintenance of system patching.
  • Real-time: Real-time alerts are triggered by events. Events occur randomly, and therefore continuous monitoring is required to capture these.

Alert Triggers

The triggering events that generate alerts can be grouped in the following categories:

  • Status: This is a binary on/off trigger that indicates the status of a system. Context matters with binary triggers: it determines whether these alerts should page a human or kick off automation.
  • Threshold: These are continuous metrics that are based on the profile of normal operation. They are instantaneous violations of a continuous spectrum of values, e.g., CPU utilization passed the threshold of 80%. Again, context matters here. Is this normal for the device or exceptional? Profiling helps define what normal operation is.
  • Aggregation: This trigger is similar to threshold; however, in this case values are aggregated over a sliding time window. This can be a double-edged sword. On one hand, these triggers may offer a more complete picture by aggregating metrics for an alert. On the other hand, sliding windows overlap, and this may cause unnecessary alerts. The example queries after this list illustrate the difference between a threshold and an aggregation trigger.
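
As a hedged sketch of the difference between the last two trigger types, here is how they might look as PromQL expressions; the cpu_utilization metric name and the 80% threshold are assumptions made only for illustration:

# threshold trigger: fires the moment the instantaneous value crosses 80%
cpu_utilization > 80

# aggregation trigger: fires only when the 5-minute average crosses 80%
avg_over_time(cpu_utilization[5m]) > 80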

How Does an Alerting System Work?

The figure below depicts how an alerting system works. The alert engine is the heart of the system and it takes three inputs: user-defined alert rules, database data related to events that can trigger the alerts, and silencing rules that are used to avoid unnecessary alerts. The output of the alert engine is a notification that is sent to outside systems, such as ChatOps, email, or incident management tools.

Alerting System Work

What Are the Metrics for a Good Alert?

Objective metrics are used to measure whether an alert adds value and is in turn actionable. These metrics are sensitivity and specificity. We define sensitivity as "How many of the events that actually matter are caught by our alerts?" and measure it with the formula: True_Positives / (True_Positives + False_Negatives). Intuitively, if sensitivity is high, our alert rarely misses a real incident. We define specificity as True_Negatives / (True_Negatives + False_Positives). Intuitively, this means our alerts stay silent on the events that do not matter instead of "crying wolf", so the people being paged can trust that a page is worth waking up for. In the figure below, the first half of the square corresponds to sensitivity and the second part to specificity.

Metrics for a good alert
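
As a hypothetical worked example (the counts are invented for illustration): suppose that over a month an interface-down alert fired correctly for 8 real outages, missed 2 outages, fired 4 times with no real outage behind it, and stayed silent through 86 healthy evaluation periods. Then:

Sensitivity = TP / (TP + FN) = 8 / (8 + 2)  = 0.80
Specificity = TN / (TN + FP) = 86 / (86 + 4) ≈ 0.96

Both numbers are high, so this alert catches real problems and rarely cries wolf.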

Implementing alerts with Alertmanager

In this section, we will review the TPGA stack used for alerting, then analyze the Alertmanager architecture, and finally we will demonstrate with examples how it can be used for alerting.

TPGA observability stack

We use the TPGA stack as seen in the figure below. We deploy two Telegraf agents to collect the relevant data for our stack. This choice is common in network topologies: a lightweight agent is dedicated to each device being monitored. In our case, each agent monitors an Arista cEOS router. The Telegraf gNMI plugin gathers interface operational status, and the execd plugin captures BGP status. If you are not familiar with these plugin configurations, you can read the first part of the telemetry series. Prometheus is the Time Series Database (TSDB) of choice for its synergy with Alertmanager. Finally, Grafana is the visualization tool we have selected, since it specializes in depicting time series.

observability

What is Alertmanager?

Alertmanager is a meta-monitoring tool that turns the alerts Prometheus raises from its TSDB data into notifications. Note that Alertmanager runs as a separate instance from Prometheus, for good reason. First, multiple Prometheus instances can send alerts to a single Alertmanager instance, centralizing events and avoiding excessive notifications, i.e., noise. Second, decoupling Alertmanager keeps the design and functionality modular.

The Alertmanager has three main functions:

  • Grouping: Grouping is one of its most attractive features, since it reduces noise by bundling related alerts into a single notification.
  • Inhibition: Inhibition also reduces noise, by suppressing notifications for alerts that are implied by another alert that is already firing.
  • Silences: Finally, silences mute alerts that match given criteria for a defined time window; they are typically created at runtime through the Alertmanager UI or amtool rather than in the configuration file. A sketch of how grouping and inhibition appear in an Alertmanager configuration follows this list.
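
The snippet below is a minimal, hypothetical sketch of how grouping and inhibition are expressed in alertmanager.yml; the receiver name, label values, and timers are assumptions, not part of this lab's configuration:

# illustrative only: receiver name, labels, and timers are assumptions
route:
  receiver: network_oncall
  # grouping: bundle alerts sharing these labels into one notification
  group_by:
  - alertname
  - device
  group_wait: 30s
  repeat_interval: 4h

inhibit_rules:
# inhibition: a firing DeviceDown alert suppresses InterfaceDown alerts for the same device
- source_match:
    alertname: DeviceDown
  target_match:
    alertname: InterfaceDown
  equal:
  - device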

Alertmanager has two main parts in its architecture: the router and the receiver. An alert passes through a routing tree, i.e., a set of hierarchically organized rules, and is then distributed to the corresponding receiver.

How to Configure Alertmanager and Prometheus?

First, we need to edit the configuration.yml file that has the basic configuration of Prometheus and add the following:

---
# other config

rule_files:
  - rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - alertmanager-01:9093

The rule files are key to alerting, since this is where we place the alert rules in YAML syntax. In addition, the Alertmanager instance is identified by its name, in our case alertmanager-01, and the port 9093 where it listens. We can define a list of Alertmanager instances and rule file locations.

Then the Alertmanager’s routes and receivers need to be configured in the alertmanager.yml configuration file:

---
global:
  resolve_timeout: 30m

route:
  receiver: empty_webhook
  routes:
  - group_by:
    - alertname
    match:
      source: testing
    receiver: empty_webhook


receivers:
- name: empty_webhook
  webhook_configs:
  - send_resolved: true
    url: http://localhost:9999

Note that we have added an empty route because, for now, our alert is not going to notify another system, such as a chat client or incident response tool. In the last part of the telemetry series, you will see how to configure the receivers and generate notifications.
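
As a taste of what a non-empty receiver could look like, here is a hedged sketch of a Slack receiver; the receiver name, channel, and webhook URL are placeholders rather than values from this setup:

receivers:
- name: chatops_slack
  slack_configs:
  - channel: '#network-alerts'
    api_url: https://hooks.slack.com/services/REPLACE/ME/TOKEN
    send_resolved: true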

Alert Use Case 1: Interface Down

First, we will show the Grafana visualization tools we can use to alert an operator that an interface is down. I have chosen two specific types of graphs in this case: the first is a table that indicates the status of interfaces, and the second is a state timeline of the status of all interfaces that belong to a device. These graphs are, in themselves, a good way of alerting an operator. However, we want notifications and, eventually, actions; that is why we need Alertmanager.

Interface Down

To configure the alert, we add the following rule in rules/device_rules.yml; based on the Prometheus configuration above, this rule file is loaded by Prometheus, and any alerts it raises are forwarded to the Alertmanager instance:

groups:
  - name: Interface Down
    rules:
      - alert: InterfaceDown
        expr: interface_oper_status{} == 2
        for: 1m
        labels:
          severity: critical
          source: stack
          environment: Production
        annotations:
          summary: "Interface is down"
          description: "Interface for host <{{ $labels.instance }}> is down!"

This alert fires when the Prometheus metric interface_oper_status is found to be in the down state, which is equal to 2. Note that, because of the for: 1m keyword, the condition must remain true for one minute before the alert fires. We can specify different labels for additional meta information and add a meaningful message in the description. Below you can see a short demo of how the alert fires.

Interface Down
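
Before committing an expression to a rule file, it can help to test it in the Prometheus expression browser first. Two hypothetical sanity checks (the output and label values are illustrative):

# latest operational status of every interface being scraped
interface_oper_status

# only the interfaces that are currently down (2)
interface_oper_status == 2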

Alert Use Case 2: BGP Neighbor Unreachable

Again, a picture is worth a thousand words. In our case, the Grafana graphs offer color-coded information about the BGP state. The state values can be found in the list below:

IDLE = 1
CONNECT = 2
ACTIVE = 3
OPENSENT = 4
OPENCONFIRM = 5
ESTABLISHED = 6

BGP Neighbor Unreachable

The configuration for this alert can also be placed in rules/device_rules.yml:

groups:
  - name: BGP Neighbor Down
    rules:
      - alert: BGPNeighborDown
        expr: bgp_session_state{device="ceos-01"} == 1
        for: 1m
        labels:
          severity: warning
          source: stack
          environment: Production
        annotations:
          summary: "BGP Neighbor is down"
          description: "BGP Neighbor for host <{{ $labels.instance }}> is down!"

The differences in this alert are the severity label and the Prometheus query: as you can see, we are only interested in the neighbors of the ceos-01 device. For more information about PromQL queries and syntax, you can reference one of my older blogs, Introduction to PromQL.

BGP Neighbor Unreachable
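
One hedged variation worth considering: because a broken session can also sit in CONNECT or ACTIVE, an alternative expression could fire on any state other than ESTABLISHED instead of matching only IDLE:

# fire on any BGP session state other than ESTABLISHED (6)
bgp_session_state{device="ceos-01"} != 6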

Recap & Announcement

We have reviewed the basics of alerting systems and how to configure Prometheus and Alertmanager. If you enjoyed this series of telemetry blogs, this is not the end! There is one more upcoming blog about advanced alerting techniques.


Conclusion

We have some exciting news for you as well. If you want to learn how to set up your own telemetry stack and scale it in production-grade environments with NTC automation experts, check out the NEW telemetry deep-dive course by NTC training.

-Xenia





New Book – Open Source Network Management


Earlier this month, I was able to hit the publish button on a new book: Open Source Network Management. The book dives into getting started with several open source network management tools. It is meant as a guide to help further your experience with installing and using open source tools, all on a single VM/host. The host is intentionally small, requiring minimal capital investment: a single NUC or a small VM deployed on a hypervisor in your environment.

The book is published on LeanPub, which is a publish-early, publish-often marketplace. The book is digital only, with PDF, ePub, and mobi formats available. Currently, the book is listed at 80% complete, with most of the technical content already in place! Mainly soft edits remain in this early version.

Projects

Several open source projects are covered in the book, starting with installing Docker Community Edition (CE) and then adding Docker Compose files to handle installation of the tools. Once Docker Compose brings the stack up, the book walks through a basic configuration to get each project up and running (a sketch of such a Compose file follows the project list below), including:

Current Projects Included in the Book:

  • Nautobot (Source of Truth)
  • Hashicorp Vault (Secrets Management)
  • Telegraf (Metrics Gathering)
  • Prometheus (Metrics Storage and Alerting)
  • Grafana (Metrics Visualization)
  • NGINX (Web Server/Reverse Proxy)

With these components in place, a modern network management stack can be assembled with minimal investment.
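
To give a feel for the approach, here is a minimal, illustrative Docker Compose sketch for two of the projects; the image tags, ports, and volume paths are assumptions, not the book's exact configuration:

version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
    - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
    - "3000:3000"
    depends_on:
    - prometheus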

Projects Selection

These lightweight projects can run on a single host in order to get up and running. Yet, even though they are lightweight, they are all able to scale out to meet the needs of Enterprise and Large Enterprise environments.

Planned Additions

Upcoming additions to the book include installing a Git application such as Gitea, and adding more Nautobot apps such as the Golden Configuration app (which requires a Git repo for configuration backup) and Welcome Wizard.

As time allows, more additions and tools will be added, such as alternative metrics-gathering solutions and other configuration backup solutions.

Opportunity to Get Your Own for Free

As part of the NTC desire to give back to the community, there is an opportunity to get your own copy of the book for free. To do so, join the Network to Code mailing list and select the Get my free copy! button.

There will be a limited quantity available.

There may be some delay in the code being sent to you.


Conclusion

Hopefully, the content in the book is helpful! I enjoyed putting it together!

-Josh




Introduction to PromQL


Time series databases and their query languages are increasingly popular tools for Network Automation Engineers. However, these tools are sometimes overlooked by network operators in favor of more “pressing” day-to-day workflow automation. Time series databases offer valuable network telemetry that reveals important insights for network operations, such as security breaches, network outages, and slowdowns that degrade the user experience.

In this post, we will review the Prometheus Query Language (PromQL) to demonstrate the value and capabilities of processing time series. This review will offer use cases of PromQL for network engineers and data scientists.

What is Prometheus?

Prometheus is an open source systems monitoring and alerting toolkit. As you can see in the figure below, the heart of Prometheus includes a Time Series Database (TSDB) and the PromQL engine. Exporters run locally on monitored hosts and export local metrics related to device health, such as CPU and memory utilization, and to services, such as HTTP. The alert mechanism implemented with Prometheus triggers alerts based on events and predefined thresholds. Prometheus has a web UI that we will be using in the examples of this post. In addition, Prometheus measurements can be visualized using Grafana dashboards.

prometheus

Source: Prometheus Overview

What is a TSDB?

In simple words, it is a database that stores time series. Then, what is a time series? It is a set of timestamps and their corresponding data. A TSDB is optimized to store time series data efficiently, measure changes, and perform calculations over time. PromQL is the language that was built to retrieve data from the Prometheus TSDB. In networking, this could mean tracking the state of an interface or bandwidth utilization over time.

Why PromQL?

There are several other TSDBs; one of the best known is InfluxDB. Both the Prometheus TSDB and InfluxDB are excellent tools for telemetry and time series data manipulation. PromQL’s popularity has been growing fast because it is a comprehensive language for consuming time series data. Multiple other solutions are starting to support PromQL, such as New Relic, which recently added support for PromQL, and Timescale with Promscale.

Now that we have all the prerequisite knowledge, we can dive deep into the PromQL data model and dissect language queries.

Prometheus Data Model

The first part of the Prometheus data model is the metric name. A metric name is uniquely identified, and it indicates what is being measured. A metric is a dimension of a specific feature. Labels are the second part of the data model. A label is a key-value pair that differentiates sub-dimensions in a metric.

Think of a metric, e.g., interface_in_octets, as an object with multiple characteristics, e.g., device_role. As you can see in the figure below, each label can pick a value for this characteristic, i.e., device_role="leaf". The combination of a metric and labels returns a time series identifier, i.e., a list of tuples that provide the (timestamp, value) of the object with the specific characteristic. The timestamps are given in Unix time with millisecond precision, and the values that correspond to them are floating point numbers.

As a Network Automation Engineer, you can think of many examples of metrics, such as interface_speed, bgp_hold_time, packets_dropped, etc. All these metrics can be characterized by a variety of labels, such as device_platform, host, instance, interface_name, etc.

prometheus-data-model
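
As a hypothetical illustration, a single combination of metric name and label values identifies one time series, i.e., one list of (timestamp, value) tuples; the numbers below are invented:

interface_in_octets{device="ceos-01", device_role="leaf", interface="Ethernet1"}
  (1637712000000, 8345523)
  (1637712030000, 8349912)
  (1637712060000, 8354301)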

With that data model in mind, let us next dissect a query in PromQL.

The anatomy of a query

The simplest form of a PromQL query may include just a metric name. This query returns multiple single-value vectors, as you can see below. All the applicable labels, and the value combinations those labels can take, are returned as a result of this simple query.

query-metric
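
For example, querying nothing but an assumed metric name such as interface_in_octets returns the latest value for every label combination; the output below is illustrative, not captured from a real system:

interface_in_octets

interface_in_octets{device="ceos-01", device_role="leaf", interface="Ethernet1"}   8354301
interface_in_octets{device="ceos-02", device_role="spine", interface="Ethernet1"}  6120988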

Metrics

What kind of metrics does PromQL support? There are four kinds of metrics:

  1. Counters: these are metrics that can only increase, for example: interface counters, API call counters, etc.
  2. Gauges: the values of these metrics can go up and down, for example: bandwidth, latency, packets dropped, etc. Gauges and counters are useful for network engineers because they measure features a system already exposes.
  3. Summaries: this metric is useful to data scientists and to applications that include data analytics. To use this metric, you need to have control of what you measure and be able to drill into additional details. A summary metric aggregates thousands of events into one metric. Specifically, it counts observations and sums all the observed values; it can also calculate quantiles of these values. If you have an application that is being monitored, you can use summaries for API request durations.
  4. Histograms: this is another metric that is more useful to a data scientist than a network engineer. Histogram metrics can be defined as summaries that are “bucketized”. Specifically, they count observations and place them in configurable buckets. A histogram can be used to measure response sizes on an application (an example query follows this list).
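
As a hedged example of how a histogram is typically consumed, the histogram_quantile() function computes a quantile from the bucket counters; the metric name below mirrors the demo metrics used later in this post but is an assumption:

# approximate 95th-percentile API request duration over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(demo_api_request_duration_seconds_bucket{job="demo"}[5m])))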

Label Filtering

Now that we know what kinds of metrics we can include in our query, let us review how we can filter the query to retrieve more specific and meaningful results. This can be done with label filtering that includes the following operations:

# equal, returns interface speed for device with name jcy-bb-01
interface_speed{device="jcy-bb-01.infra.ntc.com"}
# not equal, returns the opposite of the above query
interface_speed{device!="jcy-bb-01.infra.ntc.com"}
# regex-match, matches interface Ethernet{1, 2, 3, 4, 5, 6, 7}
interface_speed{interface=~"Ethernet1/[1-7]"}
# not regex-match, returns the opposite of the above query
interface_speed{interface!~"Ethernet1/[1-7]"}

Not only can you use the equal and not equal signs to filter your queries, but you can filter using regular expressions. To learn more about regular expressions for network engineers, check our previous blog.

Functions

One of my favorite parts of PromQL is its functions, which can manipulate the time series identifiers. Below, I include an example of the function rate(), which is useful for network metrics, and the function predict_linear(), which is useful if you perform data analytics.

How fast does a counter change?

The function rate() can be used with counter metrics to show how fast a counter increases. Specifically, it calculates the per-second increase over a time period. This is a useful function for the network engineer, since counters are a common metric in networks. For example, packet counters and interface octet counters are everywhere, and the rate() function offers useful insight into how quickly these counters increase.

#per second increase of counter averaged over 5 mins
rate(interface_in_octets{device_role="leaf"}[5m])

The next figure will help you understand the details of how the rate() function is calculated. The interval Δt indicates the time interval over which we want to calculate the rate. The X marks indicate the per-second samples that are used to calculate multiple rates per second. The rate() function averages these calculations over the interval Δt. If the counter is reset to 0, the rate() function will extrapolate the sample, as can be seen with the blue X marks.

rate

Instant vs. Range Vectors

You have probably noticed that the rate() example above uses a different type of syntax. Specifically, it selects the time series over an interval; in the example above, the interval is 5 minutes ([5m]). This results in a range vector, where the time series identifier returns the values for a given period, in this case 5 minutes. On the other hand, an instant vector returns a single value, specifically the latest value of a time series. The figures below show the differences between the results of an instant vector and a range vector.

#instant vector
interface_speed{device="jcy-bb-01.infra.ntc.com"}

query-instance-vector

#range vector
interface_speed{device="jcy-bb-01.infra.ntc.com"}[5m]

query-range-vector

In the first figure, only one value per vector is returned, whereas in the second, multiple values spanning the 5-minute range are returned for each vector. The format of these values is value@timestamp.

Offsets

You may be wondering: all of this is great, but where is the “time” in my “time series”? The offset modifier shifts a query’s evaluation time into the past, letting us retrieve data from an earlier point in time. For example:

# 5-minute rate of interface_in_octets as it was 24 hours ago
rate(interface_in_octets{device="jcy-bb-01.infra.ntc.com"}[5m] offset 24h)

Here we combine the function rate(), which calculates the per-second increase of the interface_in_octets counter over five-minute windows, with offset, which shifts the evaluation 24 hours into the past.
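
A common, hypothetical use of offset is comparing current behavior with the same time yesterday, which becomes a simple ratio of two rates:

# how does the current 5-minute rate compare with the same rate 24 hours ago?
rate(interface_in_octets{device="jcy-bb-01.infra.ntc.com"}[5m])
  / rate(interface_in_octets{device="jcy-bb-01.infra.ntc.com"}[5m] offset 24h)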

Can I predict the next 24 hours?

Of course! PromQL provides the function predict_linear(), a simple machine-learning-style model that predicts the value of a gauge a given amount of time in the future by using linear regression. This function is of more interest to a data scientist who wants to create forecasting models. For example, if you want to predict disk usage in bytes one hour from now based on historical data, you would use the following query:

#predict disk usage bytes in an hour, using the last 15 mins of data
predict_linear(demo_disk_usage_bytes{job="demo"}[15m], 3600)

Linear regression fits a linear function to a set of data points. This is achieved by searching over the possible values of the variables a and b that define the linear function f(x) = ax + b. The line that minimizes the sum of squared differences between the observed values and the line’s predictions is the result of the linear regression model, as you can see in the image below:

linear-regression
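
In practice, predict_linear() often lives inside an alert expression. A hedged sketch, assuming a hypothetical demo_disk_free_bytes gauge, would fire when the linear trend says the disk will fill up within four hours:

# fire if, at the current trend, free disk space reaches zero within 4 hours
predict_linear(demo_disk_free_bytes{job="demo"}[1h], 4 * 3600) < 0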

Aggregation

PromQL queries can be highly dimensional. This means that one query can return a set of time series identifiers for all the combinations of labels, as you can see below:

#multi-dimensional query
rate(demo_api_request_duration_seconds_count{job="demo"}[5m])
multi-dimensional

What if you want to reduce the dimensions to a more meaningful result, for example the overall API request rate summed across all label combinations? This single-dimension query is the result of adding multiple instant vectors together:

#one-dimensional query, add instance vectors
sum(rate(demo_api_request_duration_seconds_count{job="demo"}[5m]))
one-dimensional

You may choose to keep specific dimensions by listing their labels in a by() clause. In the example below, we group by instance, path, and job and sum away the remaining labels. Note the reduction in the number of vectors returned:

# multi-dimensional query - by()
sum by(instance, path, job) (rate(demo_api_request_duration_seconds_count{job="demo"}[5m]))
sum-by

We can perform the same aggregation by excluding labels with the without() clause:

sum without(method, status) (rate(demo_api_request_duration_seconds_count{job="demo"}[5m]))

This results in the same set of instant vectors:

sum-without

Additional aggregation over dimensions can be done with the following functions (a short example follows the list):

  • min(): selects the minimum of all values within an aggregated group.
  • max(): selects the maximum of all values within an aggregated group.
  • avg(): calculates the average (arithmetic mean) of all values within an aggregated group.
  • stddev(): calculates the standard deviation of all values within an aggregated group.
  • stdvar(): calculates the standard variance of all values within an aggregated group.
  • count(): calculates the total number of series within an aggregated group.
  • count_values(): counts the number of elements with the same sample value.
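
As a brief hypothetical example combining one of these functions with by():

# highest per-instance request rate, keeping only the instance label
max by (instance) (rate(demo_api_request_duration_seconds_count{job="demo"}[5m]))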



Conclusion

Thank you for taking this journey with me, learning about the time series query language PromQL. There are many more features in this language, such as arithmetic operators, sorting, and set functions. I hope this post has given you the opportunity to understand the basics of PromQL, see the value of telemetry and TSDBs, and that it has increased your curiosity to learn more.

-Xenia


