Introduction to a Telemetry Stack – Part 3


This is the third part of the telemetry stack introduction. In the first part, we discussed the stack big picture and how we collect data using Telegraf and network plugins such as SNMP and gNMI. In the second part, we addressed data normalization and enrichment. In this third part, we will get into alerting and observing the network.

Alerting is an art and a science. It is a science because it can be deterministic, based on profiling data, and subjected to strong statistical analysis. It is an art because it needs to be grounded in context, subject matter expertise, and sometimes intuition. Alerting is encountered in almost every area of computing, such as information security, performance engineering, and, of course, networking. There is a multiplicity of tools for generating alerts based on AI, machine learning, and other hot technologies. But what makes a good alert? The answer: triggering on symptoms rather than causes, simplicity, visualization that can point to root cause, and actionability.

In this blog, we analyze the architecture of alerting systems, focus on how to generate meaningful alerts with Alertmanager, and show how to create clean visualizations that help us spot problems before alerts even fire. We start with basic definitions and move to the details of implementing alerts using the Telegraf, Prometheus, Grafana, Alertmanager (TPGA) stack.

Prerequisites

This blog is part of a series. You can read this independently of the series if you are familiar with the Telemetry stack TPG (Telegraf, Prometheus, Grafana) and the basics of collecting telemetry with modern techniques, such as streaming. However, you can start your journey from the beginning with Introduction to a Telemetry Stack – Part 1 and then Introduction to a Telemetry Stack – Part 2, which covers normalization and enrichment.

What Is an Alert?

An alert according to Merriam-Webster dictionary is: “an alarm or other signal of danger” or “an urgent notice.” That is exactly why an alert for a computing system has to be a meaningful signal of urgency and not constant white noise that is often ignored.

In computing, alerts are used to offer awareness of issues in a timely manner. Alerts may notify about the interruption of a service, an intrusion, or a violated baseline performance threshold. They are usually part of a monitoring system and can be paired with an automated action to reduce or eliminate the event that caused the alert.

Types of Alerts

There are two types of alerts:

  • Scheduled: Scheduled alerts occur at specific time periods. An example is a weekly alert for system-patching maintenance.
  • Real-time: Real-time alerts are triggered by events. Events occur randomly, and therefore continuous monitoring is required to capture them.

Alert Triggers

The triggering events that generate alerts can be grouped into the following categories:

  • Status: This is a binary on/off trigger that indicates the status of a system. Context matters with binary triggers: should the alert page a human, or kick off automation?
  • Threshold: These are continuous metrics compared against a profile of normal operation. They are instantaneous violations of a continuous spectrum of values, e.g., CPU passed the threshold of 80%. Again, context matters here. Is this normal for the device or exceptional? Profiling helps define what normal operation is.
  • Aggregation: This trigger is similar to threshold; however, in this case values are aggregated over a sliding time window. This can be a double-edged sword. On one hand, these triggers may offer a more complete picture by aggregating metrics for an alert. On the other hand, sliding windows overlap, and this may cause unnecessary alerts. The sketch after this list contrasts the two trigger styles.
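
To make the contrast concrete, here is a minimal Python sketch; the CPU samples and the 80% limit are hypothetical, not profiling data from a real device:

cpu_samples = [35, 42, 85, 40, 90, 88, 86, 91, 38, 41]  # hypothetical CPU %
THRESHOLD = 80

# Threshold trigger: fires on any single sample above the limit.
threshold_hits = [i for i, v in enumerate(cpu_samples) if v > THRESHOLD]

# Aggregation trigger: fires only when the average over a sliding
# window of 3 samples exceeds the limit, smoothing out lone spikes.
WINDOW = 3
aggregation_hits = [
    i for i in range(len(cpu_samples) - WINDOW + 1)
    if sum(cpu_samples[i:i + WINDOW]) / WINDOW > THRESHOLD
]

print(threshold_hits)    # [2, 4, 5, 6, 7] -- every spike fires
print(aggregation_hits)  # [4, 5] -- only the sustained run fires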

How Does an Alerting System Work?

The figure below depicts how an alerting system works. The alert engine is the heart of the system and it takes three inputs: user-defined alert rules, database data related to events that can trigger the alerts, and silencing rules that are used to avoid unnecessary alerts. The output of the alert engine is a notification that is sent to outside systems, such as ChatOps, email, or incident management tools.

Alerting System Work

Metrics for a Good Alert

Objective metrics are used to measure whether an alert adds value and is, in turn, actionable. These metrics are sensitivity and specificity. We define sensitivity as “How many of the relevant events are caught by our alerts?” and measure it using the following formula: True_Positives / (True_Positives + False_Negatives). Intuitively, if sensitivity is high, few real incidents slip by unreported. We define specificity as True_Negatives / (True_Negatives + False_Positives). Intuitively, high specificity means our alerts stay silent for the events that add no value: the fewer “crying wolf” pages relative to real ones, the better off we are, with our pagees actually waking up to take care of business when paged. In the figure below, the first half of the square calculates sensitivity and the second part specificity.

Metrics for a good alert
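
As a small illustration (the alert counts below are made up, not measurements), both metrics can be computed directly from a labeled alert history:

def sensitivity(tp, fn):
    """Fraction of real events our alerts actually caught: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-events we correctly stayed silent on: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical week of alert history: 40 real incidents caught, 10 missed,
# 900 quiet periods correctly ignored, 50 false pages.
print(f"sensitivity = {sensitivity(40, 10):.2f}")   # 0.80
print(f"specificity = {specificity(900, 50):.2f}")  # 0.95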

Implementing alerts with Alertmanager

In this section, we will review the TPGA stack used for alerting, then analyze the Alertmanager architecture, and finally we will demonstrate with examples how it can be used for alerting.

TPGA observability stack

We use the TPGA stack as seen in the figure below. We deploy two instances of the Telegraf agent to collect the relevant data for our stack. This choice is common in network topologies: a lightweight agent is dedicated to each device being monitored. In our case, each agent is monitoring an Arista cEOS router. The Telegraf gNMI plugin is used to gather interface operating status, and the execd plugin is used to capture BGP status. If you are not familiar with these plugin configurations, you can read the first part of the telemetry series. Prometheus is the Time Series Database (TSDB) of choice for its synergy with Alertmanager. Finally, Grafana is the visualization tool that we have selected, since it specializes in time series depiction.

observability

What is Alertmanager?

Alertmanager is a meta-monitoring tool that turns events from the Prometheus TSDB into alerts. Note that Alertmanager runs as a separate instance from Prometheus, for good reason. First, scalability: multiple Prometheus instances can feed a single Alertmanager instance, centralizing events and avoiding excessive notifications, i.e., noise. Second, decoupling Alertmanager keeps the design modular in both structure and functionality.

The Alertmanager has three main functions:

  • Grouping: Grouping is one of its most attractive features, since it reduces noise by combining multiple related alarms and bundling them into one notification.
  • Inhibition: This function also aims at reducing noise, by suppressing dependent alarms once a related alarm has already fired (see the sketch after this list).
  • Silences: Finally, silences mute alarms for a defined time window, preventing repeated notifications.
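
As an illustration of inhibition, a rule in alertmanager.yml might look like the following sketch; it assumes the severity and instance labels used later in this blog and is not part of our stack's configuration:

# A hedged sketch: while a critical alert fires for an instance,
# mute warning-level alerts that carry the same instance label.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal:
      - instance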

Alertmanager has two main parts in its architecture: the router and the receiver. An alert passes through a routing tree, i.e., a set of hierarchically organized rules, and is then distributed to the corresponding receiver.

How to Configure Alertmanager and Prometheus?

First, we need to edit the configuration.yml file that has the basic configuration of Prometheus and add the following:

---
# other config

rule_files:
  - rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - alertmanager-01:9093

The rule files are key to alerting, since this is where we place the alert rules in YAML syntax. In addition, the Alertmanager instance is identified by its name, in our case alertmanager-01, and the port 9093 where it listens. We can have a list of Alertmanager instances and rule locations, as sketched below.
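
For example, a configuration with several rule locations and redundant Alertmanager instances could look like this sketch (the second directory and the second hostname are assumptions for illustration, not part of our lab):

rule_files:
  - rules/*.yml
  - more_rules/*.yml  # hypothetical additional rule location

alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - alertmanager-01:9093
        - alertmanager-02:9093  # hypothetical second instance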

Then the Alertmanager’s routes and receivers need to be configured in the alertmanager.yml configuration file:

---
global:
  resolve_timeout: 30m

route:
  receiver: empty_webhook
  routes:
  - group_by:
    - alertname
    match:
      source: testing
    receiver: empty_webhook


receivers:
- name: empty_webhook
  webhook_configs:
  - send_resolved: true
    url: http://localhost:9999

Note that we have added an empty route because, for now, our alert is not going to notify another system, such as a chat client or incident response tool. In the last part of the telemetry series, you will see how to configure receivers and generate notifications.

Alert Use Case 1: Interface Down

First, we will look at the Grafana visualizations that can alert an operator that an interface is down. I have chosen two specific types of graphs in this case: the first is a table that indicates the status of interfaces; the second is a state timeline of the status of all interfaces that belong to a device. These graphs are in themselves a good way of alerting an operator. However, we want notifications and, eventually, actions; that is why we need Alertmanager.

Interface Down

To configure alerting, we add the following rule to rules/device_rules.yml; based on the Prometheus configuration above, Prometheus loads this rule file and forwards any firing alerts to its Alertmanager instance:

<span role="button" tabindex="0" data-code="groups: – name: Interface Down rules: – alert: InterfaceDown expr: interface_oper_status{} == 2 for: 1m labels: severity: critical source: stack environment: Production annotations: summary: "Interface is down" description: "Interface for host
groups:
  - name: Interface Down
    rules:
      - alert: InterfaceDown
        expr: interface_oper_status{} == 2
        for: 1m
        labels:
          severity: critical
          source: stack
          environment: Production
        annotations:
          summary: "Interface is down"
          description: "Interface for host <{{ $labels.instance }}> is down!"

This alert fires when the Prometheus query of the metric interface_oper_status returns the value 2, which encodes an operational state of down. Note that the for: 1m keyword means the condition must hold for one minute before the alert fires, which filters out short-lived flaps. We can specify different labels for additional meta information and add a meaningful message in the description. Below you can see a short demo of how the alert fires.

Interface Down

Alert Use Case 2: BGP Neighbor Unreachable

Again, a picture is worth a thousand words. In our case, the Grafana graphs offer color-coded information about the BGP session state. The state encodings can be found in the list below:

IDLE = 1
CONNECT = 2
ACTIVE = 3
OPENSENT = 4
OPENCONFIRM = 5
ESTABLISHED = 6
BGP Neighbor Unreachable

The configuration for this alert can also be placed in: rules/device_rules.yml.

<span role="button" tabindex="0" data-code="groups: – name: BGP Neighbor Down rules: – alert: BGPNeighborDown expr: bgp_session_state{device="ceos-01"} == 1 for: 1m labels: severity: warning source: stack environment: Production annotations: summary: "BGP Neighbor is down" description: "BGP Neighbor for host
groups:
  - name: BGP Neighbor Down
    rules:
      - alert: BGPNeighborDown
        expr: bgp_session_state{device="ceos-01"} == 1
        for: 1m
        labels:
          severity: warning
          source: stack
          environment: Production
        annotations:
          summary: "BGP Neighbor is down"
          description: "BGP Neighbor for host <{{ $labels.instance }}> is down!"

This alert differs in its severity label, and, as you can see from the Prometheus query, we are only interested in the neighbors of the ceos-01 device. For more information about PromQL queries and syntax, you can reference one of my older blogs, Introduction to PromQL.

BGP Neighbor Unreachable

Recap & Announcement

We have reviewed the basics of alerting systems and how to configure Prometheus and Alertmanager. If you have enjoyed this telemetry series, this is not the end! There is one more upcoming blog about advanced alerting techniques.


Conclusion

We have some exciting news for you as well. If you want to learn how to set up your own telemetry stack and scale it in production-grade environments, guided by NTC automation experts, check out the NEW telemetry deep dive course from NTC training.

-Xenia


Intro to Pandas (Part 3) – Forecasting the Network


Forecasting is a fascinating concept. Who does not want to know the future? Oracles of ancient times, a multitude of statistical forecasting models, and machine learning prediction algorithms have one thing in common: the thirst to know what is going to happen next. As fascinating as forecasting is, it is not an easy conquest. There are phenomena that can be predicted, because we understand what causes them and we have a large amount of historical data. An example is electricity consumption: it exhibits seasonality and predictability. On the other hand, there are phenomena that are difficult to predict, such as market trends that depend on human emotion and unpredictable world events (wars, for example).

Where does the network fall in the spectrum of forecasting ease and accuracy? How easily and effectively can we predict the next outage, a big dip in performance, or an anomaly that may point to an attack? Starting from the assumption that we have a large amount of data (and events mostly depend on machine behavior), the network can be quite predictable. A variety of events, such as outages, are predictable—some planned and some caused by happenstances, such as an overload or human error.

Like any human, the network engineer would like to have an oracle at their disposal to warn them about important events before they occur. Sizing and provisioning network resources based on forecast traffic and usage models, knowing how often one should update or reconfigure with minimal disruption, and planning maintenance windows based on traffic patterns are some powerful use cases for a network operator. Hence this blog, which gives the network engineer programmatic tools to automate forecasting of the network with Python Pandas.

Prerequisites

This blog is part of a series. You can read it independently of the series if you are familiar with Pandas and how to use Jupyter notebooks. However, you can start your journey from the beginning, especially if you want to actively read and work through the examples. I recommend starting with Jupyter Notebooks for Development and then Introduction to Pandas for Network Development. You can also read Intro to Pandas (Part 2) – Exploratory data analysis for network traffic; however, that part is not necessary in order to understand forecasting.

What Is Statistical Forecasting?

Statistical forecasting is the act of creating a model to predict future events based on past experience, with a certain degree of uncertainty. In this blog, we will focus on statistical forecasting methods. A variety of machine learning forecasting methods are analyzed in other blogs; however, simple is better, as studies over the past 40 years of the M competitions have shown. Statistical methods are less computationally complex, and the best machine learning fitting methods are not always optimal for forecasting.

Basic Forecasting Methods

Below is a list of basic forecasting methods and their definitions:

  • Straight line: this is a naive prediction that uses historical figures to predict growth, and it only applies to an upward trend.
  • Moving averages: one of the most popular methods; it takes the pattern of the data into account to estimate future values. A well-known implementation of moving averages is the Auto Regressive Integrated Moving Average (ARIMA) model.
  • Linear regression: in this case as well, a straight line is fitted to the data; however, this time it can capture upward or downward trends.
  • Multiple linear regression: if we want to use two or more variables to predict the future of another variable, for example using holidays and latency to predict network traffic patterns, multiple linear regression is our friend.

We will review implementations of the two most popular techniques: moving averages and linear regression with Pandas libraries.

How to Implement Forecasting

These basic steps are part of almost every forecasting implementation:

  • Preprocessing: this may include removing NaN values, adding metadata, or splitting your data into two distinct parts: the training data, which is used to make predictions, and the test data, which is used to validate them. Splitting your data is a whole article or two on its own: should you split data in half, in random chunks, etc.? (A minimal split sketch follows this list.)
  • Pick your poison…ehm…model: this may be the most difficult part, and some Exploratory Data Analysis may be required to pick a good algorithm.
  • Analyze the results: analysis is usually performed visually with graphical methods.
  • Iterate: periodic fine-tuning of the forecasting method may include changing algorithm parameters.
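
As a minimal sketch of the preprocessing step (assuming a chronologically ordered dataset and the network_data.csv file used later in this blog), a simple 80/20 chronological split looks like this:

import pandas as pd

# Clean the frame, then split it 80/20 in time order so the model is
# validated on data it has never seen.
df = pd.read_csv("../data/network_data.csv").dropna()

split_point = int(len(df) * 0.8)
train = df.iloc[:split_point]   # used to fit the model
test = df.iloc[split_point:]    # held out to validate predictions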

Forecasting the Network Example

Now that we know the basics about the theory of forecasting, let’s implement all the steps and apply moving averages and linear regression to a network dataset.

Dataset

The dataset that we will use is the Network Anomaly Detection Dataset. It includes Simple Network Management Protocol (SNMP) monitoring data. SNMP is the de facto protocol when it comes to telemetry for network appliances and can track a variety of interesting data related to machine performance, such as bytes in/out, errors, packets, connection hits, etc.

You will find the code referenced in the examples at the Pandas Blog GitHub repository.

Preprocessing

Preprocessing of the data includes cleaning and adding metadata. We need to add dates to this specific dataset.

We begin with the necessary imports and loading the csv file to a Pandas data frame:

import numpy as np
import pandas as pd

network_data = pd.read_csv("../data/network_data.csv")
network_data.columns

Index(['ifInOctets11', 'ifOutOctets11', 'ifoutDiscards11', 'ifInUcastPkts11',
       'ifInNUcastPkts11', 'ifInDiscards11', 'ifOutUcastPkts11',
       'ifOutNUcastPkts11', 'tcpOutRsts', 'tcpInSegs', 'tcpOutSegs',
       'tcpPassiveOpens', 'tcpRetransSegs', 'tcpCurrEstab', 'tcpEstabResets',
       'tcpActiveOpens', 'udpInDatagrams', 'udpOutDatagrams', 'udpInErrors',
       'udpNoPorts', 'ipInReceives', 'ipInDelivers', 'ipOutRequests',
       'ipOutDiscards', 'ipInDiscards', 'ipForwDatagrams', 'ipOutNoRoutes',
       'ipInAddrErrors', 'icmpInMsgs', 'icmpInDestUnreachs', 'icmpOutMsgs',
       'icmpOutDestUnreachs', 'icmpInEchos', 'icmpOutEchoReps', 'class'],
      dtype='object')

The table column titles printed above include characteristic SNMP data (such as TCP active open connections, input/output packets, and UDP input/output datagrams) that offer a descriptive picture of performance status and potential anomalies in network traffic. After this, we can add a date column or any other useful metadata. Let's keep it simple here and add one date per row, evenly spaced by day, sized to match one of our columns, ipForwDatagrams:

dates = pd.date_range('2022-03-01', periods=len(network_data["ipForwDatagrams"]))

We are now ready for the fun part of forecasting: implementing a moving average.

Moving Average

Pandas has a handy function called rolling that can shift through a window of data points and perform a function on them, such as an average or a min/max function. Think of it as a sliding window for data frames, where the slide is always of size 1 and the window size is the first parameter of the rolling function. For example, if we set this parameter to 5 and the function to average, a dataset with 10 data points yields six window averages (10 − 5 + 1); see the short sketch after the figure. This example is illustrated in the following figure, where we have marked the first three calculations of averages:

 Moving Average
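
The ten-point example is easy to verify in a couple of lines (a quick sketch on toy numbers, not the SNMP dataset):

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# With a window of 5, the first four entries are NaN, then six averages
# follow; e.g., index 4 holds (1 + 2 + 3 + 4 + 5) / 5 = 3.0
print(s.rolling(5).mean())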

How does this fit with forecasting? We can use historic data (the last 5 data points in the above example) to predict the future! Every new average from this rolling function gives a trend for what is coming next. Let's make this concrete with an example.

First we create a new data frame that includes our metadata dates and the value we want to predict, ipForwDatagrams:

df = pd.DataFrame(data=zip(dates, network_data["ipForwDatagrams"]), columns=['Date', 'ipForwDatagrams'])
df.head()

Date ipForwDatagrams
0 2022-03-01 59244345
1 2022-03-02 59387381
2 2022-03-03 59498140
3 2022-03-04 59581345
4 2022-03-05 59664453

Then we use the rolling average. We apply it to the IP forwarded datagrams column, ipForwDatagrams, to calculate a rolling average over a window of 1,000 data points (center=True aligns each average with the middle of its window). This way we use historic data to create a trend line, a.k.a. forecasting!

df["rolling"] = df["ipForwDatagrams"].rolling(1000, center=True).mean()

Finally, we will visualize the predictions:

# Plotting the effect of a rolling average
import matplotlib.pyplot as plt
plt.plot(df['Date'], df['ipForwDatagrams'])
plt.plot(df['Date'], df['rolling'])
plt.title('Data With Rolling Average')

plt.show()
Moving Average

The orange line represents our moving average prediction, and it seems to be doing pretty well. You may notice that it does not follow the spikes in the data; it is much smoother. If you experiment with the granularity, i.e., a rolling window smaller than 1,000, you will see an improvement in predictions at the cost of additional computation.

Linear Regression

Linear regression fits a linear function to a set of random data points. This is achieved by searching over values of the variables a and b that define the line function y = a * x + b. The line that minimizes the distance from the dataset's points is the result of the linear regression model.
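
Before applying it to the SNMP data, here is a minimal sketch of the idea on synthetic points (the slope 2.0 and intercept 1.0 are made up); np.polyfit searches for the a and b that minimize the squared vertical distances:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 2.0 * x + 1.0 + np.random.normal(0, 0.1, size=x.size)  # noisy line

a, b = np.polyfit(x, y, deg=1)  # least-squares fit of y = a * x + b
print(a, b)                     # close to 2.0 and 1.0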

Let’s see if we can build a linear regression predictor for our SNMP dataset. In this case, we will not use time series data; instead, we will consider the relationship, and as a consequence the predictability, of one variable using another. The variable that we treat as known, or historic, data is the TCP input segments, tcpInSegs. The variable that we are aiming to predict is the output segments, tcpOutSegs. Linear regression is implemented by linear_model in the sklearn library, a powerful tool for data science modeling. We set the x variable to the tcpInSegs column from the SNMP dataset and the y variable to tcpOutSegs. Our goal is to determine the constants a and b of the function y = a * x + b, i.e., a line that predicts the trend of output segments when we know the input segments:

from sklearn import linear_model
import matplotlib.pyplot as plt

x = pd.DataFrame(network_data['tcpInSegs'])
y = pd.DataFrame(network_data['tcpOutSegs'])
regr = linear_model.LinearRegression()
regr.fit(x, y)

The most important part of the above code is the linear_model.LinearRegression() class, which does its magic behind the scenes and returns a fitted regr object. This object encodes the a and b constants (exposed as regr.coef_ and regr.intercept_) and can be used to forecast the number of TCP output segments based on the number of TCP input segments. If you do not believe me, here is the plotted result:

plt.scatter(x, y,  color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Linear Regression

The blue line indicates our prediction, and if you ask me, it is pretty good. Now how about trying to predict IP input received, ipInReceives, from ICMP input messages (icmpInMsgs)? Would we achieve such good forecasting? Let’s just change the x and y variables and find out:

x = pd.DataFrame(network_data['icmpInMsgs'])
y = pd.DataFrame(network_data['ipInReceives'])
regr = linear_model.LinearRegression()
regr.fit(x, y)

We use the same code as above to generate the plot. This one does not look nearly as accurate. However, the blue line does capture the decreasing trend of IP received packets as ICMP input messages increase. That is a good example of where another forecasting algorithm could be used, such as dynamic regression or a nonlinear model; a sketch follows the figure.

Linear Regression
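
As a hedged sketch of what a nonlinear alternative could look like (an aside, not a method from the original analysis), a degree-2 polynomial can be fitted to the same x and y with numpy:

import numpy as np
import matplotlib.pyplot as plt

# Fit y = a2*x^2 + a1*x + a0 by least squares instead of a straight line.
coeffs = np.polyfit(x.squeeze(), y.squeeze(), deg=2)
poly = np.poly1d(coeffs)

xs = np.sort(x.squeeze().to_numpy())  # sort so the curve draws left to right
plt.scatter(x, y, color='black')
plt.plot(xs, poly(xs), color='red', linewidth=3)
plt.show()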

Conclusion

We have reviewed two of the most popular forecasting methodologies, moving averages and linear regression, with Python Pandas. We have noticed the benefits and accuracy of forecasting as well as its weaknesses.

This concludes the Pandas series for Network Automation Engineers. I hope you have enjoyed it as much as I have and added some useful tools to your ever-growing toolbox.

-Xenia


Intro to Pandas (Part 2) – Exploratory data analysis for network traffic


Data analytics is an important skill for every engineer, and even more so for the Network Engineer, who combs through large amounts of data for troubleshooting. Networking companies have been moving toward data science integrations for appliances and software. The Arista EOS Network Data Lake is a characteristic example, where Artificial Intelligence and Machine Learning are used to analyze data from different resources and lead to actionable decisions.

This blog aims to develop these skills, and it is part of a series related to data analysis for Network Engineers. The first part was a detailed introduction on how to use Pandas, a powerful Python data science framework, to analyze networking data. The second part included instructions on how to run the code in these blogs using Jupyter notebooks and the Poetry virtual environment. This third blog goes deeper into how we can explore black-box networking data with a powerful analysis technique, Exploratory Data Analysis (EDA). Naturally, we will be using Pandas and Jupyter notebooks in our examples.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of statistical and graphical techniques to make sense of any type of black-box data.

Goals of EDA

EDA aims at accomplishing the following goals:

  • Tailor a good fit: Matching your data as closely as possible to a distribution described by a mathematical expression has several benefits, such as predicting the next failure in your network.
  • Find outliers: Outliers are those odd data points that lie outside of a group of data. An example is a web server farm where all port requests are aimed at specific ports, and every once in a while a random port is requested. An outlier may be caused by error, intentional testing, or even adversarial attack behavior.
  • Create a list of ranked important factors: Removing unwanted features or finding the most important ones is called dimensionality reduction in data science terms. For example, with EDA you will be able to distinguish using statistical metrics a subset of the most important features or appliances that may affect your network performance, outages, and errors.
  • Discover optimal settings: How many times have you wondered how it would be if you could fine-tune your BGP timers, MTUs, or bandwidth allocation and not just guess these values? EDA helps discover the best value for networking settings.

Why EDA

EDA has proven to be an invaluable tool for Data Scientists, so why not for Network Engineers? We gather a lot of network data, such as packet captures and syslogs, that we do not know how to make sense of or how to mine for value. Even when we use out-of-the-box data analytics tools, such as Splunk, the insight a Network Engineer gains from building a model and processing the raw data is invaluable.

How to Implement EDA

To implement EDA, you need tools that you probably already use in your day-to-day network operations without knowing they were part of EDA:

  • Strong graphical analysis: from single variable plots to time series, and multi-variable plots, there is a graphical tool in EDA that fits your problem.
  • Statistics: this may include hypothesis testing, calculations of summary statistics, metrics for scale, and the shape of your data.

We will explore these techniques with a network dataset in the next section.

EDA for Network Data

In this section, we will review data preprocessing with graphical and statistical analysis EDA techniques.

Dataset

The dataset we will use is a 5GB packet capture of Operating System (OS) scans from the Kitsune Network Attack Dataset. You will find the code referenced below in the Pandas Blog GitHub repository.

Preprocessing

Preprocessing of the data includes cleaning and adding metadata. We will add useful metadata to our dataset.

We start with the necessary imports and reading the csv file to a Pandas data frame:

import numpy as np
import pandas as pd

os_scan_data = pd.read_csv("../data/OS_Scan_dataset.csv")
os_scan_data

That will print only a subset of the data, since the frame is too large to display in full.

For more information about Pandas data frames, please check the Intro to Pandas blog post.

Then, we will create metadata timestamp objects using the to_datetime function:

import datetime

timestamps = pd.to_datetime(os_scan_data["Time"], format='%Y-%m-%d %H:%M:%S.%f')
os_scan_data["Time"] = timestamps
print("Timestamps")
print(timestamps)

The timestamps are shown below:

Timestamps
0         2017-08-07 08:17:12.597437
1         2017-08-07 08:17:12.597474
2         2017-08-07 08:17:12.597553
3         2017-08-07 08:17:12.597558
4         2017-08-07 08:17:12.597679
                     ...            
1697846   2017-08-07 09:09:25.354194
1697847   2017-08-07 09:09:25.354321
1697848   2017-08-07 09:09:25.354341
1697849   2017-08-07 09:09:25.354358
1697850   2017-08-07 09:09:25.354493
Name: Time, Length: 1697851, dtype: datetime64[ns]

Finally, we will calculate interesting derived data, such as the packet interarrival times. To this end, we will use the numpy function np.diff, which takes a column of numbers and subtracts consecutive rows pairwise:

interarrival_times = np.diff(timestamps)
interarrival_times

The packet interarrival values are printed below:

array([ 37000,  79000,   5000, ...,  20000,  17000, 135000],
      dtype='timedelta64[ns]')

Because np.diff returns one fewer element than its input, we append a trailing zero so the array length matches the data frame, cast it to int, and store it in os_scan_data; then we print the columns of the dataset to verify that the Interarrivals column has been appended:

interarrival_times = np.append(interarrival_times, [0])
os_scan_data["Interarrivals"] = interarrival_times.astype(int)
os_scan_data.columns

Below are the column names of our data after the preprocessing:

Index(['No.', 'Time', 'Source', 'Destination', 'Protocol', 'Length', 'Info',
       'Src Port', 'Dst Port', 'Interarrivals'],
      dtype='object')

Now we are ready to create pretty graphs!

Graphical Analysis

In this section, we will focus on two graphical techniques from EDA: histograms and scatter plots. We will demonstrate how to combine the information with jointplots to analyze black-box datasets.

Histogram

The first graph that we will make may not be pretty; however, it demonstrates the value and flexibility of Pandas and graphical analysis for data exploration:

os_scan_data.hist(column=["Length", "Interarrivals", "Src Port", "Dst Port"])

With a single line of code and the power of Pandas data frames, we already have a set of meaningful plots. A histogram offers a graphical summary of the distribution of a single-variable dataset. In the above histograms, we see how the values of Length, Interarrivals, Src Port, and Dst Port are distributed, i.e., spread, over a continuous interval of values.

Histograms offer insight into the shape of our data, and they can be fine-tuned to give us a better point of view. The main “ingredient” of the histogram is the bin; a bin corresponds to the bars that you see in the graphs above, and its height indicates the number of elements that fall within a range of values. The default number of bins in the data frame hist function is 10. For example, a bin of size (width) 10 and height 1000 indicates that there are 1000 values x within the range 0 <= x < 10. Modifying the bin size is a powerful technique to get additional granularity or a “big picture” view of the data distribution:

os_scan_data.hist(column='Length', bins=20)
os_scan_data.hist(column='Length', bins=100)
hist-length-20hist-length

There is a whole science to fine-tuning a histogram’s bin size. A good rule of thumb is that if you have dense data, a large bin size will give you a good “bird’s-eye view”. In the case of packet lengths, the data is sparse; therefore, the smaller bin helps us distinguish the data’s shape. For a data-driven choice of bin width, see the sketch below.
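
One common heuristic is the Freedman-Diaconis rule, which derives the bin width from the spread of the data instead of a guess; this sketch is an aside and assumes the column's interquartile range is nonzero:

import numpy as np

lengths = os_scan_data["Length"].to_numpy()
q75, q25 = np.percentile(lengths, [75, 25])
# Freedman-Diaconis: bin width = 2 * IQR / n^(1/3); assumes IQR > 0
bin_width = 2 * (q75 - q25) / len(lengths) ** (1 / 3)
n_bins = int((lengths.max() - lengths.min()) / bin_width)
os_scan_data.hist(column="Length", bins=n_bins)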

Scatter Plot

A scatter plot is another common graphical tool of EDA. Using a scatter plot, we are plotting two variables against each other with the goal of correlating their values:

os_scan_data.plot.scatter(x='Interarrivals', y='Src Port')
os_scan_data.plot.scatter(x='Interarrivals', y='Dst Port')

The story narrated by these two graphs is that packet interarrival values for source ports have a wider spread, i.e., 0 to 2 x 10^7, whereas for destination ports these values have half the spread. That may point to a slow response or a high-speed scan, such as an OS scan! Part of the story is the heavy use of low source and destination port numbers. This may point to OS services running on these ports, targeted across a wide spread of intervals.

Joint Plots

Now let’s combine the scatter and histogram plots for additional insight into our data. We will use an additional plotting package, seaborn:

<span role="button" tabindex="0" data-code="import matplotlib.pyplot as plt import seaborn as sns short_interarrivals = os_scan_data[(os_scan_data['Interarrivals']
import matplotlib.pyplot as plt
import seaborn as sns

short_interarrivals = os_scan_data[(os_scan_data['Interarrivals'] < 10000) & (os_scan_data['Interarrivals'] > 0)]
sns.jointplot(x='Interarrivals', y='Dst Port', kind='hex', data=short_interarrivals)
sns.jointplot(x='Interarrivals', y='Dst Port', kind='kde', data=short_interarrivals)
plt.show()

Note that we used the power of Pandas data frames to define a new frame, short_interarrivals, which keeps only interarrivals between 0 and 10,000 nanoseconds. The hex type plot resembles a scatter plot with histograms on the sides; the color coding of the data points indicates a higher concentration of values in a specific area. The kde (Kernel Density Estimate) plot gives a distribution similar to a histogram; however, the centralizing values, i.e., kernels, are visualized as well. The three distinct parts of the kde graph would be described by three different mathematical distributions.

Summary Statistics

Summary statistics as part of EDA are extremely useful when dealing with a large set of data:

short_interarrivals.describe()

With a single line of code, the describe Pandas function gives us several statistics such as percentiles, min, max values, etc. These statistics can lead to distribution fitting and additional insights into the data.

Autocorrelation

Finally, autocorrelation calculations show how much the values within a series, i.e., the length or interarrival values, are related:

length_series = os_scan_data["Length"]
length_series.autocorr()  
0.3938818297281779

interarrival_series = os_scan_data["Interarrivals"]
interarrival_series.autocorr() 
-0.031230988268827732

In this case the packet lengths are positively correlated, which means that if a value is above average, the next value will likely be above average. Negative autocorrelation such as the one that is observed for packet interarrivals, means that if an interarrival is above average, the next interarrival will likely be below average. This is a powerful metric for predictions.
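
The autocorr function also accepts a lag parameter if you want to compare samples that are further apart; a short sketch (the lag of 10 is an arbitrary choice):

# lag controls how far apart the compared samples are; the default of 1
# compares each value with the next one.
print(length_series.autocorr(lag=1))   # same as the call above
print(length_series.autocorr(lag=10))  # correlation across a 10-sample gap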


Conclusion

We have reviewed how to use EDA techniques to extract useful information from black-box data. This part of the data analytics for Network Engineers series offers a deeper understanding of the power of the Pandas library and the statistical techniques that you can implement with it. In the last part of the series, we will review some predictive models. Stay tuned!

-Xenia
