Intro to Pandas (Part 3) – Forecasting the Network


Forecasting is a fascinating concept. Who does not want to know the future? Oracles from ancient times, a multitude of statistical forecasting models, and machine learning prediction algorithms have one thing in common: the thirst to know what is going to happen next. As fascinating as forecasting is, it is not an easy conquest. Some phenomena can be predicted because we understand what causes them and have a large amount of historical data. An example is electricity consumption: it exhibits seasonality and predictability. On the other hand, there are phenomena that are difficult to predict, such as market trends that depend on human emotion and unpredictable world events (wars, for example).

Where does the network fall on the spectrum of forecasting ease and accuracy? How easily and effectively can we predict the next outage, a big dip in performance, or an anomaly that may point to an attack? Starting from the assumption that we have a large amount of data (and that events mostly depend on machine behavior), the network can be quite predictable. A variety of events, such as outages, are predictable: some planned, some caused by happenstance, such as an overload or human error.

Like any human, the network engineer would like an oracle at their disposal to warn them about the future occurrence of important events. Sizing and provisioning network resources based on forecast traffic and usage models, knowing how often to update or reconfigure with minimal disruption, and planning maintenance windows around traffic patterns are some powerful use cases for a network operator. Hence this blog, which gives the network engineer programmatic tools to automate forecasting of the network with Python Pandas.

Prerequisites

This blog is part of a series. You can read it independently of the series if you are familiar with Pandas and how to use Jupyter notebooks. However, you can also start your journey from the beginning, especially if you want to actively work through the examples. I recommend starting with Jupyter Notebooks for Development and then Introduction to Pandas for Network Development. You can also read Intro to Pandas (Part 2) – Exploratory data analysis for network traffic; however, that part is not necessary for understanding forecasting.

What Is Statistical Forecasting?

Statistical forecasting is the act of creating a model to predict future events, with a certain degree of uncertainty, based on past experience. In this blog, we will focus on statistical forecasting methods. Machine learning forecasting methods are analyzed in other blogs; however, simple is better, as studies over the past 40 years of the M competitions and their analysis have shown. Statistical methods are less computationally complex, and the machine learning methods that best fit historical data are not always optimal for forecasting.

Basic Forecasting Methods

Below is a list of basic forecasting methods and their definitions:

  • Straight line: this is a naive prediction that uses historical figures to predict growth and only applies to an upward trend.
  • Moving averages: one of the most popular methods that takes into account the pattern of data to estimate future values. A well known implementation of moving averages is the Auto Regressive Integrated Moving Average (ARIMA).
  • Linear regression: in this case as well, a straight line is fitted to the data; however this time it can predict upward or downward trends.
  • Multiple linear regression: if we want to use two or more variables to predict the future of another variable, for example use holidays and latency to predict network traffic patterns, multiple linear regression is our friend.

We will review implementations of the two most popular techniques, moving averages and linear regression, using Pandas and related libraries.

How to Implement Forecasting

These basic steps are part of almost every forecasting implementation:

  • Preprocessing: this may include removing NaN values, adding metadata, or splitting your data into two distinct parts: the training data, used to build predictions, and the test data, used to validate them. How to split your data is a whole article or two on its own (in half, in random chunks, etc.); a minimal chronological split is sketched after this list.
  • Pick your poison…ehm…model: this may be the most difficult part, and some Exploratory Data Analysis may be required to pick a good algorithm.
  • Analyze the results: analysis is usually performed visually with graphical methods.
  • Iterate: periodic fine-tuning of the forecasting method may include changing algorithm parameters.
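
As a small taste of the preprocessing step, here is a minimal sketch of a chronological train/test split, assuming a hypothetical time-ordered data frame df. For time series, splitting by time (rather than randomly) keeps the model from "seeing" the future during training:

# Minimal sketch, assuming a hypothetical time-ordered data frame df.
split_point = int(len(df) * 0.8)      # first 80% trains, last 20% tests
train_data = df.iloc[:split_point]
test_data = df.iloc[split_point:]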

Forecasting the Network Example

Now that we know the basics about the theory of forecasting, let’s implement all the steps and apply moving averages and linear regression to a network dataset.

Dataset

The dataset that we will use is the Network Anomaly Detection Dataset. It includes Simple Network Management Protocol (SNMP) monitoring data. SNMP is the de facto protocol when it comes to telemetry for network appliances, and it can track a variety of interesting data related to machine performance, such as bytes in/out, errors, packets, connection hits, etc.

You will find the code referenced in the examples at the Pandas Blog GitHub repository.

Preprocessing

Preprocessing of the data includes cleaning and adding metadata. We need to add dates to this specific dataset.

We begin with the necessary imports and loading the csv file to a Pandas data frame:

import numpy as np
import pandas as pd

network_data = pd.read_csv("../data/network_data.csv")
network_data.columns

Index(['ifInOctets11', 'ifOutOctets11', 'ifoutDiscards11', 'ifInUcastPkts11',
       'ifInNUcastPkts11', 'ifInDiscards11', 'ifOutUcastPkts11',
       'ifOutNUcastPkts11', 'tcpOutRsts', 'tcpInSegs', 'tcpOutSegs',
       'tcpPassiveOpens', 'tcpRetransSegs', 'tcpCurrEstab', 'tcpEstabResets',
       'tcpActiveOpens', 'udpInDatagrams', 'udpOutDatagrams', 'udpInErrors',
       'udpNoPorts', 'ipInReceives', 'ipInDelivers', 'ipOutRequests',
       'ipOutDiscards', 'ipInDiscards', 'ipForwDatagrams', 'ipOutNoRoutes',
       'ipInAddrErrors', 'icmpInMsgs', 'icmpInDestUnreachs', 'icmpOutMsgs',
       'icmpOutDestUnreachs', 'icmpInEchos', 'icmpOutEchoReps', 'class'],
      dtype='object')

The table column titles printed above include characteristic SNMP data (such as TCP active open connections, input/output packets, and UDP input/output datagrams) that offer a descriptive picture of performance status and potential anomalies in network traffic. After this we can add a date column or any other useful metadata. Let’s keep it simple here and add dates spaced one day apart, one per row of the ipForwDatagrams column:

dates = pd.date_range('2022-03-01', periods=len(network_data["ipForwDatagrams"]))

We are now ready for the fun part of forecasting: implementing a moving average.

Moving Average

Pandas has a handy function called rolling that can shift through a window of data points and perform a function on them, such as an average or a min/max. Think of it as a sliding window for data frames, where the slide is always of size 1 and the window size is the first parameter of the rolling function. For example, if we set this parameter to 5 and the function to average, we will calculate 6 averages in a dataset with 10 data points. This example is illustrated in the following figure, where we have marked the first three calculations of averages:

[Figure: moving average over a sliding window, with the first three window calculations marked]
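
To make the window arithmetic concrete, here is a minimal sketch with toy numbers, 10 points and a window of 5:

# Toy illustration: 10 data points, window of 5.
s = pd.Series(range(10))
s.rolling(5).mean()   # first 4 results are NaN, followed by 6 window averages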

How does this fit with forecasting? We can use historic data (the last 5 data points in the above example) to predict the future! Every new average from the rolling function gives a trend for what is coming next. Let’s make this concrete with an example.

First we create a new data frame that includes our metadata dates and the value we want to predict, ipForwDatagrams:

df = pd.DataFrame(data=zip(dates, network_data["ipForwDatagrams"]), columns=['Date', 'ipForwDatagrams'])
df.head()

Date ipForwDatagrams
0 2022-03-01 59244345
1 2022-03-02 59387381
2 2022-03-03 59498140
3 2022-03-04 59581345
4 2022-03-05 59664453

Then we use the rolling average. We apply it to the IP forwarded datagrams column, ipForwDatagrams, to calculate a rolling average over a window of 1,000 data points. This way we use historic data to create a trend line, a.k.a. forecasting!

df["rolling"] = df["ipForwDatagrams"].rolling(1000, center=True).mean()

Finally, we will visualize the predictions:

# Plotting the effect of a rolling average
import matplotlib.pyplot as plt
plt.plot(df['Date'], df['ipForwDatagrams'])
plt.plot(df['Date'], df['rolling'])
plt.title('Data With Rolling Average')

plt.show()
[Figure: Data With Rolling Average]

The orange line represents our moving average prediction, and it seems to be doing pretty well. You may notice that it does not follow the spikes in the data; it is much smoother. If you experiment with the granularity, i.e., a rolling window smaller than 1,000 points, you will see an improvement in predictions at the cost of additional computation.
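
For instance, here is a sketch that reuses the df frame from above to compare the 1,000-point window with a hypothetical 100-point one (the rolling_100 column name is just for illustration):

# Smaller window: tracks spikes more closely, at the cost of more computation.
df["rolling_100"] = df["ipForwDatagrams"].rolling(100, center=True).mean()
plt.plot(df['Date'], df['rolling'], label='window=1000')
plt.plot(df['Date'], df['rolling_100'], label='window=100')
plt.legend()
plt.show()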

Linear Regression

Linear regression fits a linear function to a set of random data points. This is achieved by searching over the possible values of the variables a and b that define the line function y = a * x + b. The line that minimizes the distance to the dataset’s points is the result of the linear regression model.
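
As an aside, you can recover the a and b of such a line directly with NumPy’s least-squares fit; here is a minimal sketch with made-up toy data:

# Degree-1 (linear) least-squares fit of toy points; np.polyfit returns [a, b].
x_toy = np.array([0, 1, 2, 3, 4])
y_toy = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
a, b = np.polyfit(x_toy, y_toy, 1)
print(f"y = {a:.2f} * x + {b:.2f}")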

Let’s see if we can calculate a linear regression predictor for our SNMP dataset. In this case, we will not use time series data; instead we will consider the relationship between two variables, and as a consequence the predictability of one from the other. The variable that we treat as known, or historic, data is the TCP input segments, tcpInSegs. The variable that we aim to predict is the output segments, tcpOutSegs. Linear regression is implemented by linear_model in the sklearn library, a powerful tool for data science modeling. We set the x variable to the tcpInSegs column of the SNMP dataset and the y variable to tcpOutSegs. Our goal is to determine the constants a and b of the function y = a * x + b, i.e., a line that predicts the trend of output segments when we know the input segments:

from sklearn import linear_model
import matplotlib.pyplot as plt

x = pd.DataFrame(network_data['tcpInSegs'])
y = pd.DataFrame(network_data['tcpOutSegs'])
regr = linear_model.LinearRegression()
regr.fit(x, y)

The most important part of the above code is the linear_model.LinearRegression() function, which does its magic behind the scenes and returns a regr object. This object holds the fitted a and b values and can be used to forecast the number of TCP output segments based on the number of TCP input segments. If you do not believe me, here is the plotted result:

plt.scatter(x, y,  color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
[Figure: Linear Regression of tcpOutSegs against tcpInSegs]

The blue line indicates our prediction, and if you ask me, it is pretty good. Now how about trying to predict IP packets received, ipInReceives, from ICMP input messages (icmpInMsgs)? Would we achieve equally good forecasting? Let’s just change the x and y variables and find out:

x = pd.DataFrame(network_data['icmpInMsgs'])
y = pd.DataFrame(network_data['ipInReceives'])
regr = linear_model.LinearRegression()
regr.fit(x, y)

We use the same code as above to generate the plot. This one does not look nearly as accurate. However, the blue line does capture the decreasing trend of IP received packets relative to ICMP inputs. This is a good example of where another forecasting algorithm, such as dynamic regression or a nonlinear model, could be used.

[Figure: Linear Regression of ipInReceives against icmpInMsgs]
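
If you want the fitted constants themselves, scikit-learn exposes the slope and intercept of y = a * x + b as attributes of the fitted model; for this second fit the slope comes out negative, matching the decreasing trend:

print(regr.coef_)        # a: the slope (negative here)
print(regr.intercept_)   # b: the intercept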

Conclusion

We have reviewed two of the most popular forecasting methodologies, moving averages and linear regression, with Python Pandas. We have seen the benefits and accuracy of forecasting as well as its weaknesses.

This concludes the Pandas series for Network Automation Engineers. I hope you have enjoyed it as much as I have and added useful tools to your ever-growing toolbox.

-Xenia


Intro to Pandas (Part 2) – Exploratory data analysis for network traffic


Data analytics is an important skill for every engineer, and even more so for the Network Engineer, who goes through large amounts of data when troubleshooting. Networking companies have been moving toward data science integrations for appliances and software. The Arista EOS Network Data Lake is a characteristic example, where Artificial Intelligence and Machine Learning are used to analyze data from different resources and lead to actionable decisions.

This blog aims to develop these skills, and it is part of a series related to data analysis for Network Engineers. The first part was a detailed introduction on how to use Pandas, a powerful Python data science framework, to analyze networking data. The second part included instructions on how to run the code in these blogs using Jupyter notebooks and the Poetry virtual environment. This third blog goes deeper into how we can explore black-box networking data with a powerful analysis technique, Exploratory Data Analysis (EDA). Naturally, we will be using Pandas and Jupyter notebooks in our examples.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of statistical and graphical techniques to make sense of any type of black-box data.

Goals of EDA

EDA aims at accomplishing the following goals:

  • Fit a distribution: matching your data as closely as possible to a distribution described by a mathematical expression has several benefits, such as predicting the next failure in your network.
  • Find outliers: outliers are those odd data points that lie outside a group of data. An example is a web server farm where all port requests target specific ports, and every once in a while a random port is requested. An outlier may be caused by error, intentional testing, or even adversarial attack behavior (see the sketch after this list).
  • Create a list of ranked important factors: removing unwanted features or finding the most important ones is called dimensionality reduction in data science terms. For example, with EDA you will be able to distinguish, using statistical metrics, a subset of the most important features or appliances that may affect your network performance, outages, and errors.
  • Discover optimal settings: how many times have you wished you could fine-tune your BGP timers, MTUs, or bandwidth allocation instead of just guessing these values? EDA helps discover the best value for networking settings.
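
As a small taste of the outlier-hunting goal, below is a minimal sketch of the common interquartile-range (IQR) rule applied to a toy series of port numbers (all values here are made up for illustration):

import pandas as pd

ports = pd.Series([80, 80, 80, 80, 443, 443, 31337])   # toy data, one odd port
q1, q3 = ports.quantile(0.25), ports.quantile(0.75)
iqr = q3 - q1
outliers = ports[(ports < q1 - 1.5 * iqr) | (ports > q3 + 1.5 * iqr)]
print(outliers)   # flags the unusual port 31337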

Why EDA

EDA has proven an invaluable tool for Data Scientists, so why not for Network Engineers? We gather a lot of network data, such as packet captures and syslogs, that we do not know how to make sense of or how to mine for value. Even when we use out-of-the-box data analytics tools, such as Splunk, the insight a Network Engineer gains from building a model and processing raw data is invaluable.

How to Implement EDA

To implement EDA, you need tools that you probably already use in your day-to-day network operations without knowing they were part of EDA:

  • Strong graphical analysis: from single-variable plots to time series and multi-variable plots, there is a graphical tool in EDA that fits your problem.
  • Statistics: this may include hypothesis testing, calculations of summary statistics, metrics for scale, and the shape of your data.

We will explore these techniques with a network dataset in the next section.

EDA for Network Data

In this section, we will review data preprocessing with graphical and statistical analysis EDA techniques.

Dataset

The dataset we will use is a 5GB packet capture of Operating System (OS) scans from the Kitsune Network Attack Dataset. You will find the code referenced below in the Pandas Blog GitHub repository.

Preprocessing

Preprocessing of the data includes cleaning and adding metadata. We will add useful metadata to our dataset.

We start with the necessary imports and reading the csv file to a Pandas data frame:

import numpy as np
import pandas as pd

os_scan_data = pd.read_csv("../data/OS_Scan_dataset.csv")
os_scan_data

That will print a subset of the data, since the file is too large to display in full.

For more information about Pandas data frames, please check the Intro to Pandas blog post.

Then, we will create metadata timestamp objects using the to_datetime function:

import datetime

timestamps = pd.to_datetime(os_scan_data["Time"], format='%Y-%m-%d %H:%M:%S.%f')
os_scan_data["Time"] = timestamps
print("Timestamps")
print(timestamps)

The timestamps are shown below:

Timestamps
0         2017-08-07 08:17:12.597437
1         2017-08-07 08:17:12.597474
2         2017-08-07 08:17:12.597553
3         2017-08-07 08:17:12.597558
4         2017-08-07 08:17:12.597679
                     ...            
1697846   2017-08-07 09:09:25.354194
1697847   2017-08-07 09:09:25.354321
1697848   2017-08-07 09:09:25.354341
1697849   2017-08-07 09:09:25.354358
1697850   2017-08-07 09:09:25.354493
Name: Time, Length: 1697851, dtype: datetime64[ns]

Finally, we will calculate interesting derivative data, such as the packet interarrivals. To this end, we will use the NumPy function np.diff, which takes a column of numbers as input and subtracts consecutive entries:

interarrival_times = np.diff(timestamps)
interarrival_times

The packet interarrival values are printed below:

array([ 37000,  79000,   5000, ...,  20000,  17000, 135000],
      dtype='timedelta64[ns]')

We pad the array with a trailing 0 (np.diff returns one element fewer than its input), append it to the os_scan_data data frame casting it to int, and print the columns of the dataset to verify that the Interarrivals column has been appended:

interarrival_times = np.append(interarrival_times, [0])
os_scan_data["Interarrivals"] = interarrival_times.astype(int)
os_scan_data.columns

Below are the column names of our data after the preprocessing:

Index(['No.', 'Time', 'Source', 'Destination', 'Protocol', 'Length', 'Info',
       'Src Port', 'Dst Port', 'Interarrivals'],
      dtype='object')

Now we are ready to create pretty graphs!

Graphical Analysis

In this section, we will focus on two graphical techniques from EDA: histograms and scatter plots. We will demonstrate how to combine the information with jointplots to analyze black-box datasets.

Histogram

The first graph that we will make may not be pretty, however it demonstrates the value and flexibility of Pandas and graphical analysis for data exploration:

os_scan_data.hist(column=["Length", "Interarrivals", "Src Port", "Dst Port"])

With a single line of code and the power of Pandas data frames, we already have a set of meaningful plots. A histogram offers a graphical summary of the distribution of a single variable. In the above histograms, we see how the values of Length, Interarrivals, Src Port, and Dst Port are distributed, i.e., spread, over a continuous interval of values.

Histograms offer insight into the shape of our data, and they can be fine-tuned to give us a better point of view. The main “ingredient” of a histogram is the bin; a bin corresponds to one of the bars in the graphs above, and its height indicates the number of elements that fall within a range of values. The default number of bins in the data frame hist function is 10. For example, a bin of size (width) 10 and height 1000 indicates that there are 1000 values x within the range 0 <= x < 10. Modifying the number of bins is a powerful technique to get additional granularity or a “big picture” view of the data distribution:

os_scan_data.hist(column='Length', bins=20)
os_scan_data.hist(column='Length', bins=100)
[Figures: histograms of Length with 20 bins and with 100 bins]

There is a whole science to fine-tuning a histogram’s bin size. A good rule of thumb is that if you have dense data, a large bin size will give you a good “bird’s-eye view”. In the case of packet lengths, we have sparse data; therefore, the smaller bins help us distinguish the shape of the data.

Scatter Plot

A scatter plot is another common graphical tool of EDA. Using a scatter plot, we are plotting two variables against each other with the goal of correlating their values:

os_scan_data.plot.scatter(x='Interarrivals', y='Src Port')
os_scan_data.plot.scatter(x='Interarrivals', y='Dst Port')

The story narrated by these two graphs is that packet interarrival values for source ports have a wide spread, i.e., 0..2 x 10^7, whereas for destination ports these values have half the spread. That may point to slow responses or a high-speed scan, such as an OS scan! Part of the story is the heavy use of low source and destination port numbers. This may point to OS services running on these ports, targeted over a wide spread of intervals.

Joint Plots

Now let’s combine the scatter and histogram plots for additional insight into our data. We will use an additional plotting package, seaborn:

<span role="button" tabindex="0" data-code="import matplotlib.pyplot as plt import seaborn as sns short_interarrivals = os_scan_data[(os_scan_data['Interarrivals']
import matplotlib.pyplot as plt
import seaborn as sns

short_interarrivals = os_scan_data[(os_scan_data['Interarrivals'] < 10000) & (os_scan_data['Interarrivals'] > 0)]
sns.jointplot(x='Interarrivals', y='Dst Port', kind='hex', data=short_interarrivals)
sns.jointplot(x='Interarrivals', y='Dst Port', kind='kde', data=short_interarrivals)
plt.show()

Note that we used the power of the Pandas data frame to define a new frame, short_interarrivals, which keeps only the interarrivals that are less than 10K nanoseconds. The hex plot resembles a scatter plot with histograms on the sides; the color coding of the data points indicates a higher concentration of values in a specific area. The kde (Kernel Density Estimate) plot gives a distribution similar to a histogram, with the centralizing values, i.e., kernels, visualized as well. The three distinct parts of the kde graph would be described by three different mathematical distributions.

Summary Statistics

Summary statistics as part of EDA are extremely useful when dealing with a large set of data:

short_interarrivals.describe()

With a single line of code, the describe Pandas function gives us several statistics such as percentiles, min, max values, etc. These statistics can lead to distribution fitting and additional insights into the data.

Autocorrelation

Finally, autocorrelation calculations show how much the values within a series, e.g., the length or interarrival values, are related to their own previous values:

length_series = os_scan_data["Length"]
length_series.autocorr()  
0.3938818297281779

interarrival_series = os_scan_data["Interarrivals"]
interarrival_series.autocorr() 
-0.031230988268827732

In this case the packet lengths are positively autocorrelated, which means that if a value is above average, the next value will likely be above average as well. Negative autocorrelation, such as the one observed for packet interarrivals, means that if an interarrival is above average, the next interarrival will likely be below average. This is a powerful metric for predictions.
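
Note that autocorr uses a lag of 1 by default; passing a larger lag shows how quickly the relationship fades:

length_series.autocorr(lag=2)    # correlation between values two packets apart
length_series.autocorr(lag=10)   # typically weaker as the lag grows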


Conclusion

We have reviewed how to use EDA techniques to extract useful information from black-box data. This part of the data analytics for Network Engineers series offers a deeper understanding of the power of the Pandas library and the statistical techniques that you can implement with it. In the last part of the series, we will review some predictive models. Stay tuned!

-Xenia


Introduction to Pandas for Network Development


Pandas is a well-known Python framework for Data Scientists. It is used to process large sets of data to derive statistics, perform grouping, and create meaningful visualizations. As data processing becomes prevalent in IT, Pandas offers a versatile tool for the day-to-day operations of a Network Engineer. In this post, we will review the basics of Pandas by exploring real-world networking examples.

There are numerous blogs and excellent documentation for the Pandas framework. My goal is to give a different perspective, as a Network Automation Engineer who has dealt with data processing for various use cases, such as network security data and time series monitoring. This post is the first of a series that will show you how you can infuse data analytics into your day-to-day work for network operations and automation. If you want to follow this post interactively using a Jupyter notebook, all the code and datasets can be found in this GitHub repository.

Why should a Network Engineer care about Pandas?

Below, you will find specific reasons why you should care about Pandas:

  • Exploratory Data Analysis (EDA): Have you ever had a set of text or excel files and wanted to make sense of the data they hide? EDA is a philosophy that can help you make sense of large amounts of seemingly unintelligible data, using mainly visual techniques. It helps you find anomalies, uncover data structure, and test and prove your hypotheses, for example why a part of a network is a bottleneck or whether it is the network’s fault.
  • Excel is everywhere: Excel is a powerful software tool that has long been used for storing and processing network data, as a source of truth for network data, and for business intelligence. It serves its purpose; however, it has limitations that Pandas fills programmatically with its data structures, statistics, and visualization.
  • Time series: One of my favorite parts of Pandas is the time functionality that can help you further process the network monitoring data obtained from raw captures or time series databases.
  • Batfish: Did you know that Batfish, a popular tool for network engineers for configuration analysis, uses Pandas data frames to return its results? Well, now you know and you may want to learn even more about Pandas!

What is Python Pandas?

Python Pandas is a rich framework that includes data models, mathematical functions, and visualizations, with a goal to make data processing efficient and simple. Some of the most powerful features of the Pandas framework are listed below:

  • Data modeling: Pandas introduces data models such as data frames and series. This is an efficient way of organizing data for further statistical processing and accessing.
  • Indexing: Pandas has a powerful indexing mechanism that varies from simple indexing to multi-indexing. The goal of indexing is to easily access parts of the data for exploratory analysis or piece-wise processing.
  • Grouping: Pandas offers ways of cleaning up or grouping data based on your specific criteria.
  • Statistical Processing: All the above features lead to seamless statistical processing of groups or data structures that often is performed in a simple, single line of code.
  • Visualization: When it comes to data sets, a picture is worth 1,000 words. Pandas offers a wealth of plotting mechanisms, even interactive plots.

In the rest of this blog, you will review these concepts with a network packet capture use case. The data set I am using is from the Kitsune Network Attack Dataset. I chose a packet capture (pcap) of the Mirai botnet. Mirai is a Distributed Denial of Service (DDoS) botnet attack that used vulnerable Internet of Things (IoT) devices to take down Domain Name Service (DNS) servers. As a consequence, Mirai disrupted the services of high profile sites such as Amazon and Reddit. There is an interesting story behind the dataset and it can demonstrate how a network engineer can be part of a solution to such network attacks.

Data Modeling

Data modeling is dear to the heart of a Network Automation Engineer because it standardizes the way we use Infrastructure as Code (IaC). Data modeling with Pandas aims at organizing and abstracting data for efficient processing.

Series

Series are arrays of a single dimension, i.e., single-column, multiple-row arrays. The main characteristic of a series is that it has an index, which can consist of labels, numbers, timestamps, etc.

Let’s look at an example of a series with the default index of increasing numbers. We first export the packet capture to the file Mirai.csv. The code imports the required libraries and then reads the csv file using the pandas function read_csv:

import numpy as np
import pandas as pd

mirai_data = pd.read_csv("../data/Mirai.csv")

This is a pretty large file and part of the output is shown below:

mirai_data

Now let’s look at a Series data structure:

mirai_series = mirai_data["Source"]
type(mirai_series)

Below you can see what the series looks like and the type of the object:

0             192.168.2.108
1             192.168.2.108
2               192.168.2.1
3               192.168.2.1
4         48:02:2e:01:83:15
                ...
764132    Shenzhen_98:ee:fd
764133    Shenzhen_98:ee:fd
764134    Shenzhen_98:ee:fd
764135    Shenzhen_98:ee:fd
764136        192.168.2.115
Name: Source, Length: 764137, dtype: object

pandas.core.series.Series

Note that the column is indexed automatically by the Pandas framework with increasing numbers.
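
If the default numeric index does not fit your data, you can supply your own labels when building a Series; here is a minimal sketch with toy interface counters:

# Toy Series indexed by interface name instead of increasing numbers.
errors = pd.Series([0, 3, 12], index=["eth0", "eth1", "eth2"])
errors["eth2"]   # label-based access returns 12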

There is a variety of operations that you can perform and Series attributes to use according to the documentation. Let’s review some here:

  • mirai_series.is_unique: returns False in our case since we have duplicate IPs in this Series.
  • mirai_series.index: returns the index of the Series: RangeIndex(start=0, stop=764137, step=1).
  • mirai_series.hasnans: returns False in our case since there are no NaN values in this Series.
  • mirai_series.to_list(): transforms a Series to a list.
  • mirai_series.copy(): performs deep copy of your Series.

These attributes and functions are useful when you are trying to format or just “get to know” your data or convert it to a different data structure. You will review functions for statistical analysis and visualization in following subsections.

Data Frames

In contrast to the Series, a Data Frame has two dimensions. It is a labeled data structure that may include different types of data, similar to a spreadsheet or an SQL table. Let’s consider the Mirai dataset again. If you print the type of the object returned by the function read_csv, you will find out that it is a Data Frame:

import numpy as np
import pandas as pd

mirai_data = pd.read_csv("../data/Mirai.csv")
type(mirai_data)
pandas.core.frame.DataFrame

As you have probably suspected, a Data Frame is composed of a labeled list of Series. Each Series represents a column of the dataset, and in the case of our csv the labels are defined in the top row of the file. If you look at the Data Frame documentation, there is a wealth of attributes and functions. Let’s look at some useful attributes:

  • mirai_data.values: gives an array of values, i.e., all the records of your data frame:
array([[1, '2018-10-25 01:46:22.933899', '192.168.2.108', ...,
        '21074  >  80 [SYN] Seq=0 Win=5840 Len=0 MSS=1460', 21074.0,
        80.0],
       [2, '2018-10-25 01:46:22.933904', '192.168.2.108', ...,
        '20532  >  8280 [SYN] Seq=0 Win=5840 Len=0 MSS=1460', 20532.0,
        8280.0],
       [3, '2018-10-25 01:46:22.934426', '192.168.2.1', ...,
        'Destination unreachable (Network unreachable)', 21074.0, 80.0],
       ...,
       [764135, '2018-10-25 03:45:19.840611', 'Shenzhen_98:ee:fd', ...,
        'Who has 192.168.2.167? Tell 192.168.2.110', nan, nan],
       [764136, '2018-10-25 03:45:19.842369', 'Shenzhen_98:ee:fd', ...,
        'Who has 192.168.2.168? Tell 192.168.2.110', nan, nan],
       [764137, '2018-10-25 03:45:19.842464', '192.168.2.115', ...,
        'Standard query 0x1e08 AAAA north-america.pool.ntp.org.Speedport_W_724V_01011603_00_005',
        3570.0, 53.0]], dtype=object)
  • mirai_data.shape: gives you the dimensions of the data frame in the format (number of rows, number of columns) which in the case of the mirai_data is: (764137, 9).

In the next subsections we will review more Series and Data Frame functions for indexing, grouping, and statistical processing.

Indexing

Indexing refers to segmenting data and viewing it in pieces. The Pandas framework offers a variety of indexing mechanisms.

Slicing rows and columns

The following example gives the first three rows of the Data Frame:

mirai_data[0:3]
[Figure: first three rows of the data frame]

We can apply this to a Series as well:

mirai_series[0:3]
0    192.168.2.108
1    192.168.2.108
2      192.168.2.1
Name: Source, dtype: object

Getting data with labels

Another cool way to slice the data is by using the loc function:

mirai_data.loc[:, ["Time", "Source Port"]]
	                    Time	Source Port
0	2018-10-25 01:46:22.933899	21074.0
1	2018-10-25 01:46:22.933904	20532.0
2	2018-10-25 01:46:22.934426	21074.0
3	2018-10-25 01:46:22.934636	20532.0
4	2018-10-25 01:46:23.291054	NaN
...	...	...
764132	2018-10-25 03:45:19.837515	NaN
764133	2018-10-25 03:45:19.839396	NaN
764134	2018-10-25 03:45:19.840611	NaN
764135	2018-10-25 03:45:19.842369	NaN
764136	2018-10-25 03:45:19.842464	3570.0
764137 rows × 2 columns

We just sliced the data by Time and Source Port. The same function can be applied to a Series data structure as you have probably already guessed.

Selection by position

What if we want to get our data at a specific position? For this, Pandas offers the iloc function again applicable to Data Frames and Series data structures.

mirai_data.iloc[3]
No.                                                        4
Time                              2018-10-25 01:46:22.934636
Source                                           192.168.2.1
Destination                                    192.168.2.108
Protocol                                                ICMP
Length                                                    86
Info           Destination unreachable (Network unreachable)
Source Port                                          20532.0
Dest Port                                             8280.0
Name: 3, dtype: object

In this case we visualize record (row) 3 with all its labels (columns).

Another example of slicing is getting a specific set of rows and columns as seen below:

mirai_data.iloc[10:20, [2, 7, 3, 8]]
[Figure: rows 10 to 19 showing source/destination IPs and ports]

That gives us a good visual of source/destination IPs and ports. Note that you can change the order of your columns to make better sense of your data.

Boolean indexing

One of my favorite indexing mechanisms is to perform boolean operations. What if you want to look for unusually large packets? You can perform this operation with a single line of code:

mirai_data[mirai_data["Length"] > 512]
[Figure: packets with Length greater than 512]
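
Boolean conditions can also be combined with & and |, with each condition parenthesized. For example, assuming the Protocol column uses Wireshark-style labels such as TCP, here is a sketch of filtering for large TCP packets:

# Combine conditions: large packets that are also TCP.
mirai_data[(mirai_data["Length"] > 512) & (mirai_data["Protocol"] == "TCP")]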

Grouping

Grouping is a way to combine data for further processing. For example, we can group the packets by source IP to calculate the total number of bytes sent from a specific IP:

mirai_data.groupby("Source").sum()["Length"]
Source                          Length
0.0.0.0                         13108
00:ec:04:56:93:03               19860
192.168.2.1                   3721863
192.168.2.101                  627282
192.168.2.103                   98271
192.168.2.104                  381972
192.168.2.105                   33252
192.168.2.107                  116031
48:02:2e:01:83:15              120720
ASDElect_3a:eb:e8               16980
Arcadyan_c6:12:7b              207720
Cisco_28:d6:06                 302556
D-LinkIn_db:4a:e2                7620
Espressi_05:f2:c6               10980
Fn-LinkT_e0:fc:c9               22680
Foxconn_d5:63:5c                 3240
Giga-Byt_4b:99:14                4260
...

Or you can group by more than one label, for example source and destination port, to show that specific sources are targeting specific ports:

mirai_data.groupby(["Source", "Dest Port"]).sum()["Length"]
Source                     Dest Port     Length
0.0.0.0                    67           13108
192.168.2.1                53           736284
                           68           48944
                           80           711354
                           123          14396
                                         ...
192.168.2.196              123           2700
                           51009        84660
fe80::203b:a22a:5501:5006  5355           688
                           8083            83
fe80::f014:8275:ad82:d005  5353           214
Name: Length, Length: 3757, dtype: int64

An interesting observation: this simple calculation shows that port 53 is being used extensively as a destination. Since DNS was part of Mirai’s attack vector, our data confirms this.

Statistical Processing

Now let’s get to the real fun stuff. Statistics without pain!

The function describe() is an easy way to get descriptive statistics for all your data, such as count, mean, standard deviation, percentiles, and minimum and maximum values:

mirai_data.describe()

               No.	    Length	  Source Port	    Dest Port
count	764137.000000	764137.000000	18236.000000	18236.000000
mean	382069.000000	66.262442	    33065.793387	6482.248240
std	  220587.495661	19.751378	    18483.721156	7596.797988
min	  1.000000	    42.000000	    0.000000	    23.000000
25%	  191035.000000	60.000000	    21897.000000	53.000000
50%	  382069.000000	60.000000	    32761.000000	8080.000000
75%	  573103.000000	60.000000	    50861.000000	10240.000000
max	  764137.000000	1468.000000	  65267.000000	65267.000000

Note that the operation is performed only on the columns that are considered numeric, such as the increasing packet number, length, and ports, and not on the IPs. Obviously some of these statistics are not meaningful for ports and increasing packet numbers; however, they still give a quick way to count the values (count) and find the min and max port numbers.
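
If you also want a summary of the non-numeric columns, such as the IPs, describe can be pointed at object columns, where it reports the count, number of unique values, and most frequent entry:

mirai_data.describe(include="object")   # count, unique, top, freq per text column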

Of course, you can perform statistical functions on specific series that include only numbers:

mirai_data["Length"].cumsum()

0               60
1              120
2              206
3              292
4              352
            ...
764132    50633285
764133    50633345
764134    50633405
764135    50633465
764136    50633584
Name: Length, Length: 764137, dtype: int64

Cumulative sum is useful to see how fast the packet length is growing, but it is better to compute it grouped by meaningful network features, such as the source and destination IPs, right?

mirai_data.groupby(["Source", "Destination"]).cumsum()["Length"]
0               60
1               60
2               86
3              172
4               60
            ...
764132    33088080
764133    33088140
764134    33088200
764135    33088260
764136     1199342
Name: Length, Length: 764137, dtype: int64

Histogramming

Discretizing your packet captures is useful for calculations or visualization. Pandas offers a simple function, value_counts, to create histograms. A histogram is a representation of your data that uses buckets, i.e., value ranges, to count the number of values in your dataset that fall in each bucket. If you had to write the code to create a histogram yourself, you would have to perform the following steps:

  1. Find the minimum and maximum values in your dataset; together they define the range of values. If the values were ports, the range would be (1...65536).
  2. Split the range into a set of equal intervals. In our port example, we could split the range into intervals of 1, i.e., the intervals would be: [1,2), [2,3), ..., [65535, 65536), where [ indicates that the value must be greater than or equal to the bound and ) indicates that the value must be less than it.
  3. Finally, you would loop through all the values in your dataset and put them in buckets, i.e., count how many occurrences of each port (1, 2, ..., 65536) appear in your data.

All these steps can be performed in a single line of code:

mirai_data["Dest Port"].value_counts()
10240.0    68886
53.0       40447
80.0       24915
23.0        9466
8280.0      8195
...

Again the data confirms that the attack vector of Mirai was aimed heavily at DNS (port 53), with telnet (port 23) targeted as part of vulnerable IoT devices with default credentials.
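
If you prefer proportions over raw counts, for example to compare captures of different sizes, value_counts accepts a normalize parameter:

mirai_data["Dest Port"].value_counts(normalize=True)   # fraction of packets per port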

Time series statistics

You may want to do calculations with the dates in the specific pcap. Here is how you convert your timestamps to datetime objects:

timestamps = pd.to_datetime(mirai_data["Time"], format='%Y-%m-%d %H:%M:%S.%f')

In this example, we used the Series data structure labeled Time and a format string to convert the timestamps using the function to_datetime(). Now we can perform neat calculations, such as indexing with dates and computing statistics:

import datetime

timestamps = pd.to_datetime(mirai_data["Time"], format='%Y-%m-%d %H:%M:%S.%f')
mirai_data["Time"] = timestamps
print("Timestamps")
print(timestamps)

ref = pd.Timestamp('2018-10-25 01:50:36.406909')
print(f"ref: {ref}")
length_std = mirai_data[mirai_data["Time"] > ref]["Length"].std()
print(f"length_std: {length_std}")
Timestamps
0        2018-10-25 01:46:22.933899
1        2018-10-25 01:46:22.933904
2        2018-10-25 01:46:22.934426
3        2018-10-25 01:46:22.934636
4        2018-10-25 01:46:23.291054
                    ...
764132   2018-10-25 03:45:19.837515
764133   2018-10-25 03:45:19.839396
764134   2018-10-25 03:45:19.840611
764135   2018-10-25 03:45:19.842369
764136   2018-10-25 03:45:19.842464
Name: Time, Length: 764137, dtype: datetime64[ns]
ref: 2018-10-25 01:50:36.406909
length_std: 19.420279817398622

The timestamps are converted to datetime64 objects, as you can see in the output. The variable ref is one of these timestamps, and length_std is the standard deviation of the lengths of the packets recorded after the reference date, ref.
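
With datetime objects in the Time column, you can also bucket the capture into fixed intervals; here is a sketch of per-minute byte totals using resample:

# Total bytes per one-minute bucket; resample requires a datetime index.
per_minute = mirai_data.set_index("Time")["Length"].resample("1min").sum()
per_minute.head()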

Visualization

Time for some pretty graphs. First a graph of the length of packets of specific flows:

import matplotlib.pyplot as plt

sorted_flows = mirai_data.groupby(["Source", "Destination"]).sum()["Length"].sort_values(ascending=False)
sorted_flows[0:10].plot(kind='bar', alpha=0.75, rot=90, logy=True, color=['r', 'g', 'b', 'r', 'g', 'b', 'r', 'g', 'b', 'r'])
[Figure: total bytes of the top ten flows, log scale]

Another proof of how the Pandas framework can help you perform exploratory data analysis with cool visualizations. Here, I am plotting the total number of bytes sent by the top ten flows. The sorted_flows variable uses functions we have already reviewed to group the data by flow and sum the lengths; then the function sort_values puts these totals in decreasing order. We plot with a bar type, rotating the x-axis labels by 90 degrees so that we can see the flow source and destination. As you can see, the highest flow is an outlier that throws off our plot, but no worries: we set logy=True and we are good to go! The color property can add pretty colors to your plots.

Of course, visualizing histograms can be a great guide towards discovering anomalies:

mirai_data["Protocol"].value_counts().plot(kind='bar',alpha=0.75, rot=90, logy=True)
[Figure: packet counts per protocol, log scale]

Recap

You have had a pretty thorough introduction to the rich Python Pandas framework, with cool visualizations and a variety of functions and data manipulation methods. This is just the beginning; in the next two posts I will demonstrate advanced features, such as machine learning algorithms for classification and forecasting. Stay tuned!

-Xenia
