Network troubleshooting is a common automation use case. Network outages are costly and time-consuming, and they often require network engineers to log in to network equipment and investigate issues manually. Working on network operations teams, I quickly noticed that troubleshooting network problems follows a playbook of repeatable steps, which is the rationale for automating network troubleshooting with Ansible.
Use Case – BGP
Troubleshooting Layer 3 connectivity tends to lead an operations engineer to jump into multiple routers and check routing. Let’s say internet access has been lost from the WAN edge. If I were troubleshooting this, my instincts would tell me to go to my edge router(s) and check the BGP neighbor going towards my ISP.
east-rtr#show ip bgp summary
<...output omitted...>
Neighbor        V           AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
4.4.4.4         4          400       0       0        0    0    0 08:11    Idle
From the output of show ip bgp summary, the issue can be determined: BGP is down toward the ISP. This is a simplified example with one router and one WAN connection, but what happens if you have 10, 15, or more BGP relationships to check? Logging in to each router manually to check the status of BGP is costly. How can Ansible help?
Checking BGP with Ansible
Here is a sequential listing of what the Ansible playbook is doing.
Run show ip bgp summary on the ISP routers and print the output.
Use ansible-napalm to get BGP facts on the neighbors for easy reporting.
Use a Jinja2 template to create an easy-to-consume report of BGP neighbor status for each device.
Assemble all the device reports into a single overview report.
Iterate through the neighbors and if a neighbor is down, attempt to ping the destination IP to verify Layer 3 reachability using napalm-ping.
Pre-req
There needs to be a valid Ansible inventory, either a static inventory file or dynamic inventories utilizing an existing SoT (Source of Truth). For demonstration purposes a static file will be used.
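A static inventory for this walkthrough might look like the sketch below. The group name isp_routers and the hostnames east-rtr and west-rtr come from the playbook and output in this post; the file name inventory.yml, the ansible_host addresses (taken from the 192.0.2.0/24 documentation range), and the ansible_network_os value are assumptions made for illustration.

```yaml
# inventory.yml (hypothetical file name)
all:
  children:
    isp_routers:
      hosts:
        east-rtr:
          ansible_host: 192.0.2.1   # assumed management address
        west-rtr:
          ansible_host: 192.0.2.2   # assumed management address
      vars:
        ansible_network_os: ios     # assumed; the playbook uses ios_command
```

With a file like this, the playbook could be pointed at it with the -i flag, for example: ansible-playbook -i inventory.yml pb.yml -u ntc -k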
Create a simple playbook to execute show ip bgp summary on all of the routers in the group called isp_routers.
---
- name: "PLAY:1 - GET BGP SUMMARY"
  gather_facts: False
  connection: "network_cli"
  hosts: "isp_routers"
  tasks:
    - name: "TASK:1 - 'SHOW IP BGP SUMMARY'"
      ios_command:
        commands: "show ip bgp summary"
      register: "output_ios"

    - name: "TASK:2 - PRINT BGP OUTPUT"
      debug:
        msg: "{{ output_ios.stdout[0] }}"
Running the playbook results in the following output.
▶ ansible-playbook pb.yml -u ntc -k
SSH password:

PLAY [PLAY:1 - GET BGP SUMMARY] ************************************************

TASK [TASK:1 - 'SHOW IP BGP SUMMARY'] ******************************************
ok: [east-rtr]
ok: [west-rtr]

TASK [TASK:2 - PRINT BGP OUTPUT] ***********************************************
ok: [east-rtr] => {
    "msg": "BGP router identifier 1.1.1.1, local AS number 100\nBGP table version is 416, main routing table version 416\n28 network entries using 6944 bytes of memory\n41 path entries using 5576 bytes of memory\n8/7 BGP path/bestpath attribute entries using 2304 bytes of memory\n4 BGP AS-PATH entries using 128 bytes of memory\n0 BGP route-map cache entries using 0 bytes of memory\n0 BGP filter-list cache entries using 0 bytes of memory\nBGP using 14952 total bytes of memory\nBGP activity 124/96 prefixes, 232/191 paths, scan interval 60 secs\n32 networks peaked at 23:40:21 Jan 7 2021 UTC (6w5d ago)\n\nNeighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd\n4.4.4.4 4 400 0 0 0 0 0 08:21 Idle"
}
ok: [west-rtr] => {
    "msg": "BGP router identifier 2.2.2.2, local AS number 100\nBGP table version is 579, main routing table version 579\n28 network entries using 6944 bytes of memory\n41 path entries using 5576 bytes of memory\n8/7 BGP path/bestpath attribute entries using 2304 bytes of memory\n4 BGP AS-PATH entries using 128 bytes of memory\n0 BGP route-map cache entries using 0 bytes of memory\n0 BGP filter-list cache entries using 0 bytes of memory\nBGP using 14952 total bytes of memory\nBGP activity 158/130 prefixes, 267/226 paths, scan interval 60 secs\n32 networks peaked at 23:40:21 Jan 7 2021 UTC (6w5d ago)\n\nNeighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd\n8.8.8.8 4 400 0 0 0 0 0 18:52 1"
}

PLAY RECAP *********************************************************************
east-rtr                   : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
west-rtr                   : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
At this point you have a single pane to quickly check all the BGP neighbors; however, it’s hard to read the output. To take this playbook to the next level, we can easily take command output and create structured data using one of the various cli parsing modules.
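As one possible way to turn the CLI output into structured data, the sketch below uses the cli_parse module from the ansible.utils collection with the NTC Templates parser. This is an illustration only and not the approach the rest of this post relies on (the playbook here uses NAPALM). It assumes the ansible.utils and ansible.netcommon collections and the ntc-templates Python package are installed; the fact name bgp_summary_parsed is hypothetical.

```yaml
    - name: "SKETCH - PARSE 'SHOW IP BGP SUMMARY' INTO STRUCTURED DATA"
      ansible.utils.cli_parse:
        command: "show ip bgp summary"
        parser:
          name: "ansible.netcommon.ntc_templates"   # assumes ntc-templates is installed
          os: "cisco_ios"                           # platform name in ntc-templates format
        set_fact: "bgp_summary_parsed"              # hypothetical fact name

    - name: "SKETCH - PRINT THE PARSED RESULT"
      debug:
        var: bgp_summary_parsed
```

The parsed result is a list of dictionaries, one per neighbor row. The exact field names depend on the installed template version, so it is worth printing the fact once before writing conditions against it.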
With multiple devices in our inventory group, a file per device will be written. Parsing through multiple files can slow down the time to resolution; therefore, merging all these files together into one all-encompassing report will be done in the next task.
The Ansible assemble module will be used to merge all the reports together.
    - name: "TASK:3 - ASSEMBLE REPORTING FROM HOST DETAILS"
      assemble:
        src: "./build"                 # Directory with files to merge.
        dest: "./reports/report.txt"   # Merged output filename.
Once TASK:3 executes, one report is generated with the following output:
After the reachability check is completed, print the results for the DOWN neighbors.
    - name: "TASK:5 - PRINT PING RESULTS FOR DOWN NEIGHBORS"
      debug:
        msg: "{{ item['ping_results'] }}"
      loop: "{{ neighbor_down['results'] }}"
      when: "item['ping_results'] is defined"
Valuable troubleshooting data was gathered by running this playbook. A BGP neighbor is down on east-rtr. Details about all neighbors were also collected, including: enabled state, current neighbor state, and sent/received route counts. Finally, for any DOWN neighbors a reachability check using ping was performed. Most importantly, all this data was assembled across all our isp_routers in just seconds. This was still a simplified example with only two routers, but extrapolating this across tens, hundreds, or more routers is very powerful.
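To give a feel for the neighbor details mentioned above, here is a minimal sketch of a task that prints the state of every neighbor straight from the registered NAPALM facts. It reuses the bgp variable and the fact path from the full playbook; the task itself is an illustrative addition and not part of the original playbook. The keys remote_as, is_enabled, and is_up are fields of the NAPALM bgp_neighbors getter.

```yaml
    - name: "SKETCH - PRINT STATE FOR EVERY BGP NEIGHBOR"
      debug:
        # The folded scalar (>-) joins these lines into a single message.
        msg: >-
          neighbor={{ item['key'] }}
          remote_as={{ item['value']['remote_as'] }}
          enabled={{ item['value']['is_enabled'] }}
          up={{ item['value']['is_up'] }}
      loop: "{{ bgp['ansible_facts']['napalm_bgp_neighbors']['global']['peers'] | dict2items }}"
```

The dict2items filter turns the peers dictionary into a list of key/value pairs, so each item carries the neighbor IP as item['key'] and its details as item['value'].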
It is important to mention that additional tasks could be added to this playbook to troubleshoot further, for example:
Check the routing to the neighbor IP.
Grab the next-hop IP from the route entry.
Verify that the ARP table for the next-hop IP has a MAC entry.
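The first of these follow-up checks could be sketched as shown below, assuming the tasks are appended to PLAY:2 so that the bgp variable is available. The task names and the route_check variable are hypothetical. Extracting the next-hop IP and checking the ARP table depend on how the route output is parsed, so those steps are left out of this sketch.

```yaml
    - name: "SKETCH - CHECK THE ROUTE TO EACH DOWN NEIGHBOR"
      ios_command:
        commands:
          - "show ip route {{ item['key'] }}"
      loop: "{{ bgp['ansible_facts']['napalm_bgp_neighbors']['global']['peers'] | dict2items }}"
      when: "not item['value']['is_up']"
      register: "route_check"        # hypothetical variable name

    - name: "SKETCH - PRINT ROUTE LOOKUPS FOR DOWN NEIGHBORS"
      debug:
        msg: "{{ item['stdout'][0] }}"
      loop: "{{ route_check['results'] }}"
      when: "item['stdout'] is defined"   # skipped (UP) neighbors have no stdout
```

This mirrors the pattern used by TASK:4 and TASK:5: loop over the peers, act only on neighbors that are down, register the results, and then print only the entries that actually ran.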
Full Playbook
- name: "PLAY:1 - GET BGP SUMMARY"
  gather_facts: False
  connection: "network_cli"
  hosts: "isp_routers"
  tasks:
    - name: "TASK:1 - 'SHOW IP BGP SUMMARY'"
      ios_command:
        commands: "show ip bgp summary"
      register: "output_ios"

    - name: "TASK:2 - PRINT BGP OUTPUT"
      debug:
        msg: "{{ output_ios.stdout[0] }}"

- name: "PLAY:2 - USE NAPALM BGP FACTS"
  gather_facts: False
  connection: "network_cli"
  hosts: "isp_routers"
  tasks:
    - name: "TASK:1 - 'GET BGP FACTS'"
      napalm_get_facts: filter="bgp_neighbors"
      register: "bgp"

    - debug: var=bgp

    - name: "TASK:2 - 'GENERATE REPORT'"
      template:
        src: "./templates/bgp_report.j2"
        dest: "./build/{{ inventory_hostname }}.txt"

    - name: "TASK:3 - ASSEMBLE REPORTING FROM HOST DETAILS"
      assemble:
        src: "./build"
        dest: "./reports/report.txt"

    - name: "TASK:4 - PING BGP NEIGHBORS THAT ARE DOWN"
      napalm_ping:
        hostname: "{{ inventory_hostname }}"
        username: "{{ ansible_user }}"
        password: "{{ ansible_password }}"
        dev_os: "{{ ansible_network_os }}"
        destination: "{{ item['key'] }}"
      with_dict: "{{ bgp['ansible_facts']['napalm_bgp_neighbors']['global']['peers'] }}"
      when: "not item['value']['is_up']"
      register: "neighbor_down"

    - name: "TASK:5 - PRINT PING RESULTS FOR DOWN NEIGHBORS"
      debug:
        msg: "{{ item['ping_results'] }}"
      loop: "{{ neighbor_down['results'] }}"
      when: "item['ping_results'] is defined"
Conclusion
BGP troubleshooting is one of many operational playbooks that could be run when diagnosing connectivity issues. Applying these same steps to other use cases can greatly improve MTTR on network issues and outages. Furthermore, these playbooks can be extended with a module that updates ITSM ticket notes, or reused as part of an existing daily network readiness task.
Does this all sound amazing? If you want to know more about how Network to Code can help you do this, reach out to our sales team. If you want to help make this a reality for our clients, check out our careers page.