Ten Problems I Know about Your Fortune 500 Network
My role at Network to Code involves speaking with many of our customers’ network teams, and in doing so I’ve noticed that they share far more commonalities than differences.
Note: Before you feel the need to proclaim “not my Fortune 500 org” or “we solved these problems already,” feel free to ignore the clickbait-y title and realize there are no absolutes.
Let’s dive right into that top 10 list!
- There is a data problem and no one trusts the data
- There is no shortage of tools
- There are configuration standards but no one uses them
- There is no way to manage the firewall rules
- There is no ownership of the firewall rules
- There are a ton of LB requests
- There are planned circuit outages that are simply ignored
- There is no way to keep up with OS upgrades
- There is no way to automate my job, it’s meetings and emails
- There are no workflows within my team
1. There is a data problem and no one trusts the data
Face it, you have your spreadsheets (it’s practically a network-engineering trope at this point) that you save local copies of; you take data from one system and copy it to the next; and the data in the IPAM isn’t any good (but you also don’t update it yourself). Every system owner wants “their system” to own the data and doesn’t consider the other systems.
There is no automated synchronization, and chances are you have ServiceNow, which per organizational direction should be the SoT (source of truth). Except the ServiceNow team is probably only talking about being the SoT for inventory, not for configuration management (VLANs, interfaces, routing protocols, etc.). To further complicate things, chances are you treat the discovered state (also called actual state or observed state) as if it were the intended state; if that distinction doesn’t make sense to you, take a few minutes to research the difference.
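To make the intended-vs-discovered distinction concrete, here is a minimal sketch in Python. The VLAN data is purely illustrative; in practice the intended state would come from your SoT and the discovered state from the device itself.

```python
def diff_state(intended: dict, discovered: dict) -> dict:
    """Compare intended state (from the SoT) against discovered state (from the device)."""
    missing = {k: v for k, v in intended.items() if k not in discovered}
    unexpected = {k: v for k, v in discovered.items() if k not in intended}
    mismatched = {
        k: {"intended": intended[k], "discovered": discovered[k]}
        for k in intended.keys() & discovered.keys()
        if intended[k] != discovered[k]
    }
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}

# Illustrative data: VLAN ID -> name.
intended_vlans = {10: "users", 20: "voice", 30: "printers"}
discovered_vlans = {10: "users", 20: "VOICE_OLD", 99: "native"}

drift = diff_state(intended_vlans, discovered_vlans)
```

If you had treated the discovered state as intended, VLAN 99 and the stale name on VLAN 20 would have been silently blessed, and the missing VLAN 30 would never be flagged.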
Data gets “thrown over the fence” to the network team, and there is no shared ownership of this data. This is especially true for firewall rules, where inaccurate requests are routinely made and the requesters believe that the ownership of the rule belongs to the person who put the rule in.
2. There is no shortage of tools
You have every tool, you can’t get rid of any tool for good, and there is always the next tool. If you were to sit down with all of your network engineers, it’s a safe bet that many of them do not know how to log in to many of the tools you are maintaining, and often paying for.
The data is there, but it takes an expert to understand and correlate between all of the various tools.
Whenever a new device is added, it should be added to all the tools. But inevitably a tool is missed, and the process is never fully followed. The tools are not kept up to date, and they accumulate so many false positives, known-down devices, etc., that they are considered unreliable. The tool may seem no good, when in reality it’s another data issue.
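A quick way to surface this particular data issue is to audit every tool’s inventory against the union of all of them. A minimal sketch, where the tool names and device lists are made up and would really come from each tool’s API:

```python
# Hypothetical inventories pulled from each tool; names are illustrative only.
tool_inventories = {
    "monitoring": {"rtr1", "rtr2", "sw1"},
    "ipam": {"rtr1", "sw1", "sw2"},
    "config_backup": {"rtr1", "rtr2", "sw1", "sw2"},
}

# Every device known to at least one tool.
all_devices = set().union(*tool_inventories.values())

# For each device, list the tools it is missing from.
gaps = {
    device: sorted(tool for tool, inv in tool_inventories.items() if device not in inv)
    for device in sorted(all_devices)
}
gaps = {device: missing for device, missing in gaps.items() if missing}
```

Run periodically, a report like this turns “the tool is unreliable” into a concrete, fixable list of inventory gaps.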
3. There are configuration standards but no one uses them
The standards exist, they were published, but nobody uses them; and worse yet, no one has told the owner of the standards. When the owner of the standards finds out, they simply can’t understand why they are not being used. There are likely little inaccuracies in the standards that don’t work in production, but again no one has updated the standards.
Instead, every network engineer has independently come up with their own process, which is generally described as something like “I have a notepad on my desktop that I just use” or “I go to this one site I know is working and done correctly.” In doing so you have unknowingly created a digital signature of “who configured this device”, with things like preferring underscores over dashes, or CamelCase vs snake_case. In the end “it’s just easier.”
4. There is no way to manage the firewall rules
There are too many reasons to create a firewall request: zero trust, new data centers, new applications, expanded applications, etc. The configuration standards aren’t set, and there isn’t any documentation about when to optimize the group objects.
Everyone knows what a “good rule” is, except no one has written it down and everyone has a different interpretation of what is a good rule. Infosec checks the rules, but no one knows what they are checking for either. The term “least privilege” is thrown around a lot, but that has made the whole rule-set so complicated that the security posture is made of Swiss cheese.
The form is an Excel spreadsheet, but anything can be included in it, and more time is spent figuring out what the actual request is for than the other parts of the process combined. The application owners don’t understand why it’s so complicated and simply want it to work. The network engineers don’t understand how the application owners don’t know how their own system works.
5. There is no ownership of the firewall rules
This is true not only on a per-rule basis but in a philosophical manner as well.
Within an organization, the requester presumes that the firewall team owns the rule, while the firewall team presumes that the requester owns the rule and that they were just the implementer. Amazingly, no one has ever had this conversation.
There is no understanding of who owns any specific rule in a firewall. But if you ask anyone, they will say this is tracked in their ITSM tool. However, that only scales to finding a few rules by reverse engineering a rule based on guessing “what you think the request would look like”, not to actually answering “who owns this rule today”. This is only compounded by the fact that the initial requester may have moved roles or jobs altogether.
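One way out of the reverse-engineering trap is to make ownership first-class data on the rule itself, with a current owner distinct from the historical requester and a forced review date. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class FirewallRule:
    rule_id: str
    description: str
    owner: str          # the accountable party today, not the original requester
    requested_by: str   # kept only as a historical record
    review_by: date     # forces periodic re-certification of the rule

# Illustrative data.
rules = [
    FirewallRule("fw-0042", "app tier to db tier", owner="payments-team",
                 requested_by="jdoe", review_by=date(2024, 12, 31)),
]

def rules_due_for_review(rules: list, today: date) -> list:
    """Rules whose owner must re-certify or the rule gets removed."""
    return [r.rule_id for r in rules if r.review_by <= today]
```

With this shape, “who owns this rule today” is a field lookup, and the original requester changing jobs simply triggers an ownership update at the next review.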
6. There are a ton of LB requests
It’s not as bad as firewall requests, but there are a ton of LB (load balancer) requests. Far too many configuration options are requested, most of which are likely not really needed.
Many of the same issues with firewall requests exist, such as configuration standards not being well defined or followed and configuration objects being difficult to maintain.
7. There are planned circuit outages that are simply ignored
There are too many circuits, and all the outage notifications get ignored—it is too hard to react to them anyway. When a circuit goes down, the first thing to do is check whether there is already a scheduled maintenance, which takes time since the circuit ID that is on record may not be the circuit ID that the vendor sent the maintenance notification on.
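The circuit-ID mismatch is the part that is most amenable to automation: keep a table of known vendor aliases per internal circuit and check incoming notices (or outages) against it. A minimal sketch, where the IDs and alias table are made up:

```python
from datetime import datetime, timezone

# Internal circuit ID -> known vendor aliases for the same circuit (illustrative).
circuit_aliases = {
    "CKT-001": {"CKT-001", "VZ/123456//PHL"},
    "CKT-002": {"CKT-002", "LUMEN-998877"},
}

# Internal circuit ID -> (window start, window end), parsed from vendor notices.
scheduled_maintenance = {
    "CKT-001": (datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc),
                datetime(2024, 5, 1, 6, 0, tzinfo=timezone.utc)),
}

def in_maintenance(vendor_circuit_id: str, at: datetime) -> bool:
    """True if the circuit the vendor named is inside a scheduled maintenance window."""
    for internal_id, aliases in circuit_aliases.items():
        if vendor_circuit_id in aliases:
            window = scheduled_maintenance.get(internal_id)
            return bool(window and window[0] <= at <= window[1])
    return False
```

When a circuit alarm fires, the first triage question (“is this a known maintenance?”) becomes a lookup instead of a scramble through old emails.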
8. There is no way to keep up with OS upgrades
With the number of security vulnerabilities published in a given year, you would need a dedicated team just to manage the OS upgrades. Also, the OS upgrade itself is not that hard; it’s everything else around the OS upgrade that is hard. That includes defining a healthy device before a change, defining a healthy device after a change, scheduling, reducing production downtime, etc., all of which add to the actual challenge of performing OS upgrades.
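“Define a healthy device before and after the change” can be expressed as a pre/post snapshot and a comparison. A minimal sketch, assuming made-up snapshot fields; a real snapshot would capture BGP neighbors, interface states, route counts, etc. from your automation stack:

```python
def snapshot(device_facts: dict) -> dict:
    """Capture the values that define a 'healthy' device (fields are illustrative)."""
    return {
        "bgp_neighbors_established": device_facts["bgp_neighbors_established"],
        "interfaces_up": device_facts["interfaces_up"],
        "route_count": device_facts["route_count"],
    }

def compare(pre: dict, post: dict, route_tolerance: int = 10) -> list:
    """Return a list of health regressions introduced by the change."""
    problems = []
    if post["bgp_neighbors_established"] < pre["bgp_neighbors_established"]:
        problems.append("BGP neighbors lost")
    if post["interfaces_up"] < pre["interfaces_up"]:
        problems.append("interfaces went down")
    if abs(post["route_count"] - pre["route_count"]) > route_tolerance:
        problems.append("route count shifted beyond tolerance")
    return problems
```

An empty list from `compare()` is the machine-checkable definition of “the upgrade succeeded,” which is what makes the surrounding scheduling and rollback decisions automatable.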
9. There is no way to automate my job, it’s meetings and emails
There is entirely too much time spent in meetings and responding to emails. “I don’t work in Ops; you can’t help me automate what I do.” For example:
- Running through daily checks to present at the NOC handoff meeting
- Reading through dozens of potential outage emails
- Taking project requests from customers
I won’t spend time solving each of these here, but safe to say, the claim tends not to be true, and much of this work can be automated.
10. There are no workflows within my team
Especially within the engineering and architecture teams, the work is project-based and “we don’t believe that we really have workflows.” However, in every case there are always workflows, and only when you understand the workflows can you identify problems and potentially even automate.
It is a pervasive issue: we as network engineers don’t take a workflow-centric view of things and attempt to fix everything without understanding the basics of the workflow.
Oftentimes, framing the question as “if you were going on vacation for 3 months, how would you document what you do for a new member of the team?” is a good start.
Conclusion
For nearly every one of these problems, we at Network to Code have been building code, methodology, and specific processes, but I didn’t want to discuss the solutions before identifying the problems. Perhaps if there is some interest, I can follow up with strategies to help solve many of these.
-Ken