Published on May 16, 2020 by Carl Fugate
*(I found this draft that I started several years ago and decided it was time to finish it, as it is especially relevant given the rise of container networking.)*
I have been investigating strategies around Network Function Virtualization (NFV) and Software Defined Networking (SDN) for quite some time in my day job. As a Managed Service Provider, these technologies let us deploy and purchase network services just in time, aligning costs to the customer requesting them without huge capital outlays and unused capacity.
In the past, Service Providers would typically purchase very large devices and carve them up to service multiple clients (not unlike what we do with Hypervisors and Storage Virtualization today) using logical instances that provide separation from a Layer 3 perspective. Things like VRFs and Virtual Firewall instances give us the ability to provide secure Multi-tenancy while only deploying a single hardware cluster. The problem with this is that it costs upwards of several hundred thousand dollars per platform to get the scalability necessary to support potentially hundreds of instances.
Imagine having to spend that amount of money when you need just one more instance…
Fast forward to today, where we can largely ignore the hardware on the low end and deploy only the software we need on commodity x86 infrastructure. Just like back in the days of Check Point on Nokia, you can now deploy a virtual firewall from nearly any vendor on [insert favorite hypervisor] on [insert favorite server vendor]. There are tremendous gains to be had with this approach, especially where you are already offering Virtualized Server instances.
Do I never need to buy hardware again?
NFV shines in scale-out architectures; its weakness shows when you try to use it in a scale-up architecture. Imagine for a second that you need to support 100Gb of firewall traffic. If you can spread that across 10 x 10Gb NFV firewalls and scale out as you need more, it works great, but you cannot get 100Gb of performance from a single instance no matter how powerful the host.
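The scale-out pattern above usually relies on hashing each flow to one member of the pool, so that every packet of a connection hits the same instance and stateful inspection still works. Here is a minimal sketch in Python; the instance names, pool size, and hashing scheme are invented for illustration, not any particular vendor's implementation:

```python
import hashlib

# Hypothetical pool of 10 virtual firewall instances, each good for ~10Gb.
FIREWALLS = [f"fw-{i:02d}" for i in range(10)]

def pick_firewall(src_ip: str, dst_ip: str, proto: str,
                  src_port: int, dst_port: int) -> str:
    """Map a flow's 5-tuple to one firewall instance.

    Hashing keeps every packet of a flow on the same instance, so
    stateful inspection still works; aggregate load spreads across
    the pool instead of requiring one scale-up box.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    idx = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(FIREWALLS)
    return FIREWALLS[idx]

# Every packet of the same flow lands on the same instance:
assert pick_firewall("10.0.0.5", "192.0.2.9", "tcp", 44321, 443) == \
       pick_firewall("10.0.0.5", "192.0.2.9", "tcp", 44321, 443)
```

The trade-off is exactly the one described above: a single elephant flow still cannot exceed one instance's capacity, because hashing never splits a flow.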
Scale out is the norm for Cloud-scale computing and has become more relevant in the Enterprise Datacenter with the adoption of "on-prem" Private Cloud. As an example, VMware has been using the Distributed Virtual Switch (DVS) for years now, and with the introduction of NSX it is now required if you want to deploy it. (more about this in my next post…)
Scale out causes problems
One of the reasons, in my opinion, that NFV never really took off in the Enterprise is that distributed management causes significant operational overhead. The fact remains that it is easier to manage a single device than several spread across your environment. This can be seen in how a majority of Enterprise edges have backhauled traffic to centralized Data Centers for years, whether Internet egress traffic or application/file servers. Cloud and Service Providers (SPs) are solving this by investing in Management and Orchestration (MANO) solutions that allow easy deployment and centralized management. These providers have an advantage in that their designs are relatively homogeneous, whereas Enterprise environments generally are not, which means that Orchestration solutions have to be highly customized.
Things to Consider: East/West vs North/South Flows
The management of distributed NFV appliances is not just an operations overhead; it also brings engineering challenges. How do you align traffic and workloads to the appropriate appliance? Balancing load across a number of NFV appliances while keeping your application flows from traversing multiple hosts becomes a problem. If, for example, you wanted to swap a firewall for a virtual one hosted on some Hypervisor platform, you would first need to look at where the flows come from and how much traffic is passing through it. You would then need to determine whether a single firewall instance/host can handle this or whether you will need to divide it up. Deciding how to divide the traffic, if required, could be easy or near impossible depending on the type of flows in your network.
To be sure, planning for a perfect level of balance and symmetry can lead to design paralysis. Still, there is real benefit in understanding what the flow requirements are, especially as the network and your applications evolve and change; an imperfect split tends to show up as sub-optimal routing, with flows traversing hosts they never needed to touch.
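One common mitigation for the asymmetry problem is to hash flows symmetrically, so that both directions of a connection land on the same stateful appliance. A small sketch of the idea follows; the appliance names and pool size are placeholders, not a real deployment:

```python
import hashlib

# Hypothetical pool of distributed virtual appliances.
APPLIANCES = ["nfv-fw-a", "nfv-fw-b", "nfv-fw-c", "nfv-fw-d"]

def symmetric_pick(ip_a: str, port_a: int, ip_b: str, port_b: int,
                   proto: str = "tcp") -> str:
    """Choose an appliance so that A->B and B->A hash identically.

    Sorting the two endpoints before hashing removes direction from
    the key; without this, return traffic can land on a different
    appliance that holds no state for the flow and drops it.
    """
    end1, end2 = sorted([(ip_a, port_a), (ip_b, port_b)])
    key = f"{end1}|{end2}|{proto}".encode()
    idx = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(APPLIANCES)
    return APPLIANCES[idx]

# Forward and return traffic map to the same appliance:
assert symmetric_pick("10.0.0.5", 44321, "192.0.2.9", 443) == \
       symmetric_pick("192.0.2.9", 443, "10.0.0.5", 44321)
```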
Why is NFV important?
One of the things that I have taken away from studying how Cloud-Scale companies deploy infrastructure is to align functionality and services to the application that needs them. In a lot of Enterprise network designs we have to build to support the one-off application or service instead of building for the 90% use cases. I remember network refreshes in the past where the Datacenter network platforms were selected only because of the need to support WCCP…
For this reason, I am a huge supporter of Pod-style network designs and container-based applications. If you have a legacy system that requires some protocol that the rest of the environment doesn’t need, build an island for it instead of dragging that debt into your new design. This will allow you to take advantage of new features more quickly and more easily remove old protocols and network configuration when you retire legacy applications and services.
With container-based applications we can take this a step further by pushing the functions that an application might consume from the network into the application stack itself. With this approach, end-to-end services for a particular application become independent of the larger network.
Shifting Network Functions into the Application
This means that if you need a new feature, need to upgrade or do any type of operational maintenance, you can do it without impacting the other applications on the network. It also provides a way to easily retire network configuration when an application is decommissioned, because the configuration is self-contained inside the application stack. The result is significantly less operational overhead and a reduction in technical debt over time.
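As a toy illustration of what "self-contained" can mean in practice, an application could carry its own ingress policy as code (or as a sidecar) that is deployed and retired with it, instead of as rules on a shared network firewall. The policy contents and CIDR ranges below are invented for the example:

```python
import ipaddress

# Hypothetical per-application ingress policy, shipped inside the
# application stack rather than configured on a shared network device.
# Retiring the application retires the policy with it.
APP_POLICY = {
    "allowed_sources": ["10.20.0.0/16", "192.0.2.0/24"],  # assumed ranges
    "allowed_ports": {443, 8443},
}

def permit(src_ip: str, dst_port: int) -> bool:
    """Return True if this connection is allowed by the app's own policy."""
    addr = ipaddress.ip_address(src_ip)
    in_range = any(addr in ipaddress.ip_network(net)
                   for net in APP_POLICY["allowed_sources"])
    return in_range and dst_port in APP_POLICY["allowed_ports"]

# Allowed source and port passes; anything else is refused:
assert permit("10.20.1.5", 443) is True
assert permit("203.0.113.7", 443) is False
```

The point is not the filtering logic itself but where it lives: no orphaned rules are left behind on a shared device when the application goes away.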
Taking services out of the network reduces complexity, which can lead to significant increases in reliability, performance and operational supportability. These gains in turn allow network operators to spend more time focusing on improving the quality of the network and supporting business initiatives and challenges.