ACI Terraform scalability

The context

This blog post explains a scalability issue I faced while using Terraform to deploy a Cisco ACI fabric. Terraform was initially developed to deploy cloud infrastructure. As everything is virtual in a cloud environment, it is relatively easy to organize your Terraform infrastructure per application and keep each environment small. As a best practice, it is recommended to keep workspaces as small as possible to avoid performance issues: Terraform works on a dependency graph, so the more resources and dependencies you have, the longer it takes to compute a plan and push the configuration. To come back to the issue and give a bit of context, let's take the fabric below as an example.

  • The Cisco ACI fabric is in network-centric mode (1 EPG = 1 BD = 1 VLAN).
  • The fabric has 6 leaves and 2 spines.
  • Each leaf has 48 ports.
  • 32 ports are configured on each leaf to connect hypervisors.
  • 100 VLANs are configured on each port.

We are using a small Terraform Enterprise agent with 2 GB of RAM and 2 vCPUs to push the configuration.

The problem

When provisioning ACI, two kinds of resources need to be monitored carefully:

  • The resources tied to VLANs, because creating 1 VLAN requires 3 resources:
    • aci_application_epg
    • aci_bridge_domain
    • aci_epg_to_domain
  • The static path bindings:
    • aci_epg_to_static_path

Resources tied to VLANs

For the first kind of resource, if you have 300 VLANs you end up with at least 900 resources. Once the fabric is deployed, the growth in resources is usually slow and steady: every new VLAN adds ~3 resources and will not make a big difference straight away.
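To make the pattern concrete, here is a minimal sketch of the 3 resources behind every VLAN, assuming the CiscoDevNet/aci provider; aci_tenant.prod, aci_application_profile.prod and var.vmm_domain_dn are illustrative names for objects defined elsewhere:

```hcl
# Minimal sketch: the 3 resources behind every VLAN in network-centric mode.
# aci_tenant.prod, aci_application_profile.prod and var.vmm_domain_dn are
# illustrative and assumed to be defined elsewhere.
variable "vlans" {
  type    = set(string)
  default = ["vlan100", "vlan101"] # 300 entries => ~900 resources
}

resource "aci_bridge_domain" "bd" {
  for_each  = var.vlans
  tenant_dn = aci_tenant.prod.id
  name      = "${each.key}_bd"
}

resource "aci_application_epg" "epg" {
  for_each               = var.vlans
  application_profile_dn = aci_application_profile.prod.id
  name                   = "${each.key}_epg"
  relation_fv_rs_bd      = aci_bridge_domain.bd[each.key].name
}

resource "aci_epg_to_domain" "domain" {
  for_each           = var.vlans
  application_epg_dn = aci_application_epg.epg[each.key].id
  tdn                = var.vmm_domain_dn
}
```

Each entry in var.vlans fans out into three resources, which is how 300 VLANs end up as ~900 resources.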

Static path binding

The second kind of resource is much more problematic: you need one static path binding resource per VLAN per port. In our case that would be 19200 resources, and that's a lot !!! (100 VLANs x 32 ports x 6 leaves = 19200 static path binding resources). I've managed to push approximately 7000 resources, and it was taking more than 1 hour to plan and 1 hour to apply with a Terraform Enterprise agent. Every time you add a new server, unless it has a single interface, you add at least 200 resources.
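For illustration, here is a hedged sketch of how such bindings are typically generated with setproduct; the leaf IDs, port range and VLAN IDs match the example fabric, and the EPG references reuse the illustrative "vlan<ID>" keys from the previous sketch:

```hcl
# Hedged sketch: one aci_epg_to_static_path per (leaf, port, VLAN) combination.
# Leaf IDs, ports and VLAN IDs match the example fabric; the EPG keys reuse
# the "vlan<ID>" naming from the previous sketch.
locals {
  leaves = [101, 102, 103, 104, 105, 106]
  ports  = range(1, 33)    # 32 server-facing ports per leaf
  vlans  = range(100, 200) # 100 VLANs

  # 6 leaves x 32 ports x 100 VLANs = 19200 combinations
  bindings = {
    for t in setproduct(local.leaves, local.ports, local.vlans) :
    "leaf${t[0]}-eth1/${t[1]}-vlan${t[2]}" => { leaf = t[0], port = t[1], vlan = t[2] }
  }
}

resource "aci_epg_to_static_path" "binding" {
  for_each           = local.bindings
  application_epg_dn = aci_application_epg.epg["vlan${each.value.vlan}"].id
  tdn                = "topology/pod-1/paths-${each.value.leaf}/pathep-[eth1/${each.value.port}]"
  encap              = "vlan-${each.value.vlan}"
  mode               = "regular" # tagged traffic; "untagged" and "native" also exist
}
```

Every element of local.bindings becomes a node in Terraform's dependency graph, which is why plan and apply times degrade so quickly at this scale.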

The workaround

What options do we have to overcome this major roadblock?

  • The first option would be to throw more CPU and RAM at the agent and tune the parallelism setting (the -parallelism flag, which defaults to 10, or the TFE_PARALLELISM environment variable on Terraform Enterprise agents). I tried doubling the CPU and RAM on the server, but the improvement wasn't significant.

  • The second idea that comes to mind is to split the infrastructure into smaller workspaces, but that is very difficult when working with ACI in a network-centric approach. Even if you manage to split workspaces per business unit, your biggest business unit may have 100 VLANs. You can then try to split each business unit per application, but in the end you probably don't want the configuration of a single physical port to be split across 10+ workspaces.

  • The third option could be to bind the EPGs to the AAEP instead of using static path bindings (see the sketch after this list). For a greenfield deployment, and if you don't mind the constraint [1], it would be the best approach. For a brownfield deployment, it might be difficult to change the design and configuration of the production environment.

  • The fourth option would be to not manage the static path bindings through Terraform at all, and to use some other automation magic to take care of this bit with some CI/CD pipeline unicorn dust :D
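To illustrate the third option, here is a minimal sketch of binding EPGs to the AAEP with the aci_epgs_using_function resource; aci_attachable_access_entity_profile.servers is an illustrative name for an AAEP assumed to exist, and the EPG references reuse the earlier "vlan<ID>" naming:

```hcl
# Hedged sketch of option 3: bind each EPG to the AAEP once instead of once
# per port. aci_attachable_access_entity_profile.servers is an illustrative
# name for an AAEP assumed to exist; EPG keys reuse the "vlan<ID>" naming.
resource "aci_access_generic" "default" {
  attachable_access_entity_profile_dn = aci_attachable_access_entity_profile.servers.id
  name                                = "default"
}

resource "aci_epgs_using_function" "aaep_binding" {
  for_each          = var.vlans
  access_generic_dn = aci_access_generic.default.id
  tdn               = aci_application_epg.epg[each.key].id
  encap             = "vlan-${trimprefix(each.key, "vlan")}"
  mode              = "regular" # Trunk; see footnote 1 for the tagged/untagged constraint
}
```

This replaces the per-port explosion with one binding per VLAN per AAEP: roughly 100 resources instead of 19200 for the example fabric.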

If I have missed something, you can contact me via LinkedIn or Twitter.


  1. EPG association with the AAEP without static binding does not work in a scenario where you configure the EPG as Trunk under the AAEP, with one endpoint in the EPG supporting VLAN tagging and another endpoint in the same EPG not supporting VLAN tagging. When associating the AAEP under the EPG, you can configure it as Trunk, Access (Tagged) or Access (Untagged).
