Thursday, 3 January 2019

vPC – Part III


Failure Scenarios:
1. vPC peer link failure:
When vPC encounters a peer link failure, the following sequence of events happens:
  • The peer status changes to “Peer Link is down” on both vPC switches.
  • As the peer keepalive link is up, both switches know that their peer is alive.
  • So each switch retains its vPC role and the secondary does not take on the primary role, which avoids a split-brain (dual-active) situation.
  • A peer link failure cuts off east-west traffic between the peers, so to minimize the loss the secondary peer suspends its vPC member ports, leaving orphan ports up.
  • This prevents duplicate frames and loops in the network.
  • Unfortunately, this blackholes traffic for orphan ports on the secondary.
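The sequence above can be sketched as a small Python model. This is only an illustration of the behavior described in the bullets, not how NX-OS implements it; the port names (Po10, Po20, Eth1/5) and the function name are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Port:
    name: str            # hypothetical port name, for illustration only
    is_vpc_member: bool  # False means the port is an orphan port


def ports_after_peer_link_failure(role: str, keepalive_up: bool,
                                  ports: List[Port]) -> Dict[str, str]:
    """Return port name -> 'up' or 'suspended' once the peer link is down.

    Only the secondary suspends ports, and only while the keepalive link
    still proves that the primary is alive. Orphan ports are never suspended.
    """
    suspend = role == "secondary" and keepalive_up
    return {
        p.name: "suspended" if suspend and p.is_vpc_member else "up"
        for p in ports
    }


ports = [Port("Po10", True), Port("Po20", True), Port("Eth1/5", False)]
primary_view = ports_after_peer_link_failure("primary", True, ports)
secondary_view = ports_after_peer_link_failure("secondary", True, ports)
print("primary:  ", primary_view)
print("secondary:", secondary_view)
```

On the secondary, the two vPC member ports end up suspended while the orphan port stays up, which is exactly why traffic from that orphan port has nowhere to go.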
Let's shut down the peer link and see that vPC is still operational even though the peer link is down.
[Image: f1]
On the secondary peer, notice that the vPC member ports are down with the reason “Peer-link is down”:
[Image: f2]
So the updated topology is as below:
[Image: peerli]
What if a new port is added after the peer link has already failed? Will it come up?
Recall the vPC order of operations: the consistency check is one of its steps. With the peer link down, the consistency check fails, so the new ports stay down as well. They are brought up once the peer link is restored.
2. vPC Keepalive Link Failure:
  • When the vPC keepalive link fails and the peer link is still up, both switches still receive BPDUs from each other over the peer link.
  • So they retain their vPC roles, and the overall functionality of the vPC is not affected.
vPC status once the keepalive link is shut down:
Primary:
[Image: f4]
Secondary:
[Image: f5]
3. Peer and Keepalive link failure:
When both links fail, each switch assumes that its peer is dead and takes on the operational primary role. Hence we end up in an active-active scenario. Both switches now forward traffic and can form Layer 2 loops.
[Image: f8]
This sort of situation requires manual intervention to recover.
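To compare the first three scenarios side by side, here is a rough Python sketch of the role logic as this post describes it. It is deliberately simplified: it ignores timers and the order in which the links fail, and the function names are invented for the example.

```python
def operational_role(configured_role: str, peer_link_up: bool,
                     keepalive_up: bool) -> str:
    """Role a switch operates in, per the simplified rules in this post."""
    if not peer_link_up and not keepalive_up:
        # No sign of life from the peer on either link: assume it is dead.
        return "primary"
    # At least one link still proves the peer is alive: keep the role.
    return configured_role


def is_dual_active(peer_link_up: bool, keepalive_up: bool) -> bool:
    """True when both switches end up operating as primary."""
    roles = [operational_role(r, peer_link_up, keepalive_up)
             for r in ("primary", "secondary")]
    return roles.count("primary") == 2


for peer_link_up in (True, False):
    for keepalive_up in (True, False):
        print(f"peer link up={peer_link_up!s:5} keepalive up={keepalive_up!s:5} "
              f"dual-active={is_dual_active(peer_link_up, keepalive_up)}")
```

Only the combination where both links are down produces two operational primaries, which is the dual-active case that needs manual recovery.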
Primary acting as Primary:
[Image: f6]
Secondary also acting as operational Primary:
[Image: f7]
Though it sounds rare, poor network design can cause both the peer link and the keepalive link to fail at the same time. This can happen under the following circumstances:
  • Both links share the same module and that module fails on the switch.
  • The links are on different modules but connect to the peer switch through a common Layer 2 device, and that common device fails.
As a best practice, the bundled peer links and the keepalive link should not share the same fate. Redundancy should be in place in case a failure occurs.
If the keepalive link runs over an SVI, the “dual-active exclude interface-vlan” command can be used to keep that SVI up when the peer link fails.
4. Primary Peer Switch failure:
Suppose there is a power outage and the primary switch is powered off. The secondary switch considers its peer to be down because both the peer link and the keepalive link have gone down. Once three keepalives are missed, the secondary takes over the primary role and starts forwarding traffic.
[Image: f12]
When the original primary switch comes back up, it takes the operational secondary role, because the vPC role is non-preemptive. Preemption would cause traffic loss, which is not acceptable.
So if you come across output where the role is “Secondary, operational primary”, it indicates a past failure.
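The non-preemptive behavior can be illustrated with a small Python sketch. The switch names (N7K-1, N7K-2) and the helper functions are hypothetical, and the role strings only imitate the format quoted above.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str              # hypothetical hostname
    configured: str        # role won in the original election
    operational: str = ""  # role the switch is currently acting in
    alive: bool = True

    def __post_init__(self) -> None:
        if not self.operational:
            self.operational = self.configured

    def role(self) -> str:
        """Role string in the style of the output quoted in this post."""
        if self.operational == self.configured:
            return self.configured
        return f"{self.configured}, operational {self.operational}"


def peer_fails(failed: Switch, survivor: Switch) -> None:
    """The survivor takes over as operational primary."""
    failed.alive = False
    survivor.operational = "primary"


def peer_returns(returning: Switch, survivor: Switch) -> None:
    """No preemption: the returning switch takes whatever role is left."""
    returning.alive = True
    returning.operational = (
        "secondary" if survivor.operational == "primary" else "primary"
    )


sw1 = Switch("N7K-1", "primary")
sw2 = Switch("N7K-2", "secondary")
peer_fails(sw1, sw2)    # power outage on the primary
peer_returns(sw1, sw2)  # primary boots again
print(sw1.name, "->", sw1.role())
print(sw2.name, "->", sw2.role())
```

After the failure and recovery, the switch that was elected secondary reports “secondary, operational primary”, so seeing that string tells you a failover happened at some point.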
5. Primary Switch and Peer Link Failure:
Consider a situation where the peer link fails first, so the secondary suspends its vPC member ports. The primary keeps forwarding traffic, and then it fails as well. The secondary stops receiving heartbeats and suspects that the primary has failed. When three keepalives are missed, the secondary brings its ports back up and assumes the primary role.
[Image: peerli]
[Image: f11]
Because keepalives are sent every second, traffic is disrupted for around 4-5 seconds. This can be reduced by setting the keepalive interval to a lower value using the command below:
[Image: f9]
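As a quick back-of-the-envelope check, the detection part of that delay can be worked out in Python from the "three missed keepalives" rule used in this post. The lower interval values are illustrative assumptions, and the real NX-OS timers (which also include timeout settings) are not modeled here.

```python
def takeover_delay_s(interval_ms: int, missed: int = 3) -> float:
    """Seconds until `missed` consecutive keepalives have gone unanswered."""
    return interval_ms * missed / 1000


# 1000 ms is the one-second interval mentioned above; the lower
# values are example settings chosen only for comparison.
for interval_ms in (1000, 500, 400):
    delay = takeover_delay_s(interval_ms)
    print(f"interval {interval_ms:>4} ms -> peer declared dead after {delay} s")
```

With one-second keepalives, detection alone takes about 3 seconds under this rule; the rest of the 4-5 seconds would be the time the secondary needs to bring its ports back up, which is an assumption on my part rather than a measured figure.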
6. Power outage on both vPC peer switches:
If there is a power outage and both switches go down, vPC is completely down, causing a complete outage. Once power is restored, if only one of the switches comes up, keepalives are not heard and the peer link does not come up. This also prevents the member ports from coming up.
So even though one of the switches is restored, we still experience complete isolation. “Auto recovery” is an option used to overcome this failure situation.
[Image: f10]
This allows the switch to assume the primary role and start forwarding traffic if its peer does not come up.
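The decision the surviving switch makes after reload can be sketched in Python as follows. The function name and the 240-second wait are assumptions made for this example; check the delay configured on your own platform.

```python
def auto_recovery_action(auto_recovery_enabled: bool, peer_seen: bool,
                         seconds_since_boot: int,
                         reload_delay_s: int = 240) -> str:
    """What a lone switch does after both peers lost power (simplified).

    reload_delay_s is an assumed waiting period before the switch gives
    up on its peer; it is not taken from the original post.
    """
    if peer_seen:
        return "normal vPC bring-up"
    if not auto_recovery_enabled:
        return "member ports stay down"
    if seconds_since_boot < reload_delay_s:
        return "waiting for peer"
    return "assume primary and bring up member ports"


print(auto_recovery_action(False, False, 600))
print(auto_recovery_action(True, False, 60))
print(auto_recovery_action(True, False, 600))
```

Without auto recovery the member ports stay down for as long as the peer is missing, while with it the switch waits out the delay and then takes over on its own.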
In the coming parts of this vPC series, we will discuss vPC flavors, data plane forwarding, and HSRP with vPC.
