Below is my attempt to explain how active/active load balancing works on XenServer.
I did some digging into this and read every detailed doc I was able to find. Active/Active basically picks the port in the bond with the least amount of traffic and pins a VM's traffic to it, using the source load balancing (SLB, balance-slb) algorithm. It shifts the traffic to another port in the bond based on utilization, but in order to change ports it has to send a Gratuitous ARP (GARP) out. It keeps re-checking at a fixed interval (10 seconds or 30 minutes, depending on the version).
In a nutshell:
1. Each VM's traffic is pinned to one interface
2. Re-balancing takes place every 10 seconds (could be 30 minutes since v6.1) to check that the load on the two interfaces is close to equal – This is where issues could arise, because it could keep sending Gratuitous ARPs out to update the entries on the Cisco side (MAC shifting between links).
3. If it needs to re-balance, the bond interface sends a GARP to update the MAC address table (and any ARP entries) on the Cisco switches – There could be issues where the GARP is never received on the Cisco side, or where the Cisco side just doesn't act on the GARP (a broadcast)
4. That VM's traffic (all of it) is now shifted and pinned to the other interface until there's an uneven load on the interfaces again. It can keep shifting traffic.
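The steps above can be sketched roughly in code. This is only a toy model to illustrate the idea, not XenServer's actual implementation: the `Bond` class, the VM and NIC names, the load numbers and the rule for picking which VM to move are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Bond:
    """Toy model of an active/active bond (hypothetical, for illustration)."""
    pinned: dict                      # VM name -> NIC it is currently pinned to
    nics: tuple = ("NIC-A", "NIC-B")
    garps_sent: list = field(default_factory=list)

    def nic_load(self, vm_load):
        # Sum the load of every VM pinned to each NIC.
        load = {nic: 0 for nic in self.nics}
        for vm, nic in self.pinned.items():
            load[nic] += vm_load[vm]
        return load

    def rebalance(self, vm_load):
        """Run one re-balance pass; return the VM that moved, or None."""
        load = self.nic_load(vm_load)
        busy = max(load, key=load.get)
        idle = min(load, key=load.get)
        gap = load[busy] - load[idle]
        # Only a VM whose load is smaller than the gap makes things more even.
        candidates = [vm for vm, nic in self.pinned.items()
                      if nic == busy and 0 < vm_load[vm] < gap]
        if not candidates:
            return None
        # Assumed rule: move the VM that leaves the two NICs closest to equal.
        vm = min(candidates, key=lambda v: abs(gap - 2 * vm_load[v]))
        self.pinned[vm] = idle              # all of that VM's traffic moves
        self.garps_sent.append((vm, idle))  # stands in for the GARP to the switch
        return vm


bond = Bond(pinned={"VM1": "NIC-A", "VM2": "NIC-B", "VM3": "NIC-A", "VM4": "NIC-B"})
vm_load = {"VM1": 600, "VM2": 100, "VM3": 300, "VM4": 100}  # made-up Mbit/s

moved = bond.rebalance(vm_load)        # VM3 moves from NIC-A to NIC-B
moved_again = bond.rebalance(vm_load)  # loads are now 600/500, nothing to move
```

Every entry in `garps_sent` is a point where the switch has to relearn where a MAC lives, which is the part that can go wrong.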
Multiple VMs (VM1-4) load balanced through two NICs that are bonded (NIC A/B) – each VM can only use one link in the bond at a time:
VM1 – NIC-A
VM2 – NIC-B
VM3 – NIC-A
VM4 – NIC-B
*These are pinned until the re-balancing decides a VM needs to move to another NIC. If it decides it needs to change, it will shift that VM's entire traffic load to the other NIC, and this is where you can have issues.
It can NOT do this:
VM1 – NIC-A & NIC-B – This is LACP
VM2 – NIC-A & NIC-B – This is LACP
VM3 – NIC-A & NIC-B – This is LACP
VM4 – NIC-A & NIC-B – This is LACP
Disadvantages of Active/Active:
Aggregation – There is none, so each VM is limited to the speed of a single link (1G)
Re-balancing – This could cause issues because it means a GARP is sent out to update the entries on the Cisco side, and this could happen often. Each of these changes means communication could break when a VM's entire traffic shifts to the other interface. If it happens frequently, which it could depending on the algorithm, then I'm sure it's NOT always smooth sailing. MAC addresses shifting between links that often cannot be good. I would avoid this in larger environments. Mom and pop shops can get away with it because of their low amount of traffic, but for an enterprise or a larger environment I would go with LACP. Always think about the future and scalability.
The best setup is LACP, where both ports are aggregated (possible to use 2G and not 1G) and there's no GARP or re-balancing going on. This is actually the preferred method. Both ends are basically one logical port talking to each other. You also have more bandwidth to use since you're no longer pinned to 1G. But be careful with that last sentence. I'll create a post on link aggregation because there's a lot of confusion about this.
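As a rough illustration of why that last sentence needs care: a link aggregation group normally chooses a link per flow by hashing packet header fields, so different flows can spread across both links, but any single flow still stays on one link. The sketch below assumes a made-up hash (CRC32 over the addresses and ports) and made-up IPs and ports. Real switches and hosts use their own hash inputs and functions, so treat this as a picture of the idea only.

```python
import zlib


def link_for_flow(src_ip, dst_ip, src_port, dst_port, n_links=2):
    """Pick a bond member for a flow (hypothetical hash, for illustration)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    # Same header fields always give the same hash, so a flow never changes link.
    return zlib.crc32(key) % n_links


# Eight flows from one VM to one server, differing only in source port.
flows = [("10.0.0.11", "10.0.0.50", 40000 + i, 443) for i in range(8)]
links = [link_for_flow(*flow) for flow in flows]

for flow, link in zip(flows, links):
    print(flow, "-> link", link)
```

One VM can end up using both links at once here, which active/active cannot do, but each individual flow is still capped at the speed of the one link it hashes to.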
There was a document I sent out, but it wasn't understood by the people I sent it to:
https://support.citrix.com/article/CTX132559
It's basically stating that you can only use Active/Active when the switch side is ONE logical switch. So for Cisco that means you need to be running VSS, vPC or switch stacking. This is because of all the MAC shifting between ports: the switch has to be smart enough to keep track of these changes. Makes sense. It's a lot more technical than this, but that's basically what's needed on the switch side.