Hi,
I have an HP c7000 blade enclosure with a couple of BL460c Gen8 blades and two HP VC Flex-10/10D interconnect modules. On the blades I have OpenStack Pike installed with the latest CentOS 7 as the host operating system, and neutron is configured to use Linux bridge to connect VMs (virtual machines) to the external world. The flow is as follows:
external switch<-->HP virtual connect<-->[eno1 bridge tap]<-->[eth0 cirrosVM]
eno1 is the first physical interface CentOS 7 sees on the BL460c blades, and the bridge is used to connect the cirrosVM to the outside network. The tap interface is used by neutron to bridge together the eth0 interface in the cirrosVM and the eno1 interface in CentOS.
The problem is the following:
When the cirrosVM wants to access any IP address in the external network, an ARP request is sent out on the eth0 interface and enters the Linux bridge on the tap interface. In its forwarding table, the bridge assigns the eth0 MAC address to the tap interface, and sends the ARP request out on the eno1 interface. Now, some device on the left side of the bridge (I am sure the HP Virtual Connect does that) broadcasts that ARP request back, so the same ARP request re-enters the bridge on the eno1 interface, and the bridge assigns the source MAC address of the ARP request (which is still the eth0 MAC address) to the eno1 port in the forwarding table. This behaviour, with the eth0 MAC address wrongly assigned to the eno1 port in the bridge (while it should be assigned to the tap interface), causes all IP communication to stop working for that IP address, because replies arriving on eno1 for that MAC address are no longer forwarded to the tap interface.
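To make the sequence above easier to follow, here is a minimal Python sketch of a MAC-learning bridge. It is only a toy model of the learning and forwarding rules, not the kernel's bridge code, and the class name, the MAC addresses and the router are made up for illustration.

```python
class LearningBridge:
    """Toy model of a MAC-learning bridge (not the kernel implementation)."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.fdb = {}  # source MAC -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source MAC, then return the list of egress ports."""
        self.fdb[src_mac] = in_port
        out = self.fdb.get(dst_mac)
        if out is None:
            # Unknown or broadcast destination: flood to every other port.
            return [p for p in self.ports if p != in_port]
        # A frame is never sent back out of the port it arrived on.
        return [] if out == in_port else [out]


VM_MAC = "fa:16:3e:00:00:01"      # hypothetical MAC of eth0 in the cirrosVM
ROUTER_MAC = "00:11:22:33:44:55"  # hypothetical MAC of the external router
BROADCAST = "ff:ff:ff:ff:ff:ff"

bridge = LearningBridge(["tap", "eno1"])

# 1. The ARP request from the VM enters on tap and is flooded to eno1.
request_egress = bridge.receive("tap", VM_MAC, BROADCAST)
learned_after_request = bridge.fdb[VM_MAC]

# 2. The same frame is reflected back and enters on eno1.
bridge.receive("eno1", VM_MAC, BROADCAST)
learned_after_reflection = bridge.fdb[VM_MAC]

# 3. The ARP reply for the VM arrives on eno1.
reply_egress = bridge.receive("eno1", ROUTER_MAC, VM_MAC)

print("request sent out on:", request_egress)
print("VM MAC learned on:", learned_after_request, "then", learned_after_reflection)
print("reply forwarded to:", reply_egress)
```

In this model the reply in step 3 is forwarded nowhere, because the bridge now believes the VM MAC address lives behind eno1, the same port the reply arrived on.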
So the question is: why does the HP Virtual Connect send back the ARP request broadcast? Is there a feature in the HP Virtual Connect that could disable this behaviour? Or is this a bug in the HP Virtual Connect?
In the flow above, the external switch (HP ProCurve 2848) is used as a router as well, so the layer 2 broadcast domain is terminated in that switch. I am sure the external switch does not cause this behaviour, since I have used port monitoring to trace the traffic, and I see the ARP request arriving but not going out again.
However, if I use tcpdump and trace on the eno1 interface, I see the ARP request going out and coming back in. If I use the bridge monitor command, I can see the bridge assigning the eth0 MAC address to the tap interface first (when the ARP request goes out), then to the eno1 interface when the ARP request comes back in.
Initially, instead of using the eno1 interface alone, I had bonded eno1 and eno2 together, so both Flex-10/10D switch modules were in use. I thought this ARP request behaviour might be caused by some loop in the Virtual Connect; however, even now, when only eno1 is in use (and therefore there is no loop), the same issue happens.
The workaround I have found is to convert the bridge to a hub by setting ageing to 0 (brctl setageing br-name 0). This way, the bridge floods all packets on all attached bridge interfaces (eno1 and the tap interface), and everything starts working. However, this is not a long-term solution.
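As a rough illustration of why the workaround helps, here is a small Python sketch of a bridge whose forwarding entries expire after an ageing time. It is a toy model under my own assumptions, not the kernel's logic: the class name, MAC addresses and timestamps are invented, and 300 seconds is used as the usual Linux bridge default ageing time.

```python
class AgeingBridge:
    """Toy learning bridge whose forwarding entries expire after `ageing` seconds."""

    def __init__(self, ports, ageing):
        self.ports = list(ports)
        self.ageing = ageing
        self.fdb = {}  # MAC -> (port, time the entry was last refreshed)

    def receive(self, in_port, src_mac, dst_mac, now):
        """Learn the source MAC, then return the list of egress ports."""
        self.fdb[src_mac] = (in_port, now)
        entry = self.fdb.get(dst_mac)
        # An entry is usable only while it is younger than the ageing time,
        # so with ageing 0 no entry is ever usable and every frame is flooded.
        if entry is None or now - entry[1] >= self.ageing:
            return [p for p in self.ports if p != in_port]
        return [] if entry[0] == in_port else [entry[0]]


VM_MAC = "fa:16:3e:00:00:01"      # hypothetical MAC of eth0 in the cirrosVM
ROUTER_MAC = "00:11:22:33:44:55"  # hypothetical MAC of the external router
BROADCAST = "ff:ff:ff:ff:ff:ff"


def reply_egress(ageing):
    """Replay request, reflected request and reply; return where the reply goes."""
    bridge = AgeingBridge(["tap", "eno1"], ageing)
    bridge.receive("tap", VM_MAC, BROADCAST, now=0.0)      # request leaves the VM
    bridge.receive("eno1", VM_MAC, BROADCAST, now=0.001)   # request reflected back
    return bridge.receive("eno1", ROUTER_MAC, VM_MAC, now=0.002)  # reply arrives


print("ageing 300:", reply_egress(300))
print("ageing 0:  ", reply_egress(0))
```

With an ageing time of 300 the reply is dropped in this model, while with ageing 0 it is flooded and reaches the tap interface, which matches what I see with the workaround.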
I am wondering if anybody else has seen this issue, and if there is any solution for it.
Many thanks in advance,
Adrian