Re: [PATCH] net/bonding: send arp in interval if no active slave

From: Andy Gospodarek
Date: Tue Sep 01 2015 - 11:42:23 EST


On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
> >Uwe Koziolek <uwe.koziolek@xxxxxxxxxxx> wrote:
> >
> >>On2015-08-17 07:12 PM,Jarod Wilson wrote:
> >>>On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
> >>>>On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
> >>>>>From: Uwe Koziolek <uwe.koziolek@xxxxxxxxxxx>
> >>>>>
> >>>>>With some very finicky switch hardware, active backup bonding can get
> >>>>>into
> >>>>>a situation where we play ping-pong between interfaces, trying to get
> >>>>>one
> >>>>>to come up as the active slave. There seems to be an issue with the
> >>>>>switch's arp replies either taking too long, or simply getting lost,
> >>>>>so we
> >>>>>wind up unable to get any interface up and active. Sometimes, the issue
> >>>>>sorts itself out after a while, sometimes it doesn't.
> >>>>>
> >>>>>Testing with num_grat_arp has proven fruitless, but sending an
> >>>>>additional
> >>>>>arp on curr_arp_slave if we're still in the arp_interval timeslice in
> >>>>>bond_ab_arp_probe(), has shown to produce 100% reliability in testing
> >>>>>with
> >>>>>this hardware combination.
> >>>>Sorry, I don't understand the logic of why it works, and what exactly
> >>>>are
> >>>>we fixiing here.
> >>>>
> >>>>It also breaks completely the logic for link state management in case
> >>>>of no
> >>>>current active slave for 2*arp_interval.
> >>>>
> >>>>Could you please elaborate what exactly is fixed here, and how it
> >>>>works? :)
> >>>I can either duplicate some information from the bug, or Uwe can, to
> >>>illustrate the exact nature of the problem.
> >>>
> >>>>p.s. num_grat_arp maybe could help?
> >>>That was my thought as well, but as I understand it, that route was
> >>>explored, and it didn't help any. I don't actually have a reproducer
> >>>setup of my own, unfortunately, so I'm kind of caught in the middle
> >>>here...
> >>>
> >>>Uwe, can you perhaps further enlighten us as to what num_grat_arp
> >>>settings were tried that didn't help? I'm still of the mind that if
> >>>num_grat_arp *didn't* help, we probably need to do something keyed off
> >>>num_grat_arp.
> >>The bonding slaves are connected to high available switches, each of the
> >>slaves is connected to a different switch. If the bond is starting, only
> >>the selected slave sends one arp-request. If a matching arp_response was
> >>received, this slave and the bond is going into state up, sending the
> >>gratitious arps...
> >>But if you got no arp reply the next slave was selected.
> >>With most of the newer switches, not overloaded, or with other software
> >>bugs, or with a single switch configuration, you would get a arp response
> >>on the first arp request.
> >>But in case of high availability configuration with non perfect switches
> >>like HP ProCurve 54xx, also with some Cisco models, you may not get a
> >>response on the first arp request.
> >>
> >>I have seen network snoops, there the switches are not responding to the
> >>first arp request on slave 1, the second arp request was sent on slave 2
> >>but the response was received on slave one, and all following arp
> >>requests are anwsered on the wrong slave for a longer time.
> > Could you elaborate on the exact "high availability
> >configuration" here, including the model(s) of switch(es) involved?
> >
> > Is this some kind of race between the switch or switches
> >updating the forwarding tables and the bond flip flopping between the
> >slaves? E.g., source MAC from ARP sent on slave 1 is used to populate
> >the forwarding table, but (for whatever reason) there is no reply. ARP
> >on slave 2 is sent (using the same source MAC, unless you set
> >fail_over_mac), but forwarding tables still send that MAC to slave 1, so
> >reply is sent there.
> High availability:
> 2 managed switches with routing capabilities have an interconnect.
> One slave of a bonding interface is connected to the first switch, the
> second slave is connected to the other switch.
> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as i
> remember also HP E 3500 and E 3800 are also
> affected, for the affected Cisco models I can't answer today.
> Affected single switch configurations was not seen.
>
> Yes, race conditions with delayed upgrades of the forwarding tables is a
> well matching explanation for the problem.
>
> >>The proposed change sents up to 3 arp requests on a down bond using the
> >>same slave, delayed by arp_interval.
> >>Using problematic switches i have seen the the arp response on the right
> >>slave at latest on the second arp request. So the bond is going into state
> >>up.
> >>
> >>How does it works:
> >>The bonds in up state are handled on the beginning of bond_ab_arp_probe
> >>procedure, the other part of this procedure is handling the slave change.
> >>The proposed change is bypassing the slave change for 2 additional calls
> >>of bond_ab_arp_probe.
> >>Now the retries are not only for an up bond available, they are also
> >>implemented for a down bond.
> > Does this delay failover or bringup on switches that are not
> >"problematic"? I.e., if arp_interval is, say, 1000 (1 second), will
> >this impact failover / recovery times?
> >
> > -J
> It depends.
> failover times are not impacted, this is handled different.
> Only the transition from a down bonding interface (bond and all slaves are
> down) to the state up can be increased by up to 2 times arp_interval,
> If the selected interface did not came up .If well working switches are
> used, and everything other is also ok, there are no impacts.

So I'm not a huge fan of workarounds like these, but I also understand
from a practical standpoint that this is useful. My only issue with the
patch would be to please include a small comment (1-2 lines) in the code
that describes the behavior. I know we have the changelog entries for
this, but I would feel better about having an exception like this in the
code for those reading it and wondering:

"Why would we wait 2 intervals before failing over to the next interface
when there are no active interfaces?"


>
> >>The num_grat_arp has no chance to solve the problem. The num_grat_arp is
> >>only used, if a different slave is going active.
> >>But in our case, the bonding slaves are not going into the state active
> >>for a longer time.
> >>>>>[jarod: manufacturing of changelog]
> >>>>>CC: Jay Vosburgh <j.vosburgh@xxxxxxxxx>
> >>>>>CC: Veaceslav Falico <vfalico@xxxxxxxxx>
> >>>>>CC: Andy Gospodarek <gospo@xxxxxxxxxxxxxxxxxxx>
> >>>>>CC: netdev@xxxxxxxxxxxxxxx
> >>>>>Signed-off-by: Uwe Koziolek <uwe.koziolek@xxxxxxxxxxx>
> >>>>>Signed-off-by: Jarod Wilson <jarod@xxxxxxxxxx>
> >>>>>---
> >>>>>drivers/net/bonding/bond_main.c | 5 +++++
> >>>>>1 file changed, 5 insertions(+)
> >>>>>
> >>>>>diff --git a/drivers/net/bonding/bond_main.c
> >>>>>b/drivers/net/bonding/bond_main.c
> >>>>>index 0c627b4..60b9483 100644
> >>>>>--- a/drivers/net/bonding/bond_main.c
> >>>>>+++ b/drivers/net/bonding/bond_main.c
> >>>>>@@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
> >>>>>*bond)
> >>>>> return should_notify_rtnl;
> >>>>> }
> >>>>>
> >>>>>+ if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
> >>>>>{
> >>>>>+ bond_arp_send_all(bond, curr_arp_slave);
> >>>>>+ return should_notify_rtnl;
> >>>>>+ }
> >>>>>+
> >>>>> bond_set_slave_inactive_flags(curr_arp_slave,
> >>>>>BOND_SLAVE_NOTIFY_LATER);
> >>>>>
> >>>>> bond_for_each_slave_rcu(bond, slave, iter) {
> >>>>>--
> >>>>>1.8.3.1
> >---
> > -Jay Vosburgh, jay.vosburgh@xxxxxxxxxxxxx
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/