Re: [bonding][patch] Regarding a bonding lacp issue
From: Jay Vosburgh
Date: Thu Aug 08 2019 - 15:17:35 EST
Felix <fei.feng@xxxxxxxxxxxxxxxxx> wrote:
>Dear Mainteners,
>
>Recently I hit a packet drop issue in bonding driver on Linux 4.9. Please
>see details below. Please take a look to see if my understanding is
>correct. Many thanks.
>
>What is the problem?
>The bonding driver starts to send packets even if the Partner(Switch)'s
>Collecting bit is not enabled yet. Partner would drop all packets until
>its Collecting bit is enabled.
>
>What is the root cuase?
>According to LACP spec, the Actor need to check Partner's Sync and
>Collecting bits before enable its Distributing bit and Distributing
>function. Please see the PIC below.
The diagram you reference is found in 802.1AX-2014 figure 6-21,
which shows the state diagram for an independent control implementation,
i.e., collecting and distributing are managed independently.
However, Linux bonding implements coupled control, which is
shown in figure 6-22. Here, there is no Partner.Collecting requirement
on the state transition from ATTACHED to COLLECTING_DISTRIBUTING.
To quote 802.1AX-2014 6.4.15:
As independent control is not possible, the coupled control
state machine does not wait for the Partner to signal that
collection has started before enabling both collection and
distribution.
Now, that said, I agree that what you're seeing is likely
explained by this behavior, and your fix should resolve the immediate
problem (that bonding sends packets before the peer has enabled
COLLECTING).
However, your fix does put bonding out of compliance with the
standard, as it does not really implement COLLECTING and DISTRIBUTING as
discrete states. In particular, if the peer in your case were to later
clear Partner.Collecting, bonding will not react to this as a figure
6-21 independent control implementation would (which isn't a change from
current behavior, but currently this isn't expected).
So, in my opinion a patch like this should have a comment
attached noting that we are deliberately not in compliance with the
standard in this specific situation. The proper fix is to implement
figure 6-21 separate state.
Lastly, are you able to test and generate a patch against
current upstream, instead of 4.9?
-J
>How to fix?
>Please see the diff as following. And the patch is attached.
>
>--- ../origin/linux-4.9.188/drivers/net/bonding/bond_3ad.c 2019-08-07
>00:29:42.000000000 +0800
>+++ drivers/net/bonding/bond_3ad.c 2019-08-08 23:13:29.015640197 +0800
>@@ -937,6 +937,7 @@
> */
> if ((port->sm_vars & AD_PORT_SELECTED) &&
> (port->partner_oper.port_state & AD_STATE_SYNCHRONIZATION) &&
>+ (port->partner_oper.port_state & AD_STATE_COLLECTING) &&
> !__check_agg_selection_timer(port)) {
> if (port->aggregator->is_active)
> port->sm_mux_state =
>
>------
>Thanks,
>Felix
---
-Jay Vosburgh, jay.vosburgh@xxxxxxxxxxxxx