Re: [RESEND PATCH net-next v4] bond: add mac filter option for balance-xor

From: Nikolay Aleksandrov
Date: Wed Jul 13 2022 - 00:07:03 EST


On 7/11/22 15:07, Jonathan Toppins wrote:
Implement a MAC filter that prevents duplicate frame delivery when
handling BUM traffic. This attempts to partially replicate OvS SLB
Bonding[1] like functionality without requiring significant change
in the Linux bridging code.

A typical network setup for this feature would be:

.--------------------------------------------.
| .--------------------. |
| | | |
.-------------------. | |
| | Bond 0 | | | |
| .--'---. .---'--. | | |
.----|-| eth0 |-| eth1 |-|----. .-----+----. .----+------.
| | '------' '------' | | | Switch 1 | | Switch 2 |
| '---,---------------' | | +---+ |
| / | '----+-----' '----+------'
| .---'---. .------. | | |
| | br0 |----| VM 1 | | ~~~~~~~~~~~~~~~~~~~~~
| '-------' '------' | ( )
| | .------. | ( Rest of Network )
| '--------| VM # | | (_____________________)
| '------' |
| Host 1 |
'-----------------------------'

Where 'VM1' and 'VM#' are hosts connected to a Linux bridge, br0, with
bond0 and its associated links, eth0 & eth1, provide ingress/egress. One
can assume bond0, br1, and hosts VM1 to VM# are all contained in a
single box, as depicted. Interfaces eth0 and eth1 provide redundant
connections to the data center with the requirement to use all bandwidth
when the system is functioning normally. Switch 1 and Switch 2 are
physical switches that do not implement any advanced L2 management
features such as MLAG, Cisco's VPC, or LACP.

Combining this feature with vlan+srcmac hash policy allows a user to
create an access network without the need to use expensive switches that
support features like Cisco's VCP.

[1] https://docs.openvswitch.org/en/latest/topics/bonding/#slb-bonding

Co-developed-by: Long Xin <lxin@xxxxxxxxxx>
Signed-off-by: Long Xin <lxin@xxxxxxxxxx>
Signed-off-by: Jonathan Toppins <jtoppins@xxxxxxxxxx>
---

Notes:
v2:
* dropped needless abstraction functions and put code in module init
* renamed variable "rc" to "ret" to stay consistent with most of the
code
* fixed parameter setting management, when arp-monitor is turned on
this feature will be turned off similar to how miimon and arp-monitor
interact
* renamed bond_xor_recv to bond_mac_filter_recv for a little more
clarity
* it appears the implied default return code for any bonding recv probe
must be `RX_HANDLER_ANOTHER`. Changed the default return code of
bond_mac_filter_recv to use this return value to not break skb
processing when the skb dev is switched to the bond dev:
`skb->dev = bond->dev`
v3: Nik's comments
* clarified documentation
* fixed inline and basic reverse Christmas tree formatting
* zero'ed entry in mac_create
* removed read_lock taking in bond_mac_filter_recv
* made has_expired() atomic and removed critical sections
surrounding calls to has_expired(), this also removed the
use-after-free that would have occurred:
spin_lock_irqsave(&entry->lock, flags);
if (has_expired(bond, entry))
mac_delete(bond, entry);
spin_unlock_irqrestore(&entry->lock, flags); <---
* moved init/destroy of mac_filter_tbl to bond_open/bond_close
this removed the complex option dependencies, the only behavioural
change the user will see is if the bond is up and mac_filter is
enabled if they try and set arp_interval they will receive -EBUSY
* in bond_changelink moved processing of mac_filter option just below
mode processing
v4:
* rebase to latest net-next
* removed rcu_read_lock() call in bond_mac_insert()
* used specific spin_lock_{}() calls instead of the irqsave version
* Outstanding comments from Nik:
https://lore.kernel.org/netdev/4c9db6ac-aa24-2ca2-3e44-18cfb23ac1bc@xxxxxxxxxxxxx/
- old version of the patch still under discussion
https://lore.kernel.org/netdev/d2696dab-2490-feb5-ccb2-96906fc652f0@xxxxxxxxxx/
* response: it has been over a month now and no new comments have come in
so I am assuming there is nothing left to the discussion

Not really, I didn't have time to adjust my solution, but it's not hard
to solve the multicast ingress. More below.

- What if anyone decides at some point 5 seconds are not enough or too much?
* response: I think making that configurable at a later time should not
prevent the inclusion of the initial feature. > - This bit is pointless, you still have races even though not
critical
* response: if there are races please point them out simply making a
comment about an enum doesn't show the race.

You had race conditions about the flag, now that you always take the lock you no longer have them, but also the flag is useless.

- This is not scalable at all, you get the lock even to update the
expire on *every* packet, most users go for L3 or L3+L4 and will hit
this with different flows on different CPUs.
* response: we take the lock to update the value, so that receive
traffic is not dropped. Further any implementation, bpf or nft,
will also have to take locks to update MAC entries. If there is a
specific locking order that will be less bad, I would appreciate
that discussion. Concerning L3 or L3+L4 hashing this is not the
data I have, in fact the most utilized hash is layer2.


Yes, there are more scalable alternatives that are already implemented
in nft and eBPF and you can get them for free if you go with some of the
options below.

To the point, tbh I'm amazed we're still discussing applying this change :)

anyway here are my suggestions:

a) if you insist on adding some new code then at least add only
what is missing, i.e. add a xor mode option to limit ingress mcast only
to the active slave, then use the nftables mac filtering solution that I
gave you and you'll have the best of both worlds, you'll be able to dump
the current macs, edit them if needed, control the filtering time and
have the full nft flexibility, also it will scale much better than this

b) write a 3-4 line bash script that allows mcast only through the active slave, and again use the nft solution I gave you for the rest

c) extend the eBPF program I wrote to do mcast filtering, again it'll scale much better

(tentative) d) figure out a way to solve the mcast ingress entirely with nft, I suspect it's possible but I don't have time to waste to prove it

I don't plan on reviewing the patch in this form because I don't think
it should be applied at all since all of this can be easily implemented
with the available tools, but I won't officially nack it either.
That would be the maintainers' decision.

Cheers,
Nik