Re: [PATCH bpf-next 1/2] bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign

From: Joe Stringer
Date: Fri May 26 2023 - 01:57:12 EST


On Thu, May 25, 2023 at 1:19 AM Lorenz Bauer <lmb@xxxxxxxxxxxxx> wrote:
>
> Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT
> sockets. This means we can't use the helper to steer traffic to Envoy, which
> configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing
> TPROXY from our setup.
>
> The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup
> helpers don't execute SK_REUSEPORT programs. Instead, one of the
> reuseport sockets is selected by hash. This could cause dispatch to the
> "wrong" socket:
>
> sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash
> bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed
>
> Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup
> helpers unfortunately. In the tc context, L2 headers are at the start
> of the skb, while SK_REUSEPORT expects L3 headers instead.
>
> Instead, we execute the SK_REUSEPORT program when the assigned socket
> is pulled out of the skb, further up the stack. This creates some
> trickiness with regard to refcounting, as bpf_sk_assign will put both
> refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU
> freed. We can infer that the sk_assigned socket is RCU freed if the
> reuseport lookup succeeds, but convincing yourself of this fact isn't
> straightforward. Therefore we defensively check refcounting on the
> sk_assign sock even though it's probably not required in practice.
>
> Fixes: 8e368dc ("bpf: Fix use of sk->sk_reuseport from sk_assign")
> Fixes: cf7fbe6 ("bpf: Add socket assign support")
> Co-developed-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> Signed-off-by: Lorenz Bauer <lmb@xxxxxxxxxxxxx>
> Cc: Joe Stringer <joe@xxxxxxxxx>
> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@xxxxxxxxxxxxxx/

Nice approach to fix this issue, wish I'd thought of it :)

I pulled this and tested it out in a little-vm-helper environment with
kind and Cilium's examples/kubernetes/connectivity-check proxy suite,
as well as cilium-cli's connectivity tests. The L7 features seem to
be working as expected with SO_REUSEPORT.

Tested-by: Joe Stringer <joe@xxxxxxxxx>

I also glanced through the commit, and the various protocols seem to
be handled consistently at the very least, though I agree it'd be
simpler to review and bisect if it were broken down into more
incremental changes.
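
For anyone following along, the "wrong" socket case from the quoted
commit message can be modelled in plain userspace C. This is only a toy
sketch, not kernel code: reciprocal_scale() uses the same arithmetic as
the kernel helper of that name, but select_by_prog() is a hypothetical
stand-in for whatever policy an attached SK_REUSEPORT program applies
(here, pinning by source port), and the function names other than
reciprocal_scale() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the kernel's reciprocal_scale(): map a 32-bit
 * value onto [0, n) without a division. */
uint32_t reciprocal_scale(uint32_t val, uint32_t n)
{
	return (uint32_t)(((uint64_t)val * n) >> 32);
}

/* Models selection when no SK_REUSEPORT program runs: the flow hash
 * alone picks an index into the reuseport group. */
uint32_t select_by_hash(uint32_t flow_hash, uint32_t num_socks)
{
	assert(num_socks > 0);
	return reciprocal_scale(flow_hash, num_socks);
}

/* Hypothetical SK_REUSEPORT policy (an assumption for this sketch):
 * pin each flow to a socket based on its source port. */
uint32_t select_by_prog(uint16_t src_port, uint32_t num_socks)
{
	assert(num_socks > 0);
	return src_port % num_socks;
}

/* Returns 1 if hash-based selection disagrees with the program's
 * choice, i.e. the packet would be dispatched to the "wrong" socket. */
int dispatch_mismatch(uint32_t flow_hash, uint16_t src_port,
		      uint32_t num_socks)
{
	return select_by_hash(flow_hash, num_socks) !=
	       select_by_prog(src_port, num_socks);
}
```

With a group of four sockets, dispatch_mismatch(0x9e3779b9, 40001, 4)
reports a mismatch: the hash path picks index 2 while the toy policy
picks index 1. That disagreement is what running the SK_REUSEPORT
program at the point where the socket is pulled out of the skb is
meant to avoid.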