Re: [PATCH bpf-next v3 0/4] xdp: recycle Page Pool backed skbs built from XDP frames

From: Ilya Leoshkevich
Date: Wed Mar 15 2023 - 14:27:29 EST


On Wed, 2023-03-15 at 19:12 +0100, Alexander Lobakin wrote:
> From: Ilya Leoshkevich <iii@xxxxxxxxxxxxx>
> Date: Wed, 15 Mar 2023 19:00:47 +0100
>
> > On Wed, 2023-03-15 at 15:54 +0100, Ilya Leoshkevich wrote:
> > > On Wed, 2023-03-15 at 11:54 +0100, Alexander Lobakin wrote:
> > > > From: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
> > > > Date: Wed, 15 Mar 2023 10:56:25 +0100
> > > >
> > > > > From: Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx>
> > > > > Date: Tue, 14 Mar 2023 16:54:25 -0700
> > > > >
> > > > > > On Tue, Mar 14, 2023 at 11:52 AM Alexei Starovoitov
> > > > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > test_xdp_do_redirect:PASS:prog_run 0 nsec
> > > > > > test_xdp_do_redirect:PASS:pkt_count_xdp 0 nsec
> > > > > > test_xdp_do_redirect:PASS:pkt_count_zero 0 nsec
> > > > > > test_xdp_do_redirect:FAIL:pkt_count_tc unexpected pkt_count_tc:
> > > > > > actual 220 != expected 9998
> > > > > > test_max_pkt_size:PASS:prog_run_max_size 0 nsec
> > > > > > test_max_pkt_size:PASS:prog_run_too_big 0 nsec
> > > > > > close_netns:PASS:setns 0 nsec
> > > > > > #289 xdp_do_redirect:FAIL
> > > > > > Summary: 270/1674 PASSED, 30 SKIPPED, 1 FAILED
> > > > > >
> > > > > > Alex,
> > > > > > could you please take a look at why it's happening?
> > > > > >
> > > > > > I suspect it's an endianness issue in:
> > > > > >         if (*metadata != 0x42)
> > > > > >                 return XDP_ABORTED;
> > > > > > but your patch didn't change that,
> > > > > > so I'm not sure why it worked before.
> > > > >
> > > > > Sure, lemme fix it real quick.
> > > >
> > > > Hi Ilya,
> > > >
> > > > Do you have an s390 testing setup? Could you maybe take a look, since
> > > > I don't have one and can't debug it myself? It doesn't seem to be an
> > > > endianness issue.
> > > > I mean, I have this (the below patch), but I'm not sure it will fix
> > > > anything -- IIRC the eBPF arch always matches the host arch ._.
> > > > I can't figure out from the code what goes wrong :s And it only
> > > > happens on s390.
> > > >
> > > > Thanks,
> > > > Olek
> > > > ---
> > > > diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c b/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c
> > > > index 662b6c6c5ed7..b21371668447 100644
> > > > --- a/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c
> > > > +++ b/tools/testing/selftests/bpf/prog_tests/xdp_do_redirect.c
> > > > @@ -107,7 +107,7 @@ void test_xdp_do_redirect(void)
> > > >                             .attach_point = BPF_TC_INGRESS);
> > > >  
> > > >         memcpy(&data[sizeof(__u32)], &pkt_udp, sizeof(pkt_udp));
> > > > -       *((__u32 *)data) = 0x42; /* metadata test value */
> > > > +       *((__u32 *)data) = htonl(0x42); /* metadata test value */
> > > >  
> > > >         skel = test_xdp_do_redirect__open();
> > > >         if (!ASSERT_OK_PTR(skel, "skel"))
> > > > diff --git a/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c b/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c
> > > > index cd2d4e3258b8..2475bc30ced2 100644
> > > > --- a/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c
> > > > +++ b/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c
> > > > @@ -1,5 +1,6 @@
> > > >  // SPDX-License-Identifier: GPL-2.0
> > > >  #include <vmlinux.h>
> > > > +#include <bpf/bpf_endian.h>
> > > >  #include <bpf/bpf_helpers.h>
> > > >  
> > > >  #define ETH_ALEN 6
> > > > @@ -28,7 +29,7 @@ volatile int retcode = XDP_REDIRECT;
> > > >  SEC("xdp")
> > > >  int xdp_redirect(struct xdp_md *xdp)
> > > >  {
> > > > -       __u32 *metadata = (void *)(long)xdp->data_meta;
> > > > +       __be32 *metadata = (void *)(long)xdp->data_meta;
> > > >         void *data_end = (void *)(long)xdp->data_end;
> > > >         void *data = (void *)(long)xdp->data;
> > > >  
> > > > @@ -44,7 +45,7 @@ int xdp_redirect(struct xdp_md *xdp)
> > > >         if (metadata + 1 > data)
> > > >                 return XDP_ABORTED;
> > > >  
> > > > -       if (*metadata != 0x42)
> > > > +       if (*metadata != __bpf_htonl(0x42))
> > > >                 return XDP_ABORTED;
> > > >  
> > > >         if (*payload == MARK_XMIT)
> > >
> > > Okay, I'll take a look. Two quick observations for now:
> > >
> > > - Unfortunately the above patch does not help.
> > >
> > > - In dmesg I see:
> > >
> > >     Driver unsupported XDP return value 0 on prog xdp_redirect (id 23)
> > >     dev N/A, expect packet loss!
> >
> > I haven't identified the issue yet, but I have found a couple more
> > things that might be helpful:
> >
> > - In the problematic cases the metadata contains 0, so this is not an
> >   endianness issue. The data still looks reasonable, though. I'm trying
> >   to understand what is causing this.
> >
> > - Applying the following diff:
> >
> > --- a/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c
> > +++ b/tools/testing/selftests/bpf/progs/test_xdp_do_redirect.c
> > @@ -52,7 +52,7 @@ int xdp_redirect(struct xdp_md *xdp)
> >  
> >         *payload = MARK_IN;
> >  
> > -       if (bpf_xdp_adjust_meta(xdp, 4))
> > +       if (false && bpf_xdp_adjust_meta(xdp, 4))
> >                 return XDP_ABORTED;
> >  
> >         if (retcode > XDP_PASS)
> >
> > causes a kernel panic even on x86_64:
> >
> > BUG: kernel NULL pointer dereference, address: 0000000000000d28
> > ...
> > Call Trace:
> >  <TASK>
> >  build_skb_around+0x22/0xb0
> >  __xdp_build_skb_from_frame+0x4e/0x130
> >  bpf_test_run_xdp_live+0x65f/0x7c0
> >  ? __pfx_xdp_test_run_init_page+0x10/0x10
> >  bpf_prog_test_run_xdp+0x2ba/0x480
> >  bpf_prog_test_run+0xeb/0x110
> >  __sys_bpf+0x2b9/0x570
> >  __x64_sys_bpf+0x1c/0x30
> >  do_syscall_64+0x48/0xa0
> >  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> >
> > I haven't looked into this at all, but I believe this needs to be
> > fixed - BPF should never cause kernel panics.
>
> This one is basically the same issue as the one syzbot reported today
> (separate subthread). I'm waiting for feedback from Toke on which way of
> fixing it he'd prefer (I proposed two). If the zeroed metadata magic you
> observe has the same root cause as the panic, one fix will take care of
> both issues.
>
> Thanks,
> Olek

Sounds good, I will wait for an update then.

In the meantime, I found the code that overwrites the metadata:

#0 0x0000000000aaeee6 in neigh_hh_output (hh=0x83258df0, skb=0x88142200) at linux/include/net/neighbour.h:503
#1 0x0000000000ab2cda in neigh_output (skip_cache=false, skb=0x88142200, n=<optimized out>) at linux/include/net/neighbour.h:544
#2 ip6_finish_output2 (net=net@entry=0x88edba00, sk=sk@entry=0x0, skb=skb@entry=0x88142200) at linux/net/ipv6/ip6_output.c:134
#3 0x0000000000ab4cbc in __ip6_finish_output (skb=0x88142200, sk=0x0, net=0x88edba00) at linux/net/ipv6/ip6_output.c:195
#4 ip6_finish_output (net=0x88edba00, sk=0x0, skb=0x88142200) at linux/net/ipv6/ip6_output.c:206
#5 0x0000000000ab5cbc in dst_input (skb=<optimized out>) at linux/include/net/dst.h:454
#6 ip6_sublist_rcv_finish (head=head@entry=0x38000dbf520) at linux/net/ipv6/ip6_input.c:88
#7 0x0000000000ab6104 in ip6_list_rcv_finish (net=<optimized out>, head=<optimized out>, sk=0x0) at linux/net/ipv6/ip6_input.c:145
#8 0x0000000000ab72bc in ipv6_list_rcv (head=0x38000dbf638, pt=<optimized out>, orig_dev=<optimized out>) at linux/net/ipv6/ip6_input.c:354
#9 0x00000000008b3710 in __netif_receive_skb_list_ptype (orig_dev=0x880b8000, pt_prev=0x176b7f8 <ipv6_packet_type>, head=0x38000dbf638) at linux/net/core/dev.c:5520
#10 __netif_receive_skb_list_core (head=head@entry=0x38000dbf7b8, pfmemalloc=pfmemalloc@entry=false) at linux/net/core/dev.c:5568
#11 0x00000000008b4390 in __netif_receive_skb_list (head=0x38000dbf7b8) at linux/net/core/dev.c:5620
#12 netif_receive_skb_list_internal (head=head@entry=0x38000dbf7b8) at linux/net/core/dev.c:5711
#13 0x00000000008b45ce in netif_receive_skb_list (head=head@entry=0x38000dbf7b8) at linux/net/core/dev.c:5763
#14 0x0000000000950782 in xdp_recv_frames (dev=<optimized out>, skbs=<optimized out>, nframes=62, frames=0x8587c600) at linux/net/bpf/test_run.c:256
#15 xdp_test_run_batch (xdp=xdp@entry=0x38000dbf900, prog=prog@entry=0x37fffe75000, repeat=<optimized out>) at linux/net/bpf/test_run.c:334

namely:

static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff *skb)
...
	memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD);

It's hard for me to tell what is going on here, since I'm not familiar
with the networking code. Since the XDP metadata is located at the end of
the headroom, shouldn't there be something that prevents the network stack
from overwriting it? Or is netif_receive_skb_list() free to do whatever it
wants with that memory, so that we cannot expect to get it back intact?
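
For what it's worth, below is a minimal user-space sketch of how I
currently picture the headroom when neigh_hh_output() runs. It is not
kernel code and it rests on assumptions I have not verified: that the skb
built from the XDP frame keeps the frame's layout (4 bytes of metadata
immediately in front of the Ethernet header, as the selftest sets it up),
that skb->data points at the IPv6 header at this point, and that
HH_DATA_MOD is 16. Under those assumptions the cached-header write ends
exactly at skb->data and reaches 2 bytes past the front of the Ethernet
header, i.e. into the tail of the metadata:

#include <stdio.h>
#include <string.h>

#define HH_DATA_MOD	16	/* as in include/net/neighbour.h */
#define ETH_HLEN	14	/* Ethernet header carried in hh_data */
#define META_SIZE	4	/* bpf_xdp_adjust_meta(xdp, 4) in the selftest */

int main(void)
{
	unsigned char headroom[256] = { 0 };

	/* data plays the role of skb->data when neigh_hh_output() runs,
	 * i.e. it points at the IPv6 header; the pulled Ethernet header
	 * and (assumed) the XDP metadata sit in front of it.
	 */
	unsigned char *data = headroom + 128;
	unsigned char *eth = data - ETH_HLEN;
	unsigned char *metadata = eth - META_SIZE;

	memset(eth, 0xee, ETH_HLEN);		/* stand-in Ethernet header */
	memset(metadata, 0x42, META_SIZE);	/* stand-in metadata */

	/* neigh_hh_output() does
	 *	memcpy(skb->data - HH_DATA_MOD, hh->hh_data, HH_DATA_MOD);
	 * The 16-byte block ends exactly at skb->data, so it covers the 14
	 * Ethernet-header bytes plus the 2 bytes in front of them, which in
	 * this picture are the last 2 bytes of the metadata.
	 */
	unsigned char hh_data[HH_DATA_MOD] = { 0 };	/* stand-in for hh->hh_data */
	memcpy(data - HH_DATA_MOD, hh_data, HH_DATA_MOD);

	/* Prints "42 42 00 00": the tail of the metadata is gone. */
	printf("metadata after the header write: %02x %02x %02x %02x\n",
	       metadata[0], metadata[1], metadata[2], metadata[3]);
	return 0;
}

If that picture is correct, it would at least be consistent with the
metadata not surviving the trip through the stack.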