Re: kernel BUG at net/core/skbuff.c:4219

From: dracoding
Date: Thu Apr 11 2024 - 02:23:56 EST


From: Jeremi Piotrowski <jpiotrowski@xxxxxxxxxxxxxxxxxxx>

> On Tue, Oct 11, 2022 at 10:57:05AM -0700, Eric Dumazet wrote:
> >
> > On 10/11/22 09:56, Jeremi Piotrowski wrote:
> > >Hi,
> > >
> > >One of our Flatcar users has been hitting the kernel BUG in the subject line
> > >for the past year (https://github.com/flatcar/Flatcar/issues/378). This was
> > >first reported when on 5.10.25, but has been happening across kernel updates,
> > >most recently with 5.15.63. The nodes where this happens are AWS EC2 instances,
> > >using ENA and calico networking in eBPF mode with VXLAN encapsulation. When
> > >GRO/GSO is enabled, the host hits this bug and prints the following stacktrace:
> >
> >
> > I suspect eBPF code lowers gso_size ?
> >
> > gso stack is not able to arbitrarily segment a GRO packet after
> > gso_size being changed.
> >
> >
>
> This was a good hint, see Tomas' response for some more observations.
>
> This appears to still be happening with Calico v3.23 which started passing
> BPF_F_ADJ_ROOM_FIXED_GSO to bpf_skb_adjust_room() on the decap (rx) path.
> BPF_F_ADJ_ROOM_FIXED_GSO is not passed on the encap (tx) path. It is enough to
> disable GRO to stop the BUG from being hit though, so there must be more going
> on here ? (since the rx path does not change gso_size any longer).
>

Hi,

I encountered a similar error with Calico v3.24.5.
The kernel crashed at BUG_ON(skb_headlen(list_skb) > len) in skb_segment() with the following stack trace,
but I don't know how to reproduce it.

[exception RIP: skb_segment+3016]
RIP: ffffffffb97df2a8 RSP: ffffa3f2cce08728 RFLAGS: 00010293
RAX: 000000000000007d RBX: 00000000fffff7b3 RCX: 0000000000000011
RDX: 0000000000000000 RSI: ffff895ea32c76c0 RDI: 00000000000008c1
RBP: ffffa3f2cce087f8 R8: 000000000000088f R9: 0000000000000011
R10: 000000000000090c R11: ffff895e47e68000 R12: ffff895eb2022f00
R13: 000000000000004b R14: ffff895ecdaf2000 R15: ffff895eb2023f00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffffa3f2cce08720] skb_segment at ffffffffb97ded63
#10 [ffffa3f2cce08800] tcp_gso_segment at ffffffffb98d0320
#11 [ffffa3f2cce08860] tcp4_gso_segment at ffffffffb98d07a3
#12 [ffffa3f2cce08880] inet_gso_segment at ffffffffb98e6de0
#13 [ffffa3f2cce088e0] skb_mac_gso_segment at ffffffffb97f3741
#14 [ffffa3f2cce08918] skb_udp_tunnel_segment at ffffffffb98daa59
#15 [ffffa3f2cce08980] udp4_ufo_fragment at ffffffffb98db471
#16 [ffffa3f2cce089b0] inet_gso_segment at ffffffffb98e6de0
#17 [ffffa3f2cce08a10] skb_mac_gso_segment at ffffffffb97f3741
#18 [ffffa3f2cce08a48] __skb_gso_segment at ffffffffb97f388e
#19 [ffffa3f2cce08a78] validate_xmit_skb at ffffffffb97f3d6e
#20 [ffffa3f2cce08ab8] __dev_queue_xmit at ffffffffb97f4614
#21 [ffffa3f2cce08b50] dev_queue_xmit at ffffffffb97f5030
#22 [ffffa3f2cce08b60] __bpf_redirect at ffffffffb98199a8
#23 [ffffa3f2cce08b88] skb_do_redirect at ffffffffb98205cd
#24 [ffffa3f2cce08bb8] __netif_receive_skb_core at ffffffffb97f6585
#25 [ffffa3f2cce08c68] __netif_receive_skb_list_core at ffffffffb97f6c0a
#26 [ffffa3f2cce08ce8] netif_receive_skb_list_internal at ffffffffb97f6f6a
#27 [ffffa3f2cce08d60] gro_normal_list at ffffffffb97f717e
#28 [ffffa3f2cce08d80] gro_normal_one at ffffffffb97f721c
#29 [ffffa3f2cce08db8] napi_gro_complete at ffffffffb97f72ac
#30 [ffffa3f2cce08de0] napi_gro_flush at ffffffffb97f73c1
#31 [ffffa3f2cce08e30] napi_complete_done at ffffffffb97f7d1e
#32 [ffffa3f2cce08e60] ice_napi_poll at ffffffffc0477dd6 [ice]
#33 [ffffa3f2cce08ec0] __napi_poll at ffffffffb97f823e
#34 [ffffa3f2cce08ef0] net_rx_action at ffffffffb97f86f1
#35 [ffffa3f2cce08f70] __softirqentry_text_start at ffffffffb9e000dd
#36 [ffffa3f2cce08fd8] irq_exit_rcu at ffffffffb9096074
#37 [ffffa3f2cce08ff0] common_interrupt at ffffffffb9a3272a

The gso_size is 75. Could bpf_skb_adjust_room() have subtracted 50 (the VXLAN header length) from it?
The frag_list has one element, whose head_frag is 1. The skb_shared_info struct is as follows:

struct skb_shared_info {
nr_frags = 17 '\021', 
gso_size = 75, 
gso_segs = 0, 
frag_list = 0xffff895eb2022f00, 
gso_type = 1035, 
destructor_arg = 0x2d656c6261747372, 
frags = {{
    bv_page = 0xfffff80e86d4d180, 
    bv_len = 125, 
    bv_offset = 2306
  },
....
}
}
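
For reference, the gso_type value in the dump above can be decoded into flag names. This is a minimal user-space sketch, not kernel code. It assumes the bit layout of the SKB_GSO_* enum in include/linux/skbuff.h as found in recent 5.x kernels; the crashing kernel version is not stated here, so the layout is an assumption. The helper name decode_gso_type is made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed bit layout: SKB_GSO_* from include/linux/skbuff.h (recent 5.x). */
struct gso_flag {
    unsigned int bit;
    const char *name;
};

static const struct gso_flag gso_flags[] = {
    { 1u << 0,  "SKB_GSO_TCPV4" },
    { 1u << 1,  "SKB_GSO_DODGY" },
    { 1u << 2,  "SKB_GSO_TCP_ECN" },
    { 1u << 3,  "SKB_GSO_TCP_FIXEDID" },
    { 1u << 4,  "SKB_GSO_TCPV6" },
    { 1u << 5,  "SKB_GSO_FCOE" },
    { 1u << 6,  "SKB_GSO_GRE" },
    { 1u << 7,  "SKB_GSO_GRE_CSUM" },
    { 1u << 8,  "SKB_GSO_IPXIP4" },
    { 1u << 9,  "SKB_GSO_IPXIP6" },
    { 1u << 10, "SKB_GSO_UDP_TUNNEL" },
    { 1u << 11, "SKB_GSO_UDP_TUNNEL_CSUM" },
    { 1u << 12, "SKB_GSO_PARTIAL" },
    { 1u << 13, "SKB_GSO_TUNNEL_REMCSUM" },
    { 1u << 14, "SKB_GSO_SCTP" },
    { 1u << 15, "SKB_GSO_ESP" },
    { 1u << 16, "SKB_GSO_UDP" },
    { 1u << 17, "SKB_GSO_UDP_L4" },
    { 1u << 18, "SKB_GSO_FRAGLIST" },
};

/*
 * Hypothetical helper: writes the names of all flags set in gso_type into
 * buf, separated by '|', and returns how many names were written.
 */
static int decode_gso_type(unsigned int gso_type, char *buf, size_t buflen)
{
    int count = 0;
    size_t used = 0;

    if (buflen == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(gso_flags) / sizeof(gso_flags[0]); i++) {
        if (!(gso_type & gso_flags[i].bit))
            continue;
        int n = snprintf(buf + used, buflen - used, "%s%s",
                         count ? "|" : "", gso_flags[i].name);
        if (n < 0 || (size_t)n >= buflen - used) {
            buf[used] = '\0';   /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Under that assumed layout, decode_gso_type(1035, buf, sizeof(buf)) gives SKB_GSO_TCPV4|SKB_GSO_DODGY|SKB_GSO_TCP_FIXEDID|SKB_GSO_UDP_TUNNEL. As far as I can tell, bpf_skb_adjust_room() sets SKB_GSO_DODGY and resets gso_segs to 0 when it resizes a GSO skb, which would be consistent with gso_segs = 0 in the dump.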

Does anyone have any suggestions other than disabling GRO/GSO? Could the BPF_F_ADJ_ROOM_FIXED_GSO flag
be enabled on the encap path? I'd be happy to provide more information if needed.
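
To make the arithmetic behind that question concrete, here is a small user-space model (not kernel code) of what I understand bpf_skb_adjust_room() to do to gso_size when it grows room for an encap header. The function name gso_size_after_grow and the VXLAN_OVERHEAD constant are made up for illustration. The flag value is the one from include/uapi/linux/bpf.h; the assumption is that without the flag the helper lowers gso_size by the number of bytes added.

```c
#include <stdint.h>

/* Value from include/uapi/linux/bpf.h. */
#define BPF_F_ADJ_ROOM_FIXED_GSO (1ULL << 0)

/* Illustrative constant: outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14). */
#define VXLAN_OVERHEAD 50u

/*
 * Hypothetical model of the gso_size adjustment on the encap (grow) path:
 * with BPF_F_ADJ_ROOM_FIXED_GSO the size is left alone, otherwise it is
 * reduced by the number of header bytes that were added.
 */
static uint16_t gso_size_after_grow(uint16_t gso_size, uint32_t len_diff,
                                    uint64_t flags)
{
    if (flags & BPF_F_ADJ_ROOM_FIXED_GSO)
        return gso_size;
    return (uint16_t)(gso_size - len_diff);
}
```

With these assumptions, gso_size_after_grow(125, VXLAN_OVERHEAD, 0) returns 75, the value in the dump, and 125 happens to equal bv_len of the first frag; that may be a coincidence. Passing BPF_F_ADJ_ROOM_FIXED_GSO would leave gso_size at 125, but the caller would then have to make sure the encapsulated segments still fit the path MTU.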

fred