Re: [PATCH net] net: ensure all external references are released in deferred skbuffs

From: Eric Dumazet
Date: Wed Jun 22 2022 - 07:44:17 EST


On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets <i.maximets@xxxxxxx> wrote:
>
> On 6/22/22 12:36, Eric Dumazet wrote:
> > On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal <fw@xxxxxxxxx> wrote:
> >>
> >> Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> >>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@xxxxxxx> wrote:
> >>>>
> >>>> Open vSwitch system test suite is broken due to inability to
> >>>> load/unload netfilter modules. kworker thread is getting trapped
> >>>> in the infinite loop while running a net cleanup inside the
> >>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> >>>> holding nfct references and not being freed by their CPU cores.
> >>>>
> >>>> In general, the idea that we will have an rx interrupt on every
> >>>> CPU core at some point in a near future doesn't seem correct.
> >>>> Devices are getting created and destroyed, interrupts are getting
> >>>> re-scheduled, CPUs are going online and offline dynamically.
> >>>> Any of these events may leave packets stuck in defer list for a
> >>>> long time. It might be OK, if they are just a piece of memory,
> >>>> but we can't afford them holding references to any other resources.
> >>>>
> >>>> In case of OVS, nfct reference keeps the kernel thread in busy loop
> >>>> while holding a 'pernet_ops_rwsem' semaphore. That blocks the
> >>>> later modprobe request from user space:
> >>>>
> >>>> # ps
> >>>> 299 root R 99.3 200:25.89 kworker/u96:4+
> >>>>
> >>>> # journalctl
> >>>> INFO: task modprobe:11787 blocked for more than 1228 seconds.
> >>>> Not tainted 5.19.0-rc2 #8
> >>>> task:modprobe state:D
> >>>> Call Trace:
> >>>> <TASK>
> >>>> __schedule+0x8aa/0x21d0
> >>>> schedule+0xcc/0x200
> >>>> rwsem_down_write_slowpath+0x8e4/0x1580
> >>>> down_write+0xfc/0x140
> >>>> register_pernet_subsys+0x15/0x40
> >>>> nf_nat_init+0xb6/0x1000 [nf_nat]
> >>>> do_one_initcall+0xbb/0x410
> >>>> do_init_module+0x1b4/0x640
> >>>> load_module+0x4c1b/0x58d0
> >>>> __do_sys_init_module+0x1d7/0x220
> >>>> do_syscall_64+0x3a/0x80
> >>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >>>>
> >>>> At this point OVS testsuite is unresponsive and never recover,
> >>>> because these skbuffs are never freed.
> >>>>
> >>>> Solution is to make sure no external references attached to skb
> >>>> before pushing it to the defer list. Using skb_release_head_state()
> >>>> for that purpose. The function modified to be re-enterable, as it
> >>>> will be called again during the defer list flush.
> >>>>
> >>>> Another approach that can fix the OVS use-case, is to kick all
> >>>> cores while waiting for references to be released during the net
> >>>> cleanup. But that sounds more like a workaround for a current
> >>>> issue rather than a proper solution and will not cover possible
> >>>> issues in other parts of the code.
> >>>>
> >>>> Additionally checking for skb_zcopy() while deferring. This might
> >>>> not be necessary, as I'm not sure if we can actually have zero copy
> >>>> packets on this path, but seems worth having for completeness as we
> >>>> should never defer such packets regardless.
> >>>>
> >>>> CC: Eric Dumazet <edumazet@xxxxxxxxxx>
> >>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> >>>> Signed-off-by: Ilya Maximets <i.maximets@xxxxxxx>
> >>>> ---
> >>>> net/core/skbuff.c | 16 +++++++++++-----
> >>>> 1 file changed, 11 insertions(+), 5 deletions(-)
> >>>
> >>> I do not think this patch is doing the right thing.
> >>>
> >>> Packets sitting in TCP receive queues should not hold state that is
> >>> not relevant for TCP recvmsg().
> >>
> >> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
> >> not be possible to remove nf_conntrack module in practice.
> >
> > Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?
>
> Yeah, that is kind of the main problem I have with the current
> code. It's very hard to find all the cases where skb has to be
> cleaned up and almost impossible for someone who doesn't know
> every aspect of every network subsystem in the kernel. That's
> why I went with the more or less bulletproof approach of cleaning
> up while actually deferring. I can try and test the code you
> proposed in the other reply, but at least, I think, we need a
> bunch of debug warnings in the skb_attempt_defer_free() to catch
> possible issues.

Debug warnings are expensive if they need to bring new cache lines.

So far skb_attempt_defer_free() is only used by TCP in well known conditions.


>
> Also, what about cleaning up extensions? IIUC, at least one
> of them can hold external references. SKB_EXT_SEC_PATH holds
> xfrm_state. We should, probably, free them as well?

I do not know about this, I would ask XFRM maintainers

>
> And what about zerocopy skb? I think, we should still not
> allow them to be deferred as they seem to hold HW resources.

The point of skb_attempt_defer_free() i is to make the freeing happen
at producer
much instead of consumer.

I do not think there is anything special in this regard with zero
copy. I would leave the current code as is.

A simpler patch might be to move the existing nf_reset_ct() earlier,
can you test this ?

I note that IPv6 does the nf_reset_ct() earlier, from ip6_protocol_deliver_rcu()

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
struct sock *sk;
int ret;

+ nf_reset_ct(skb);
+
drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
if (skb->pkt_type != PACKET_HOST)
goto discard_it;
@@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (drop_reason)
goto discard_and_relse;

- nf_reset_ct(skb);
-
if (tcp_filter(sk, skb)) {
drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
goto discard_and_relse;