Re: [PATCH net] net: ensure all external references are released in deferred skbuffs
From: Ilya Maximets
Date: Fri Sep 22 2023 - 17:08:19 EST
On 9/22/23 21:07, Ilya Maximets wrote:
> On 9/22/23 15:26, Ilya Maximets wrote:
>> On 6/19/22 02:39, Ilya Maximets wrote:
>>> The Open vSwitch system test suite is broken due to an inability to
>>> load/unload netfilter modules. A kworker thread gets trapped in an
>>> infinite loop while running net cleanup inside
>>> nf_conntrack_cleanup_net_list(), because deferred skbuffs are still
>>> holding nfct references and are not being freed by their CPU cores.
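>>>
>>> For reference, the loop the kworker spins in looks roughly like this
>>> (simplified from net/netfilter/nf_conntrack_core.c; exact field
>>> layout differs between kernel versions):
>>>
>>>     void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
>>>     {
>>>         struct net *net;
>>>         int busy;
>>>
>>>         synchronize_net();
>>>     i_see_dead_people:
>>>         busy = 0;
>>>         list_for_each_entry(net, net_exit_list, exit_list) {
>>>             nf_ct_iterate_cleanup(kill_all, net, 0, 0);
>>>             /* Deferred skbs still hold nfct references, so the
>>>              * per-netns conntrack count never reaches zero ... */
>>>             if (atomic_read(&net->ct.count) != 0)
>>>                 busy = 1;
>>>         }
>>>         if (busy) {
>>>             schedule();
>>>             /* ... and the cleanup retries forever. */
>>>             goto i_see_dead_people;
>>>         }
>>>     }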
>>>
>>> In general, the idea that we will get an rx interrupt on every CPU
>>> core at some point in the near future doesn't seem correct. Devices
>>> are getting created and destroyed, interrupts are getting
>>> re-scheduled, and CPUs are going online and offline dynamically.
>>> Any of these events may leave packets stuck in a defer list for a
>>> long time. That might be OK if they were just pieces of memory, but
>>> we can't afford to have them holding references to any other resources.
>>>
>>> In the case of OVS, the nfct reference keeps the kernel thread in a
>>> busy loop while holding the 'pernet_ops_rwsem' semaphore. That
>>> blocks a later modprobe request from user space:
>>>
>>> # ps
>>> 299 root R 99.3 200:25.89 kworker/u96:4+
>>>
>>> # journalctl
>>> INFO: task modprobe:11787 blocked for more than 1228 seconds.
>>> Not tainted 5.19.0-rc2 #8
>>> task:modprobe state:D
>>> Call Trace:
>>> <TASK>
>>> __schedule+0x8aa/0x21d0
>>> schedule+0xcc/0x200
>>> rwsem_down_write_slowpath+0x8e4/0x1580
>>> down_write+0xfc/0x140
>>> register_pernet_subsys+0x15/0x40
>>> nf_nat_init+0xb6/0x1000 [nf_nat]
>>> do_one_initcall+0xbb/0x410
>>> do_init_module+0x1b4/0x640
>>> load_module+0x4c1b/0x58d0
>>> __do_sys_init_module+0x1d7/0x220
>>> do_syscall_64+0x3a/0x80
>>> entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>
>>> At this point the OVS testsuite is unresponsive and never recovers,
>>> because these skbuffs are never freed.
>>>
>>> The solution is to make sure no external references are attached to
>>> an skb before pushing it to the defer list, using
>>> skb_release_head_state() for that purpose. The function is modified
>>> to be re-entrant, as it will be called again during the defer list
>>> flush.
>>>
>>> Another approach that could fix the OVS use case is to kick all
>>> cores while waiting for references to be released during the net
>>> cleanup. But that sounds more like a workaround for the current
>>> issue than a proper solution, and it would not cover possible
>>> issues in other parts of the code.
>>>
>>> Additionally, the patch checks for skb_zcopy() while deferring (see
>>> the sketch below). This might not be necessary, as I'm not sure we
>>> can actually have zero-copy packets on this path, but it seems worth
>>> having for completeness, since we should never defer such packets
>>> regardless.
>>>
>>> CC: Eric Dumazet <edumazet@xxxxxxxxxx>
>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
>>> Signed-off-by: Ilya Maximets <i.maximets@xxxxxxx>
>>> ---
>>
>> Happy Friday! :)
>>
>> Resurrecting this thread because I managed to reproduce the issue again
>> on the latest 6.6.0-rc1.
>>
>> (It doesn't mean we need to accept this particular patch, I just think
>> that it's an appropriate discussion thread.)
>>
>> It's a slightly different testsuite this time. Last year I had an issue
>> with the OVS testsuite; today I have an issue with the OVN testsuite.
>> Their structure is very similar, but the OVN tests are a fair bit more
>> complex.
>>
>> The story is the same: each test loads a set of kernel modules including
>> OVS and nf_conntrack, sends some traffic, verifies OVN functionality,
>> stops OVN/OVS, and unloads all the previously loaded modules to clean
>> up the environment for the next tests.
>>
>> The kernel hangs in the same way as before, waiting for the nf_conntrack
>> module to be unloaded:
>>
>> 13 root R 100.0 933:17.98 kworker/u80:1+netns
>>
>> CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014
>> Workqueue: netns cleanup_net
>> RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>> Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8
>> ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e
>> 2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e
>> 2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00
>> RSP: 0018:ffffb23040073d98 EFLAGS: 00000202
>> RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38
>> RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920
>> RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38
>> R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080
>> R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0
>> FS: 0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>> <NMI>
>> ? nmi_cpu_backtrace+0x82/0xf0
>> ? nmi_cpu_backtrace_handler+0xd/0x20
>> ? nmi_handle+0x5e/0x150
>> ? default_do_nmi+0x40/0x100
>> ? exc_nmi+0x112/0x190
>> ? end_repeat_nmi+0x16/0x67
>> ? __pfx_kill_all+0x10/0x10 [nf_conntrack]
>> ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>> ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>> ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>> </NMI>
>> <TASK>
>> nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack]
>> cleanup_net+0x213/0x3b0
>> process_one_work+0x177/0x340
>> worker_thread+0x27e/0x390
>> ? __pfx_worker_thread+0x10/0x10
>> kthread+0xe2/0x110
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork+0x30/0x50
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork_asm+0x1b/0x30
>> </TASK>
>>
>> Since this testsuite is different from the one I had issues with before,
>> I don't really know when the issue first appeared. It may have been
>> there even last year.
>>
>> I can reproduce the issue with the Ubuntu 6.2.0-32-generic kernel, but
>> I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9.
>> That doesn't necessarily mean that RHEL9 doesn't have the issue, though;
>> the testsuite is not 100% reliable.
>>
>> I'll try to dig deeper and bisect the problem on the upstream kernel.
>>
>> For now it seems like the issue is likely in the IPv6 code, because the
>> tests that trigger it are mostly IPv6-related.
>
> So, I bisected this down to an unsurprising commit:
>
> commit b0e214d212030fe497d4d150bb3474e50ad5d093
> Author: Madhu Koriginja <madhu.koriginja@xxxxxxx>
> Date: Tue Mar 21 21:28:44 2023 +0530
>
> netfilter: keep conntrack reference until IPsecv6 policy checks are done
>
> Keep the conntrack reference until policy checks have been performed for
> IPsec V6 NAT support, just like ipv4.
>
> The reference needs to be dropped before a packet is
> queued to avoid having the conntrack module unloadable.
>
> Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support")
> Signed-off-by: Madhu Koriginja <madhu.koriginja@xxxxxxx>
> Signed-off-by: Florian Westphal <fw@xxxxxxxxx>
>
> This commit repeated the pattern from the old ipv4 commit b59c270104f0
> ("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done")
> and hence introduced the exact same issue, but for IPv6. They both failed
> to deliver on the "avoid having the conntrack module unloadable" part.
>
> IIUC, the fix should be exactly the same as the one Eric made for IPv4
> in commit 6f0012e35160 ("tcp: add a missing nf_reset_ct() in 3WHS
> handling").
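>
> Something along these lines (placement in tcp_v6_do_rcv() is my
> assumption, mirroring the IPv4 change; the actual patch may differ):
>
>     --- a/net/ipv6/tcp_ipv6.c
>     +++ b/net/ipv6/tcp_ipv6.c
>     @@ tcp_v6_do_rcv(), TCP_LISTEN case @@
>          if (sk->sk_state == TCP_LISTEN) {
>              struct sock *nsk = tcp_v6_cookie_check(sk, skb);
>
>              if (!nsk)
>                  goto discard;
>
>              if (nsk != sk) {
>     +            /* Drop the conntrack reference once the policy
>     +             * checks are done, so the skb doesn't sit in a
>     +             * receive queue pinning nf_conntrack. */
>     +            nf_reset_ct(skb);
>                  if (tcp_child_process(sk, nsk, skb))
>                      goto reset;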
>
> I can try that and send a fix.
Posted a fix here:
https://lore.kernel.org/netdev/20230922210530.2045146-1-i.maximets@xxxxxxx/
It fixes the problem I have with the OVN testsuite.
>
> Would still really like to have some preventive mechanism for this kind
> of issue, though. Any ideas on that?
>
> Best regards, Ilya Maximets.