Re: 6.18.13 iwlwifi deadlock allocating cma while work-item is active.
From: Ben Greear
Date: Tue Mar 03 2026 - 16:41:26 EST
On 3/3/26 13:12, Johannes Berg wrote:
> On Tue, 2026-03-03 at 10:52 -1000, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Mar 03, 2026 at 12:49:24PM +0100, Johannes Berg wrote:
>>> Fair. I don't know, I don't think there's anything that even shows that
>>> there's a dependency between the two workqueues and the
>>> "((wq_completion)events_unbound)" and "((wq_completion)events)", and
>>> there would have to be for it to deadlock this way because of that?
>>>
>>> But one is mm_percpu_wq and the other is system_percpu_wq.
>>>
>>> Tejun, does the workqueue code somehow introduce a dependency between
>>> different per-CPU workqueues that's not modelled in lockdep?
>>
>> Hopefully not. Kinda late to the party. Why isn't mm_percpu_wq making
>> forward progress? That should in all circumstances. What's the work item and
>> kworker doing?
>
> Oh and in addition: the worker that's kicked off by
> __lru_add_drain_all() doesn't really seem to do anything long-running?
> It's lru_add_drain_per_cpu(), which is lru_add_and_bh_lrus_drain(),
> which would appear to be entirely non-sleepable code (holding either
> local locks or having irqs disabled.) It also doesn't show up in the
> log, apparently, hence my question about strange dependencies.

Hello Tejun,

If I use a kthread to do the blocking reg_todo work, then the problem
goes away. So it does appear that the work flush logic down in swap.c
is being blocked by the reg_todo work item, rather than the swap.c
logic just blocking against itself.

My kthread hack left the reg_todo work item logic in place, but instead of
the work item doing any blocking work, it just wakes the kthread
I added and has that kthread do the work under a mutex.
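
To make the shape of that hand-off concrete, here is a minimal userspace
analogue using pthreads. It is a sketch only and not the actual patch: every
name in it (todo_thread, todo_work_fn, do_blocking_todo, ...) is hypothetical,
and a condition variable stands in for whatever wake-up mechanism the kernel
version uses. The point it illustrates is that the work function never blocks
on the work mutex; it only sets a flag and wakes the dedicated thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout: a userspace analogue, not the kernel patch. */
struct todo_thread {
	pthread_t tid;
	pthread_mutex_t wake_lock;  /* protects pending and stop */
	pthread_cond_t wake;
	bool pending;               /* set by the work item, cleared by the thread */
	bool stop;
	pthread_mutex_t work_mutex; /* stands in for the mutex the blocking work takes */
	int runs;                   /* how many times the blocking work ran */
};

/* The part that used to run inside the work item and may block. */
static void do_blocking_todo(struct todo_thread *t)
{
	pthread_mutex_lock(&t->work_mutex);
	t->runs++; /* the real work would go here */
	pthread_mutex_unlock(&t->work_mutex);
}

/* Dedicated thread: sleep until woken, then do the work outside wake_lock. */
static void *todo_thread_fn(void *arg)
{
	struct todo_thread *t = arg;

	pthread_mutex_lock(&t->wake_lock);
	for (;;) {
		while (!t->pending && !t->stop)
			pthread_cond_wait(&t->wake, &t->wake_lock);
		if (!t->pending) /* stop requested and nothing left to do */
			break;
		t->pending = false;
		pthread_mutex_unlock(&t->wake_lock);
		do_blocking_todo(t);
		pthread_mutex_lock(&t->wake_lock);
	}
	pthread_mutex_unlock(&t->wake_lock);
	return NULL;
}

/* All the work item still does: mark work pending, wake the thread, return. */
static void todo_work_fn(struct todo_thread *t)
{
	pthread_mutex_lock(&t->wake_lock);
	t->pending = true;
	pthread_cond_signal(&t->wake);
	pthread_mutex_unlock(&t->wake_lock);
}

/* Returns 0 on success, like pthread_create(). */
static int todo_thread_start(struct todo_thread *t)
{
	assert(t != NULL);
	pthread_mutex_init(&t->wake_lock, NULL);
	pthread_cond_init(&t->wake, NULL);
	pthread_mutex_init(&t->work_mutex, NULL);
	t->pending = false;
	t->stop = false;
	t->runs = 0;
	return pthread_create(&t->tid, NULL, todo_thread_fn, t);
}

/* Finishes any pending work, joins the thread, returns the run count. */
static int todo_thread_stop(struct todo_thread *t)
{
	pthread_mutex_lock(&t->wake_lock);
	t->stop = true;
	pthread_cond_signal(&t->wake);
	pthread_mutex_unlock(&t->wake_lock);
	pthread_join(t->tid, NULL);
	return t->runs;
}
```

Calling todo_thread_start(), then todo_work_fn() once, then todo_thread_stop()
runs the blocking work exactly once. Several triggers that arrive before the
thread wakes collapse into a single run, which is similar to how re-queuing an
already pending work item behaves.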

The second regulatory-related work item in net/wireless/ causes the same
lockup, though it was harder to reproduce. Putting that work in the kthread
also seems to have fixed it.

I could only ever reproduce this with KASAN (and lockdep and other debugging
options) enabled. My guess is that this is because the system then runs slower
and/or is under more memory pressure.

I should still be able to reproduce this if I switch to an upstream kernel, so
if there is any debugging code you'd like me to run, I will attempt to
do so.

Thanks,
Ben
--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc http://www.candelatech.com