Re: 6.18.13 iwlwifi deadlock allocating cma while work-item is active.

From: Hillf Danton

Date: Tue Mar 03 2026 - 22:11:20 EST


On Tue, 03 Mar 2026 12:49:24 +0100 Johannes Berg wrote:
>On Mon, 2026-03-02 at 07:50 -0800, Ben Greear wrote:
>> On 3/2/26 07:38, Johannes Berg wrote:
>> > On Mon, 2026-03-02 at 07:26 -0800, Ben Greear wrote:
>> > > >
>> > > > Was this with lockdep? If so, did it complain about anything?
>> > > >
>> > > > I'm having a hard time seeing why it would deadlock at all when wifi
>> > > > uses schedule_work() and therefore the system_percpu_wq, and
>> > > > __lru_add_drain_all() flushes lru_add_drain_work on mm_percpu_wq, and
>> > > > lru_add_and_bh_lrus_drain() doesn't really _seem_ to do anything related
>> > > > to RTNL etc.?
>> > > >
>> > > > I think we need a real explanation here rather than "if I randomly
>> > > > change this, it no longer appears".
>> > >
>> > > The path where iwlwifi acquires CMA holds rtnl and/or wiphy locks before
>> > > allocating CMA memory, as expected.
>> > >
>> > > And the CMA allocation path attempts to flush the work queues in
>> > > at least some cases.
>> > >
>> > > If there is a work item queued that is trying to grab rtnl and/or wiphy lock
>> > > when CMA attempts to flush, then the flush work cannot complete, so it deadlocks.
>> > >
>> > > Lockdep doesn't warn about this.
>> >
>> > It really should, in cases where it can actually happen, I wrote the
>> > code myself for that... Though things have changed since, and the checks
>> > were lost at least once (and re-added), so I suppose it's possible that
>> > they were lost _again_, but the flushing system is far more flexible now
>> > and it's not flushing the same workqueue anyway, so it shouldn't happen.
>> >
>> > I stand by what I said before, need to show more precisely what depends
>> > on what, and I'm not going to accept a random kthread into this.
>>
>> My first email on the topic has process stack traces as well as lockdep
>> locks-held printout that points to the deadlock. I'm not sure what else to offer...please let me know
>> what you'd like to see.
>
> Fair. I don't know, I don't think there's anything that even shows that
> there's a dependency between the two workqueues and the
> "((wq_completion)events_unbound)" and "((wq_completion)events)", and
> there would have to be for it to deadlock this way because of that?
>
Given the locks held [1],

kworker/1:0/39480 kworker/u32:11/34989
rtnl_mutex
&rdev->wiphy.mtx
__lru_add_drain_all
flush_work(&per_cpu(lru_add_drain_work, cpu))
&rdev->wiphy.mtx

__if__ there is one work item queued __before__ one of the flush targets on
the workqueue and it acquires the rtnl mutex, then no deadlock can arise:
worker-xyz gets off CPU after failing to take the rtnl lock, then
worker-xyz+1 dequeues the flush target and completes it, because the flush
target has nothing to do with rtnl. The same applies to the wiphy lock.
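
To make that ordering argument concrete, here is a rough userspace model,
not kernel code. It assumes a Python thread pool can stand in for a
workqueue's worker pool and a plain lock for rtnl_mutex; run_model,
needs_rtnl and flush_target are hypothetical names for illustration only.
The two-worker case mirrors the worker-xyz / worker-xyz+1 hand-off; the
one-worker case is only a contrast showing what strictly serial execution
would look like.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_model(num_workers: int, timeout: float = 1.0) -> bool:
    """Return True if the flush target completes while the lock is still held."""
    rtnl = threading.Lock()        # stand-in for rtnl_mutex (assumption)
    started = threading.Event()    # set once the first work item is running

    def needs_rtnl():
        # Work item queued *before* the flush target; it wants the lock.
        started.set()
        with rtnl:                 # sleeps here while the flusher holds the lock
            pass

    def flush_target():
        # Stand-in for the drain work: it takes no locks at all.
        return "drained"

    pool = ThreadPoolExecutor(max_workers=num_workers)
    with rtnl:                                # the flusher holds the lock
        pool.submit(needs_rtnl)               # queued first, blocks on the lock
        started.wait()                        # make sure it really is running
        target = pool.submit(flush_target)    # queued second
        try:
            target.result(timeout=timeout)    # models flush_work() on the target
            flushed = True
        except FutureTimeout:
            flushed = False
    pool.shutdown(wait=True)                  # lock released, everything drains
    return flushed


if __name__ == "__main__":
    print("two workers, flush completes:", run_model(num_workers=2))
    print("one worker, flush completes:", run_model(num_workers=1, timeout=0.2))
```

With two workers the blocked item just sleeps and the second worker finishes
the flush target, so the flusher is never stuck. With a single worker the
flush target sits behind the sleeping item until the lock is dropped.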

BTW, is there any chance that work acquiring the rtnl lock gets queued on
mm_percpu_wq?

[1] Subject: 6.18.13 iwlwifi deadlock allocating cma while work-item is active.
https://lore.kernel.org/linux-wireless/fa4e82ee-eb14-3930-c76c-f3bd59c5f258@xxxxxxxxxxxxxxx/