Re: 6.18.13 iwlwifi deadlock allocating cma while work-item is active.
From: Johannes Berg
Date: Tue Mar 03 2026 - 07:02:36 EST
On Mon, 2026-03-02 at 07:50 -0800, Ben Greear wrote:
> On 3/2/26 07:38, Johannes Berg wrote:
> > On Mon, 2026-03-02 at 07:26 -0800, Ben Greear wrote:
> > >
> > > >
> > > > Was this with lockdep? If so, did it complain about anything?
> > > >
> > > > I'm having a hard time seeing why it would deadlock at all when wifi
> > > > uses schedule_work() and therefore the system_percpu_wq, and
> > > > __lru_add_drain_all() flushes lru_add_drain_work on mm_percpu_wq, and
> > > > lru_add_and_bh_lrus_drain() doesn't really _seem_ to do anything related
> > > > to RTNL etc.?
> > > >
> > > > I think we need a real explanation here rather than "if I randomly
> > > > change this, it no longer appears".
> > >
> > > The path where iwlwifi acquires CMA holds rtnl and/or wiphy locks before
> > > allocating CMA memory, as expected.
> > >
> > > And the CMA allocation path attempts to flush the work queues in
> > > at least some cases.
> > >
> > > If there is a work item queued that is trying to grab rtnl and/or wiphy lock
> > > when CMA attempts to flush, then the flush work cannot complete, so it deadlocks.
> > >
> > > Lockdep doesn't warn about this.
> >
> > It really should, in cases where it can actually happen, I wrote the
> > code myself for that... Though things have changed since, and the checks
> > were lost at least once (and re-added), so I suppose it's possible that
> > they were lost _again_, but the flushing system is far more flexible now
> > and it's not flushing the same workqueue anyway, so it shouldn't happen.
> >
> > I stand by what I said before: we need to show more precisely what
> > depends on what, and I'm not going to accept a random kthread into this.
>
> My first email on the topic has process stack traces as well as lockdep
> locks-held printout that points to the deadlock. I'm not sure what else
> to offer...please let me know what you'd like to see.
Fair. I don't know. I don't think there's anything that even shows a
dependency involving the two workqueues and the
"((wq_completion)events_unbound)" and "((wq_completion)events)" entries,
and there would have to be one for it to deadlock this way because of that?
But one is mm_percpu_wq and the other is system_percpu_wq.
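
To spell out the reasoning, here is a rough userspace sketch in Python (a
toy model, not kernel code and not how lockdep is implemented). The node
labels and the `has_cycle` helper are made up for illustration. An edge
"A -> B" means "A cannot make progress until B does". With only the
dependencies described in this thread there is no cycle; a deadlock needs an
extra edge from the mm_percpu_wq work to the system_percpu_wq work, which is
purely hypothetical here.

```python
# Toy wait-for graph; all node names are illustrative labels, not real
# lockdep class names.

def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    state = {}  # node -> "visiting" while on the DFS stack, "done" afterwards

    def visit(node):
        if state.get(node) == "visiting":
            return True   # reached a node already on the stack: cycle
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        if any(visit(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in graph)


# Dependencies as described in the thread.
known = [
    # The CMA allocation runs with rtnl/wiphy held and flushes the drain work.
    ("cma_alloc holding rtnl/wiphy", "lru_add_drain_work on mm_percpu_wq"),
    # The wifi work item needs the lock that the allocating task holds.
    ("wifi work on system_percpu_wq", "cma_alloc holding rtnl/wiphy"),
]

# Assumption for illustration only: the drain work somehow has to wait for
# the wifi work. Nothing in the thread shows that this edge exists.
hypothetical = (
    "lru_add_drain_work on mm_percpu_wq",
    "wifi work on system_percpu_wq",
)

print("cycle with known edges:", has_cycle(known))
print("cycle with hypothetical edge:", has_cycle(known + [hypothetical]))
```

So the whole question is whether something provides that third edge.
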
Tejun, does the workqueue code somehow introduce a dependency between
different per-CPU workqueues that's not modelled in lockdep?
johannes