Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
From: Vineeth Pillai
Date: Thu Apr 16 2026 - 21:18:55 EST
Consolidating replies into one thread.
Hi Kunwu,
> One thing that is still unclear is dispatch behavior:
> `process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.
>
> So the key question is: what prevents pending work from being dispatched on that pwq?
> Is it due to:
> 1) pwq stalled/hung state,
> 2) worker availability/affinity constraints,
> 3) or another dispatch-side condition?
>
> Also, for scope:
> - your crash instances consistently show the shutdown path
> (irqfd_resampler_shutdown + synchronize_srcu),
> - while assign-path evidence, per current thread data, appears to come
> from a separate stress case.
> A time-aligned dump with pwq state, pending/in-flight lists, and worker states
> should help clarify this.
I have a dmesg log showing this issue. This is from an automated stress
reboot test. The log is very similar to what Sonam shared.
<0>[ 434.338427] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 293s!
<6>[ 434.339037] Showing busy workqueues and worker pools:
<6>[ 434.339387] workqueue events: flags=0x100
<6>[ 434.339667] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=2 refcnt=3
<6>[ 434.339691] pending: 2*xhci_dbc_handle_events
<6>[ 434.340512] workqueue events: flags=0x100
<6>[ 434.340789] pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.340793] pending: vmstat_shepherd
<6>[ 434.341507] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=45 refcnt=46
<6>[ 434.341511] pending: delayed_vfree_work, kernfs_notify_workfn, 5*destroy_super_work, 3*bpf_prog_free_deferred, 5*destroy_super_work, binder_deferred_func, bpf_prog_free_deferred, 25*destroy_super_work, drain_local_memcg_stock, update_stats_workfn, psi_avgs_work
<6>[ 434.343578] pwq 30: cpus=7 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.343582] in-flight: 325:do_emergency_remount
<6>[ 434.344376] workqueue events_unbound: flags=0x2
<6>[ 434.344688] pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=2 refcnt=3
<6>[ 434.344693] in-flight: 339:fsnotify_connector_destroy_workfn fsnotify_connector_destroy_workfn
<6>[ 434.345755] pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=2 refcnt=8
<6>[ 434.345759] in-flight: 153:fsnotify_mark_destroy_workfn BAR(3098) BAR(2564) BAR(2299) fsnotify_mark_destroy_workfn BAR(416) BAR(1116)
<6>[ 434.347151] workqueue events_freezable: flags=0x104
<6>[ 434.347590] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.347595] pending: pci_pme_list_scan
<6>[ 434.348681] workqueue events_power_efficient: flags=0x180
<6>[ 434.349221] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.349226] pending: check_lifetime
<6>[ 434.350397] workqueue rcu_gp: flags=0x108
<6>[ 434.350853] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
<6>[ 434.350857] pending: 3*process_srcu
<6>[ 434.351918] workqueue slub_flushwq: flags=0x8
<6>[ 434.352409] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=3
<6>[ 434.352413] pending: flush_cpu_slab BAR(1)
<6>[ 434.353529] workqueue mm_percpu_wq: flags=0x8
<6>[ 434.354087] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
<6>[ 434.354092] pending: vmstat_update
<6>[ 434.355205] workqueue quota_events_unbound: flags=0xa
<6>[ 434.355725] pwq 34: cpus=0-7 node=0 flags=0x4 nice=0 active=1 refcnt=3
<6>[ 434.355730] in-flight: 354:quota_release_workfn BAR(325)
<6>[ 434.356980] workqueue kvm-irqfd-cleanup: flags=0x0
<6>[ 434.357582] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=3 refcnt=4
<6>[ 434.357586] in-flight: 51:irqfd_shutdown ,3453:irqfd_shutdown ,3449:irqfd_shutdown
<6>[ 434.359101] pool 22: cpus=5 node=0 flags=0x0 nice=0 hung=293s workers=11 idle: 282 154 3452 3451 3448 3450 3455 3454
<6>[ 434.359989] pool 30: cpus=7 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3460 332
<6>[ 434.360539] pool 34: cpus=0-7 node=0 flags=0x4 nice=0 hung=0s workers=5 idle: 256 66
The relevant pwq is pwq 22. All three irqfd_shutdown workers are in-flight
but in D state. rcu_gp's process_srcu items are stuck pending.
Worker 51 (kworker/5:0) — blocked acquiring resampler_lock:
<6>[ 440.576612] task:kworker/5:0 state:D stack:0 pid:51 tgid:51 ppid:2 task_flags:0x4208060 flags:0x00080000
<6>[ 440.577379] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[ 440.578085] <TASK>
<6>[ 440.578337] preempt_schedule_irq+0x4a/0x90
<6>[ 440.583712] __mutex_lock+0x413/0xe40
<6>[ 440.583969] irqfd_resampler_shutdown+0x23/0x150
<6>[ 440.584288] irqfd_shutdown+0x66/0xc0
<6>[ 440.584546] process_scheduled_works+0x219/0x450
<6>[ 440.584864] worker_thread+0x2a7/0x3b0
<6>[ 440.585421] kthread+0x230/0x270
Worker 3449 (kworker/5:4) — same, blocked acquiring resampler_lock:
<6>[ 440.671294] task:kworker/5:4 state:D stack:0 pid:3449 tgid:3449 ppid:2 task_flags:0x4208060 flags:0x00080000
<6>[ 440.672088] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[ 440.672662] <TASK>
<6>[ 440.673069] schedule+0x5e/0xe0
<6>[ 440.673708] __mutex_lock+0x413/0xe40
<6>[ 440.674059] irqfd_resampler_shutdown+0x23/0x150
<6>[ 440.674381] irqfd_shutdown+0x66/0xc0
<6>[ 440.674638] process_scheduled_works+0x219/0x450
<6>[ 440.674956] worker_thread+0x2a7/0x3b0
<6>[ 440.675308] kthread+0x230/0x270
Worker 3453 (kworker/5:8) — holds resampler_lock, blocked waiting for SRCU GP:
<6>[ 440.677368] task:kworker/5:8 state:D stack:0 pid:3453 tgid:3453 ppid:2 task_flags:0x4208060 flags:0x00080000
<6>[ 440.678185] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
<6>[ 440.678720] <TASK>
<6>[ 440.679127] schedule+0x5e/0xe0
<6>[ 440.679354] schedule_timeout+0x2e/0x130
<6>[ 440.680084] wait_for_common+0xf7/0x1f0
<6>[ 440.680355] synchronize_srcu_expedited+0x109/0x140
<6>[ 440.681164] irqfd_resampler_shutdown+0xf0/0x150
<6>[ 440.681481] irqfd_shutdown+0x66/0xc0
<6>[ 440.681738] process_scheduled_works+0x219/0x450
<6>[ 440.682055] worker_thread+0x2a7/0x3b0
<6>[ 440.682403] kthread+0x230/0x270
The sequence: worker 3453 acquires resampler_lock and calls
synchronize_srcu_expedited() while holding it. This queues process_srcu
on rcu_gp, then blocks waiting for the grace period to complete.
Workers 51 and 3449 are blocked trying to acquire the same resampler_lock.
Regarding your dispatch question: all three workers are in D state, so
they have all called schedule() and wq_worker_sleeping() should have
decremented pool->nr_running to zero. With nr_running == 0 and
process_srcu in the worklist, needs_more_worker() should be true and an
idle worker should be woken via kick_pool() when process_srcu is enqueued.
Why none of the 8 idle workers end up dispatching process_srcu is not
entirely clear to me.
Moving synchronize_srcu_expedited() out of resampler_lock does fix the
issue, but it is still not clear to me why the deadlock between the
irqfd_shutdown workers causes the workqueue to stall.
The full dmesg is at: https://gist.github.com/vineethrp/883db560a4503612448db9b10e02a9b5
Hi Paul,
> Just to be clear, I am guessing that you have the workqueues counterpart
> to a fork bomb. However, if you are using a small finite number of
> workqueue handlers, then we need to make adjustments in SRCU, workqueues,
> or maybe SRCU's use of workqueues.
In this log, I do not see the workqueue being stressed: there are 8 idle
workers, yet for some reason no worker is assigned to run process_srcu.
I am not sure whether this is a workqueue race condition, or whether it is
working as intended by not kicking new workers while in-flight workers sit
in D state.
> SRCU and RCU use their own workqueue, which no one else should be
> using (and that prohibition most definitely includes the irqfd workers).
kvm-irqfd-cleanup and rcu_gp, while separate workqueues, share the same
per-CPU pool (pool 22). Both are CPU-bound: rcu_gp's flags=0x108 decode
to WQ_MEM_RECLAIM (0x8) plus what I take to be the per-CPU bit (0x100),
not WQ_UNBOUND (which is 0x2, as events_unbound's flags=0x2 shows), so
its pwq for CPU 5 resolves to the same per-CPU pool (pool 22, flags=0x0)
as kvm-irqfd-cleanup (flags=0x0). CPU-bound workqueues share the per-CPU
worker pools regardless of being separate workqueues, so these two end
up competing for the same underlying pool's workers.
Making kvm-irqfd-cleanup unbound (WQ_UNBOUND) would place it on a
separate pool from rcu_gp, which I suspect would prevent this
interference and avoid the stall.
Thanks,
Vineeth