Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock

From: Sonam Sanju

Date: Wed Apr 01 2026 - 10:38:37 EST


From: Sonam Sanju <sonam.sanju@xxxxxxxxx>

On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> Building on the discussion so far, it would be helpful from the SRCU
> side to gather a bit more evidence to classify the issue.
>
> Calling synchronize_srcu_expedited() while holding a mutex is generally
> valid, so the observed behavior may be workload-dependent.

> The reported deadlock seems to rely on the assumption that SRCU grace
> period progress is indirectly blocked by irqfd workqueue saturation.
> It would be good to confirm whether that assumption actually holds.

I went back through our logs from two independent crash instances and
can now provide data for each of your questions.

> 1) Are SRCU GP kthreads/workers still making forward progress when
> the system is stuck?

No. In both crash instances, process_srcu work items remain permanently
"pending" (never "in-flight") throughout the entire hang.

Instance 1 - kernel 6.18.8, pool 14 (cpus=3):

[ 62.712760] workqueue rcu_gp: flags=0x108
[ 62.717801] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
[ 62.717801] pending: 2*process_srcu

[ 187.735092] workqueue rcu_gp: flags=0x108 (125 seconds later)
[ 187.735093] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
[ 187.735093] pending: 2*process_srcu (still pending)

9 consecutive dumps from t=62s to t=312s - process_srcu never runs.

Instance 2 - kernel 6.18.2, pool 22 (cpus=5):

[ 93.280711] workqueue rcu_gp: flags=0x108
[ 93.280713] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 93.280716] pending: process_srcu

[ 309.040801] workqueue rcu_gp: flags=0x108 (216 seconds later)
[ 309.040806] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
[ 309.040806] pending: process_srcu (still pending)

8 consecutive dumps from t=93s to t=341s - process_srcu never runs.

In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
where the kvm-irqfd-cleanup workers are blocked. Both pools have idle
workers but are marked as hung/stalled:

Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)

> 2) How many irqfd workers are active in the reported scenario, and
> can they saturate CPU or worker pools?

4 kvm-irqfd-cleanup workers in both instances, consistently across all
dumps:

Instance 1 (pool 14 / cpus=3):

[ 62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
[ 62.837838] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
[ 62.837838] in-flight: 157:irqfd_shutdown, 4044:irqfd_shutdown,
102:irqfd_shutdown, 39:irqfd_shutdown

Instance 2 (pool 22 / cpus=5):

[ 93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
[ 93.280896] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
[ 93.280900] in-flight: 151:irqfd_shutdown, 4246:irqfd_shutdown,
4241:irqfd_shutdown, 4243:irqfd_shutdown

These are from crosvm instances with multiple virtio devices
(virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
with a resampler. During VM shutdown, all irqfds are detached
concurrently, queueing that many irqfd_shutdown work items.

The 4 workers are not saturating CPU - they are all in D state. But they
ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.

> 3) Do we have a concrete wait-for cycle showing that tasks blocked
> on resampler_lock are in turn preventing SRCU GP completion?

Yes, in both instances the hung task dump identifies the mutex holder
stuck in synchronize_srcu, with the other workers waiting on the mutex.

Instance 1 (t=314s):

Worker pid 4044 - MUTEX HOLDER, stuck in synchronize_srcu:

[ 315.963979] task:kworker/3:8 state:D pid:4044
[ 315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
[ 316.012504] __synchronize_srcu+0x100/0x130
[ 316.023157] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)

Workers pid 39, 102, 157 - MUTEX WAITERS:

[ 314.793025] task:kworker/3:4 state:D pid:157
[ 314.837472] __mutex_lock+0x409/0xd90
[ 314.843100] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)

Instance 2 (t=343s):

Worker pid 4241 - MUTEX HOLDER, stuck in synchronize_srcu:

[ 343.193294] task:kworker/5:4 state:D pid:4241
[ 343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
[ 343.193328] __synchronize_srcu+0x100/0x130
[ 343.193335] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)

Workers pid 151, 4243, 4246 - MUTEX WAITERS:

[ 343.193369] task:kworker/5:6 state:D pid:4243
[ 343.193397] __mutex_lock+0x37d/0xbb0
[ 343.193397] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)

Both instances show an identical wait-for cycle:

1. One worker holds resampler_lock, blocks in __synchronize_srcu
(waiting for SRCU grace period)
2. SRCU GP completion needs process_srcu to run - but it stays "pending"
   on the same pool
3. Other irqfd workers block on __mutex_lock in the same pool
4. The pool is marked "hung" and no pending work makes progress
for 250-300 seconds until kernel panic

> 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> and kvm_irqfd_assign() paths?

In our 4 crash instances the stuck mutex holder is always in
irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu). This
is consistent â?? these are all VM shutdown scenarios where only
irqfd_shutdown workqueue items run.

The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
during a VM create/destroy stress test where assign and shutdown race.
His traces showed kvm_irqfd (the assign path) stuck in
synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.

> If SRCU GP remains independent, it would help distinguish whether
> this is a strict deadlock or a form of workqueue starvation / lock
> contention.

Based on the data from both instances, SRCU GP is NOT remaining
independent. process_srcu stays permanently pending on the affected
per-CPU pool for 250-300 seconds. And it is not just process_srcu:
ALL pending work on that pool is stuck, including items from the
events, cgroup, mm, slub, and other workqueues.


> A timestamp-correlated dump (blocked stacks + workqueue state +
> SRCU GP activity) would likely be sufficient to classify this.

I hope the correlated dumps above from both instances are helpful.
To summarize the timeline (consistent across both):

t=0: VM shutdown begins, crosvm detaches irqfds
t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
One worker acquires resampler_lock, enters synchronize_srcu
Other 3 workers block on __mutex_lock
t=~43: First "BUG: workqueue lockup" - pool detected stuck
rcu_gp: process_srcu shown as "pending" on same pool
t=~93 through t=~312: Repeated dumps every ~30s
process_srcu remains permanently "pending"
Pool has idle workers but no pending work executes
t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
t=~316: init triggers sysrq crash - kernel panic

> Happy to help look at traces if available.

I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
instances. Shall I post them or send them off-list?

Thanks,
Sonam