Re: [PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock

From: Sonam Sanju

Date: Mon Mar 23 2026 - 01:38:48 EST

On Fri, Mar 20, 2026 at 08:56:33AM -0400, Vineeth Pillai (Google) wrote:
> I think we might have this issue in the kvm_irqfd_assign path as well
> where synchronize_srcu_expedited is called with the resampler_lock
> held. I saw similar lockup during a stress test where VMs were created
> and destroyed continuously. I could see one task waiting on SRCU GP:
>
> [ T93] task:crosvm_security state:D stack:0 pid:8215 tgid:8215 ppid:1 task_flags:0x400000 flags:0x00080002.
> [ T93] Call Trace:
> [ T93] synchronize_srcu_expedited+0x109/0x140
> [ T93] kvm_irqfd+0x362/0x5e0
> [ T93] kvm_vm_ioctl+0x706/0x780
>
> And another task waiting on the mutex:
>
> [ C0] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> [ C0] __mutex_lock+0x413/0xe40
> [ C0] irqfd_resampler_shutdown+0x23/0x150
> [ C0] irqfd_shutdown+0x66/0xc0
>
> The work queue was full as well I think:
>
> [ C0] pwq 46: cpus=11 node=0 flags=0x0 nice=0 active=1024 refcnt=2062

Yes, you are right. The kvm_irqfd_assign() path has the same deadlock pattern.

> There were other tasks waiting for SRCU GP completion in the resampler
> shutdown path. Also, there were other traces showing lockups (mostly in
> mm), but I think that's a secondary effect of this lockup and might not
> be relevant.

Yes, that matches what we see on our side as well: the primary deadlock
in the KVM irqfd paths causes cascading failures: workqueue starvation
leads to blocked do_sync_work (superblock sync), fsnotify workers stuck
on __synchronize_srcu, and eventually init (pid 1) blocks in
ext4_put_super -> __flush_work. The mm lockups you see are almost
certainly secondary effects.

Will send v2 shortly with both paths fixed in a single patch.
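
For readers following the thread, the locking change under discussion can
be modeled roughly as below. This is a userspace sketch only, not the v2
patch and not kernel code: model_lock, wait_for_readers(), pending and the
two detach_*() helpers are hypothetical stand-ins for resampler_lock, the
SRCU grace-period wait and the resampler list update. It only shows where
the blocking wait sits relative to the mutex.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for resampler_lock. */
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;

static int lock_held;        /* 1 while model_lock is held by the caller */
static int waits_under_lock; /* blocking waits that ran with the lock held */
static int pending = 2;      /* entries still attached to the shared list */

/* Hypothetical stand-in for the grace-period wait. A real wait may
 * block for a long time; here we only record whether the lock was
 * held when the wait started. */
static void wait_for_readers(void)
{
	if (lock_held)
		waits_under_lock++;
}

static void model_lock_acquire(void)
{
	pthread_mutex_lock(&model_lock);
	lock_held = 1;
}

static void model_lock_release(void)
{
	lock_held = 0;
	pthread_mutex_unlock(&model_lock);
}

/* Shape reported as problematic: the wait happens inside the
 * critical section, so every other task needing the lock is
 * stuck behind the wait. */
static void detach_wait_inside_lock(void)
{
	model_lock_acquire();
	assert(pending > 0);
	pending--;              /* unlink the entry */
	wait_for_readers();     /* blocking wait with the lock held */
	model_lock_release();
}

/* Shape proposed in the thread: unlink under the lock, drop the
 * lock, and only then wait. */
static void detach_wait_after_unlock(void)
{
	model_lock_acquire();
	assert(pending > 0);
	pending--;              /* unlink the entry */
	model_lock_release();
	wait_for_readers();     /* blocking wait with the lock released */
}
```

Calling detach_wait_inside_lock() once bumps waits_under_lock, while
detach_wait_after_unlock() leaves it unchanged. In this model the second
shape never keeps other lock waiters blocked for the duration of the wait.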

--
Sonam Sanju