Re: [PATCH v2] KVM: eventfd: Use WQ_UNBOUND workqueue for irqfd cleanup - New logs confirm preemption race

From: Sonam Sanju

Date: Tue Apr 21 2026 - 12:59:58 EST

Hi Vineeth, Kunwu, Tejun,

Collected new crash logs with additional debug instrumentation in
wq_worker_sleeping(), kick_pool(), and show_one_worker_pool() to capture
pool state during the hang. The results conclusively confirm Vineeth's
preemption race theory.

>From the new logs:

1. Pool dump with nr_running/nr_idle (added instrumentation):

pool 10: cpus=2 flags=0x0 hung=201s workers=11 nr_running=1 nr_idle=5

11 workers, 5 idle, 6 in D-state (all irqfd_shutdown) -- yet
nr_running=1. No worker is actually running on CPU 2.

2. NMI backtrace confirms CPU 2 is completely idle:

NMI backtrace for cpu 2 skipped: idling at intel_idle+0x57/0xa0

So nr_running=1 is a phantom count -- no worker is running, but
the pool thinks one is.

3. The first stuck worker (kworker/2:0, PID 33) shows the preemption
in wq_worker_sleeping:

kworker/2:0 state:D Workqueue: kvm-irqfd-cleanup irqfd_shutdown
__schedule+0x87a/0xd60
preempt_schedule_irq+0x4a/0x90
asm_fred_entrypoint_kernel+0x41/0x70
___ratelimit+0x1a1/0x1f0 <-- inside pr_info_ratelimited
wq_worker_sleeping+0x53/0x190 <-- preempted HERE
schedule+0x30/0xe0
schedule_preempt_disabled+0x10/0x20
__mutex_lock+0x413/0xe40
irqfd_resampler_shutdown+0x53/0x200
irqfd_shutdown+0xfa/0x190

This confirms the exact race: a reschedule IPI interrupted
wq_worker_sleeping() after worker->sleeping was set to 1 but
before pool->nr_running was decremented. The preemption triggered
wq_worker_running() which incremented nr_running (1->2), then
on resume the decrement brought it back to 1 instead of 0.

4. The second pool dump 31 seconds later shows the stall is permanent:

pool 10: cpus=2 flags=0x0 hung=232s workers=11 nr_running=1 nr_idle=5

Same phantom nr_running=1, hung time growing.

5. The deadlock chain:
- PID 33: holds resampler_lock mutex, stuck in wq_worker_sleeping
- PID 520: past mutex, stuck in synchronize_srcu_expedited
- PIDs 120, 4792, 4793, 4796: waiting on resampler_lock mutex
- crosvm_vcpu2: waiting in kvm_vm_release -> __flush_workqueue
- init (PID 1): stuck in pci_device_shutdown -> __flush_work
- Multiple userspace processes stuck in fsnotify_destroy_group
- Reboot thread timed out, system triggered sysrq crash

6. kick_pool_skip debug print fired for other pools but NOT for
pool 10 -- because need_more_worker() was never true (nr_running
was never 0), so kick_pool() was never even called for this pool.

Regarding a fix, we can consider a workqueue-level fix in
wq_worker_sleeping() itself:

void wq_worker_sleeping(struct task_struct *task)
{
...
if (READ_ONCE(worker->sleeping))
return;

+ preempt_disable();
WRITE_ONCE(worker->sleeping, 1);
raw_spin_lock_irq(&pool->lock);

if (worker->flags & WORKER_NOT_RUNNING) {
raw_spin_unlock_irq(&pool->lock);
+ preempt_enable();
return;
}

pool->nr_running--;
if (kick_pool(pool))
worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;

raw_spin_unlock_irq(&pool->lock);
+ preempt_enable();
}

The idea is to disable preemption from sleeping=1 until we hold the pool
lock (which disables IRQs). This prevents the reschedule IPI from
triggering preempt_schedule_irq() in this window. Note that
wq_worker_running() already uses preempt_disable/enable around its
nr_running++ for a similar race against unbind_workers().

Does this approach look correct to you?

Thanks,
Sonam