possible deadlock due to irq_set_thread_affinity() calling into the scheduler (was Re: [PATCH v3 38/62] KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU)
From: Paolo Bonzini
Date: Mon Dec 22 2025 - 09:09:25 EST
On 12/22/25 10:16, Ankit Soni wrote:
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.19.0-rc2 #20 Tainted: G E
> ------------------------------------------------------
> CPU 58/KVM/28597 is trying to acquire lock:
> ff12c47d4b1f34c0 (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0
>
> but task is already holding lock:
> ff12c49b28552110 (&svm->ir_list_lock){....}-{2:2}, at: avic_pi_update_irte+0x147/0x270 [kvm_amd]
>
> which lock already depends on the new lock.
>
> Chain exists of:
>   &irq_desc_lock_class --> &rq->__lock --> &svm->ir_list_lock
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&svm->ir_list_lock);
>                                lock(&rq->__lock);
>                                lock(&svm->ir_list_lock);
>   lock(&irq_desc_lock_class);
>
>   *** DEADLOCK ***
>
> So lockdep sees:
>
>   &irq_desc_lock_class -> &rq->__lock -> &svm->ir_list_lock
>
> while avic_pi_update_irte() currently holds svm->ir_list_lock and then
> takes irq_desc_lock via irq_set_vcpu_affinity(), which creates the
> potential inversion.
>
> - Is this lockdep warning expected/benign in this code path, or does it
>   indicate a real potential deadlock between svm->ir_list_lock and
>   irq_desc_lock with AVIC + irq_bypass + VFIO?
I'd treat it as a potential (if unlikely) deadlock:
(a) irq_set_thread_affinity() triggers the scheduler via wake_up_process(),
while irq_desc->lock is held

(b) the scheduler calls into KVM with the rq lock held, and KVM takes
ir_list_lock within __avic_vcpu_load()/__avic_vcpu_put()

(c) KVM wants to block scheduling for a while and uses ir_list_lock for
that purpose, but then irq_set_vcpu_affinity() takes irq_desc->lock.
I don't think there's an alternative choice of lock for (c); and there's
no easy way to pull the irq_desc->lock out of the IRQ subsystem--in fact
the stickiness of the situation comes from rq->__lock and
irq_desc->lock both being internal and not leaf locks.
Of the three, the most sketchy is (a); notably, __setup_irq() calls
wake_up_process() outside desc->lock.  Therefore I'd very much like to
treat it as a kernel/irq/ bug, and the simplest (perhaps too simple...)
fix is to drop the wake_up_process().  The only cost is extra latency on
the next interrupt after an affinity change.
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 8b1b4c8a4f54..fc135bd079a4 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -189,14 +189,10 @@ static void irq_set_thread_affinity(struct irq_desc *desc)
 	struct irqaction *action;
 
 	for_each_action_of_desc(desc, action) {
-		if (action->thread) {
+		if (action->thread)
 			set_bit(IRQTF_AFFINITY, &action->thread_flags);
-			wake_up_process(action->thread);
-		}
-		if (action->secondary && action->secondary->thread) {
+		if (action->secondary && action->secondary->thread)
 			set_bit(IRQTF_AFFINITY, &action->secondary->thread_flags);
-			wake_up_process(action->secondary->thread);
-		}
 	}
 }
 
Marc, what do you think?
Paolo