[PATCH] x86/irq: Cure live lock in irq_force_complete_move()

From: Thomas Gleixner
Date: Fri Mar 11 2016 - 15:20:52 EST

Harry reported, that he's able to trigger a system freeze with cpu hot
unplug. The freeze turned out to be a live lock caused by recent changes in

When fixup_irqs() and from there irq_force_complete_move() is called on the
dying cpu, then all other cpus are in stop machine and wait for the dying cpu
to complete the teardown. If there is a move of an interrupt pending then
irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain
mask and waits for them to clear the mask. That's obviously impossible as
those cpus are firmly stuck in stop machine with interrupts disabled.

I should have known that, but I completely overlooked it being concentrated on
the locking issues around the vectors. And the existance of the call to
__irq_complete_move() in the code, which actually sends the cleanup IPI made
it look completely logical that waiting for that cleanup to complete is the
right thing to do. That call was bogus even before the recent changes, but it
was the pointless distraction which tricked me not to see the real issue. :(

So looking deeper into the issue I discovered that the cleanup of the vectors
is actually pretty simple. We have to look at three cases:

1) The move_in_progress flag of the interrupt is set

A) The interrupt must be moved in interrupt context, i.e. the affinity
change takes place when the next interrupt happens.

In that case the io_apic is not yet updated to the new vector, so we can
simply restore the target domain mask to the previous state,
i.e. old_domain, and restore the old vector in the configuration data.

Further we need to check whether the affinity update actually changed
the vector or merily reduced the target mask. If it's a new vector, then
we need to clear the vector entries of the new vector.

This undoes the pending affinity change to the old target, but with the
outgoing cpu cleared in the target domain mask.

B) The interrupt can be moved in any context, i.e. the io_apic has been
updated with the new vector already, but no interrupt was delivered
after that update, so we know for sure, that the next interrupt will be
delivered to the new vector.

So it's the same as case #2 where the cleanup IPI has been issued
already and the domain cpu mask is not yet empty. See below.

2) The move_in_progress flag is not set and the old_domain cpu mask is not

That means, that an interrupt was delivered after the change and the
cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have
responded to it yet.

It does not matter in which context the io_apic update happened, the
io_apic contains the new vector already. See also case 1B)

So we know at this point that the next interrupt will arrive on the new
vector, so we can safely cleanup the old vectors on the cpus in the
old_domain cpu mask.

Fixes: 98229aa36caa "x86/irq: Plug vector cleanup race"
Reported-by: Harry Junior <harryjr@xxxxxxxxxx>
Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
arch/x86/include/asm/hw_irq.h | 1
arch/x86/kernel/apic/vector.c | 94 +++++++++++++++++++++++++++++++++---------
2 files changed, 77 insertions(+), 18 deletions(-)

--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -141,6 +141,7 @@ struct irq_alloc_info {
struct irq_cfg {
unsigned int dest_apicid;
u8 vector;
+ u8 old_vector;

extern struct irq_cfg *irq_cfg(unsigned int irq);
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -213,6 +213,7 @@ static int __assign_irq_vector(int irq,
cpumask_and(d->old_domain, d->old_domain, cpu_online_mask);
d->move_in_progress = !cpumask_empty(d->old_domain);
+ d->cfg.old_vector = d->move_in_progress ? d->cfg.vector : 0;
d->cfg.vector = vector;
cpumask_copy(d->domain, vector_cpumask);
@@ -655,46 +656,103 @@ void irq_complete_move(struct irq_cfg *c

- * Called with @desc->lock held and interrupts disabled.
+ * Called from fixup_irqs() with @desc->lock held and interrupts disabled.
void irq_force_complete_move(struct irq_desc *desc)
struct irq_data *irqdata = irq_desc_get_irq_data(desc);
struct apic_chip_data *data = apic_chip_data(irqdata);
struct irq_cfg *cfg = data ? &data->cfg : NULL;
+ unsigned int cpu, v;

if (!cfg)

- __irq_complete_move(cfg, cfg->vector);
* This is tricky. If the cleanup of @data->old_domain has not been
* done yet, then the following setaffinity call will fail with
* -EBUSY. This can leave the interrupt in a stale state.
- * The cleanup cannot make progress because we hold @desc->lock. So in
- * case @data->old_domain is not yet cleaned up, we need to drop the
- * lock and acquire it again. @desc cannot go away, because the
- * hotplug code holds the sparse irq lock.
+ * All CPUs are stuck in stop machine with interrupts disabled so
+ * calling __irq_complete_move() would be completely pointless.
- /* Clean out all offline cpus (including ourself) first. */
+ /*
+ * Clean out all offline cpus (including the outgoing one) from the
+ * old_domain mask.
+ */
cpumask_and(data->old_domain, data->old_domain, cpu_online_mask);
- while (!cpumask_empty(data->old_domain)) {
+ /*
+ * If move_in_progress is cleared and the old_domain mask is empty,
+ * then there is nothing to cleanup. fixup_irqs() will take care of
+ * the stale vectors on the outgoing cpu.
+ */
+ if (!data->move_in_progress && cpumask_empty(data->old_domain)) {
- raw_spin_unlock(&desc->lock);
- cpu_relax();
- raw_spin_lock(&desc->lock);
+ return;
+ }
+ /*
+ * We have to distinguish three cases:
+ *
+ * 1) The interrupt is in move_in_progress state and the interrupt is
+ * not marked with IRQ_MOVE_PCNTXT. That means the io_apic still
+ * points to the old vector.
+ *
+ * 2) The interrupt is in move_in_progress state and the interrupt is
+ * marked with IRQ_MOVE_PCNTXT. That means the io_apic already has
+ * the new vector.
+ *
+ * 3) The interrupt has been moved, the io_apic has already the new
+ * vector, but the cleanup IPIs have not been processed yet.
+ *
+ * #2 and #3 can be handled in the same way as the old vector is not
+ * longer in use and the vector entries of the cpus in old_domain mask
+ * can be cleaned up safely now.
+ */
+ if (!irqd_can_move_in_process_context(irqdata) &&
+ data->move_in_progress) {
- * Reevaluate apic_chip_data. It might have been cleared after
- * we dropped @desc->lock.
+ * We restore old_domain (the offline cpus have been masked
+ * out), clear old domain and move_in_progress.
+ *
+ * If the old vector is the same as the new vector, then there
+ * is nothing to do here because the move only reduced the
+ * domain mask, but did not change the vector.
- data = apic_chip_data(irqdata);
- if (!data)
- return;
- raw_spin_lock(&vector_lock);
+ if (cfg->vector != cfg->old_vector) {
+ /*
+ * A move to a new vector is pending, but the new
+ * vector is not yet in use because the ioapic still
+ * has the old vector. Clean up the new vector.
+ */
+ v = cfg->vector;
+ for_each_cpu(cpu, data->domain)
+ per_cpu(vector_irq, cpu)[v] = VECTOR_UNUSED;
+ }
+ /*
+ * Restore the old vector in the configuration data and
+ * restore the old_domain mask.
+ */
+ cfg->vector = cfg->old_vector;
+ cpumask_copy(data->domain, data->old_domain);
+ } else {
+ /*
+ * If old_domain is not empty, then other cpus still have the
+ * irq descriptor set in their vector array. Clean it up, it's
+ * not longer possible that the interrupt happens on that
+ * vector.
+ */
+ v = cfg->old_vector;
+ for_each_cpu(cpu, data->old_domain)
+ per_cpu(vector_irq, cpu)[v] = VECTOR_UNUSED;
+ /* Cleanup the left overs of the (half finished) move */
+ cpumask_clear(data->old_domain);
+ data->move_in_progress = 0;
+ cfg->old_vector = 0;