[PATCH 1/3] alpha: smp: Serialize all synchronous IPI operations to fix SMP deadlock

From: Matt Turner

Date: Sat May 30 2026 - 16:26:41 EST

When two or more CPUs simultaneously call any function that uses
on_each_cpu(wait=1) or smp_call_function(wait=1), they deadlock: each
spins in csd_lock_wait() waiting for the remote CPU to signal CSD
completion. While spinning, neither CPU can receive the other's IPI, so
neither completion signal arrives and the system hangs permanently.
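
As a rough userspace illustration of the circular wait described above
(a sketch only, not kernel code: the two threads, pending[], SPIN_LIMIT
and count_stuck_cpus() are hypothetical stand-ins, and the bounded spin
replaces what would be an unbounded wait in csd_lock_wait), each "CPU"
posts a request to its peer and then waits for completion without ever
servicing the request addressed to itself:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_LIMIT 100000L	/* bound the wait so the demo terminates */

static atomic_int pending[2];	/* hypothetical stand-in for per-CPU IPI bits */
static atomic_int stuck;	/* CPUs whose request was never completed */

static void *cpu_thread(void *arg)
{
	int cpu = (int)(long)arg;
	int other = 1 - cpu;
	long spins = 0;

	assert(cpu == 0 || cpu == 1);

	atomic_store(&pending[other], 1);	/* "send" a request to the peer */

	/* Wait for completion without ever looking at pending[cpu]. */
	while (atomic_load(&pending[other]) && spins < SPIN_LIMIT)
		spins++;

	if (atomic_load(&pending[other]))
		atomic_fetch_add(&stuck, 1);
	return NULL;
}

/* Returns how many of the two simulated CPUs gave up waiting. */
int count_stuck_cpus(void)
{
	pthread_t t[2];
	long i;

	atomic_store(&stuck, 0);
	atomic_store(&pending[0], 0);
	atomic_store(&pending[1], 0);
	for (i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, cpu_thread, (void *)i);
	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&stuck);
}
```

In this model count_stuck_cpus() returns 2, because neither thread ever
clears the request posted to it. The kernel has no spin limit, so the
equivalent wait never ends.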

Affected callers: smp_imb, flush_tlb_all, flush_tlb_mm, flush_tlb_page,
flush_icache_user_page (smp.c) and migrate_flush_tlb_page (tlbflush.c).

Introduce alpha_smp_ipi_lock (a plain spinlock, defined in smp.c and
declared in asm/smp.h) and take it in all six callers. Rather than
spin_lock(), use a trylock loop with alpha_drain_ipi(): a CPU that fails
to take the lock drains any pending IPI bits on the local CPU before
retrying. This is necessary because some callers run with IRQs disabled
(e.g. paths that take spin_lock_irqsave), so no RTC interrupt will fire
to rescue a lost wripir edge via alpha_poll_ipi_inirq().
alpha_drain_ipi() calls handle_ipi() under local_irq_save/restore, which
satisfies handle_ipi()'s requirement that IRQs be disabled without
touching lockdep hardirq-context state.
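
The trylock-and-drain pattern can be sketched in userspace as follows.
This is a minimal model under stated assumptions, not the kernel
implementation: a pthread mutex stands in for alpha_smp_ipi_lock,
pending[] for the per-CPU IPI bits, drain_ipi() for alpha_drain_ipi(),
and sync_call() for a wait=1 caller. The names handled[], finished,
ROUNDS and run_demo() are hypothetical and exist only for the demo.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NCPU	2
#define ROUNDS	200

static pthread_mutex_t ipi_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int pending[NCPU];	/* request posted to this CPU */
static atomic_int handled[NCPU];	/* requests this CPU has serviced */
static atomic_int finished;		/* CPUs done issuing requests */

/* Service a request addressed to this CPU, if there is one. */
static void drain_ipi(int cpu)
{
	if (atomic_exchange(&pending[cpu], 0))
		atomic_fetch_add(&handled[cpu], 1);
}

/* Model of a synchronous (wait=1) cross-CPU call. */
static void sync_call(int cpu)
{
	int other = 1 - cpu;

	assert(cpu == 0 || cpu == 1);

	/* Losing the lock race means the holder may be waiting on us. */
	while (pthread_mutex_trylock(&ipi_lock) != 0) {
		drain_ipi(cpu);
		sched_yield();
	}

	atomic_store(&pending[other], 1);	/* post the request */
	while (atomic_load(&pending[other]))	/* wait for completion */
		sched_yield();

	pthread_mutex_unlock(&ipi_lock);
}

static void *cpu_thread(void *arg)
{
	int cpu = (int)(long)arg;
	int i;

	for (i = 0; i < ROUNDS; i++)
		sync_call(cpu);

	/* Keep servicing requests until the peer has finished too. */
	atomic_fetch_add(&finished, 1);
	while (atomic_load(&finished) < NCPU) {
		drain_ipi(cpu);
		sched_yield();
	}
	return NULL;
}

/* Runs both simulated CPUs; returns the total number of serviced requests. */
int run_demo(void)
{
	pthread_t t[NCPU];
	long i;

	for (i = 0; i < NCPU; i++)
		pthread_create(&t[i], NULL, cpu_thread, (void *)i);
	for (i = 0; i < NCPU; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&handled[0]) + atomic_load(&handled[1]);
}
```

In this model the lock ensures that only one thread at a time waits for
completion, and the drain call inside the trylock loop is what lets the
holder's request complete. run_demo() returns 2 * ROUNDS once every
request has been serviced.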

This fix is necessary but not sufficient. A separate, independent
deadlock path exists: if the target CPU is inside do_entInt at IPL=7
when wripir fires, the hardware IPI edge is lost and the sending CPU
spins forever, even when only one CPU is issuing a wait=1 call. That
race is fixed by alpha_poll_ipi_inirq() in the follow-on commit; both
fixes are required for a complete solution.

The deadlock has been observed on EV7/Marvel under workloads generating
a high rate of synchronous TLB flush IPIs (e.g. the git test suite).

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Matt Turner <mattst88@xxxxxxxxx>
---
arch/alpha/include/asm/smp.h | 9 ++++++
arch/alpha/kernel/smp.c | 62 ++++++++++++++++++++++++++++++++++++
arch/alpha/mm/tlbflush.c | 3 ++
3 files changed, 74 insertions(+)

diff --git ./arch/alpha/include/asm/smp.h ./arch/alpha/include/asm/smp.h
index 2264ae72673b..8bd529376cf6 100644
--- ./arch/alpha/include/asm/smp.h
+++ ./arch/alpha/include/asm/smp.h
@@ -48,6 +48,15 @@ extern int smp_num_cpus;
extern void arch_send_call_function_single_ipi(int cpu);
extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);

+/*
+ * Global spinlock serializing all synchronous (wait=1) IPI callers.
+ * Callers must use the trylock+alpha_drain_ipi() pattern, not spin_lock(),
+ * because some call sites hold IRQs disabled and cannot rely on the RTC
+ * interrupt to rescue a lost wripir edge.
+ */
+extern spinlock_t alpha_smp_ipi_lock;
+extern void alpha_drain_ipi(void);
+
#else /* CONFIG_SMP */

#define hard_smp_processor_id() 0
diff --git ./arch/alpha/kernel/smp.c ./arch/alpha/kernel/smp.c
index ed06367ece57..d900da49b0d8 100644
--- ./arch/alpha/kernel/smp.c
+++ ./arch/alpha/kernel/smp.c
@@ -597,11 +597,61 @@ ipi_imb(void *ignored)
imb();
}

+/*
+ * Serialize all synchronous (wait=1) IPI operations to prevent cross-CPU
+ * deadlock on EV7/Marvel. If two CPUs simultaneously call any function that
+ * uses on_each_cpu(wait=1) or smp_call_function(wait=1), each blocks in
+ * csd_lock_wait spinning for the remote CPU to signal completion. While
+ * spinning, neither CPU can receive the other's IPI, so neither completion
+ * signal arrives — permanent hang.
+ *
+ * A plain spinlock (not irqsave) is intentional: a CPU that loses the lock
+ * race with IRQs enabled can still take the winner's IPI, and one with IRQs
+ * disabled services it through alpha_drain_ipi() in the trylock loop.
+ *
+ * All callers of synchronous IPIs — including migrate_flush_tlb_page in
+ * tlbflush.c — must hold this lock.
+ */
+DEFINE_SPINLOCK(alpha_smp_ipi_lock);
+
+/*
+ * Drain any pending IPIs for this CPU while spinning on alpha_smp_ipi_lock.
+ *
+ * The lock holder has already sent a wripir but is blocked in csd_lock_wait
+ * waiting for our IPI ACK. We cannot simply spin on the lock: if IRQs are
+ * disabled (e.g. caller holds a spin_lock_irqsave), no RTC interrupt will
+ * fire and the lost wripir edge is never rescued by alpha_poll_ipi_inirq.
+ *
+ * Call this from the trylock loop so the IPI is processed even with IRQs
+ * disabled, breaking the circular wait.
+ *
+ * handle_ipi() requires IRQs disabled: generic_smp_call_function_interrupt
+ * asserts lockdep_assert_irqs_disabled(). Use local_irq_save/restore so
+ * this is safe whether the caller has IRQs enabled (e.g. page fault path)
+ * or disabled (e.g. spin_lock_irqsave holder). Avoid __irq_enter_raw/
+ * __irq_exit_raw: those manipulate lockdep hardirq-context state and trigger
+ * a lockdep WARNING when called while lockdep already tracks hardirq context.
+ */
+void alpha_drain_ipi(void)
+{
+ unsigned long flags;
+
+ if (!READ_ONCE(ipi_data[smp_processor_id()].bits))
+ return;
+
+ local_irq_save(flags);
+ handle_ipi(NULL); /* regs unused in handle_ipi() */
+ local_irq_restore(flags);
+}
+
void
smp_imb(void)
{
/* Must wait other processors to flush their icache before continue. */
+ while (!spin_trylock(&alpha_smp_ipi_lock))
+ alpha_drain_ipi();
on_each_cpu(ipi_imb, NULL, 1);
+ spin_unlock(&alpha_smp_ipi_lock);
}
EXPORT_SYMBOL(smp_imb);

@@ -616,7 +666,10 @@ flush_tlb_all(void)
{
/* Although we don't have any data to pass, we do want to
synchronize with the other processors. */
+ while (!spin_trylock(&alpha_smp_ipi_lock))
+ alpha_drain_ipi();
on_each_cpu(ipi_flush_tlb_all, NULL, 1);
+ spin_unlock(&alpha_smp_ipi_lock);
}

#define asn_locked() (cpu_data[smp_processor_id()].asn_lock)
@@ -651,7 +704,10 @@ flush_tlb_mm(struct mm_struct *mm)
}
}

+ while (!spin_trylock(&alpha_smp_ipi_lock))
+ alpha_drain_ipi();
smp_call_function(ipi_flush_tlb_mm, mm, 1);
+ spin_unlock(&alpha_smp_ipi_lock);

preempt_enable();
}
@@ -702,7 +758,10 @@ flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
data.mm = mm;
data.addr = addr;

+ while (!spin_trylock(&alpha_smp_ipi_lock))
+ alpha_drain_ipi();
smp_call_function(ipi_flush_tlb_page, &data, 1);
+ spin_unlock(&alpha_smp_ipi_lock);

preempt_enable();
}
@@ -752,7 +811,10 @@ flush_icache_user_page(struct vm_area_struct *vma, struct page *page,
}
}

+ while (!spin_trylock(&alpha_smp_ipi_lock))
+ alpha_drain_ipi();
smp_call_function(ipi_flush_icache_page, mm, 1);
+ spin_unlock(&alpha_smp_ipi_lock);

preempt_enable();
}
diff --git ./arch/alpha/mm/tlbflush.c ./arch/alpha/mm/tlbflush.c
index ccbc317b9a34..37607d08796b 100644
--- ./arch/alpha/mm/tlbflush.c
+++ ./arch/alpha/mm/tlbflush.c
@@ -89,7 +89,10 @@ void migrate_flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
* This is the "combined" version of flush_tlb_mm + per-page invalidate.
*/
preempt_disable();
+ while (!spin_trylock(&alpha_smp_ipi_lock))
+ alpha_drain_ipi();
on_each_cpu(ipi_flush_mm_and_page, &d, 1);
+ spin_unlock(&alpha_smp_ipi_lock);

/*
* mimic flush_tlb_mm()'s mm_users<=1 optimization.
--
2.53.0