Re: [PATCH 0/3] alpha SMP fixes for EV7/Marvel
From: Magnus Lindholm
Date: Sun May 31 2026 - 04:26:25 EST
On Sat, May 30, 2026 at 10:25 PM Matt Turner <mattst88@xxxxxxxxx> wrote:
>
> I acquired an AlphaServer ES47 in 2010, and it's never been stable --
> deadlocking after random amounts of time. I could never make any
> connections with load, uptime, etc.
>
> The only dots I could connect was that the git test suite would always
> trigger the deadlock.
>
> I spent some time over the last week playing with Claude and have found
> *a* solution. With the first two patches in place, I've successfully run
> the git test suite 6 times in a row. I've never previously seen it run
> successfully without deadlocking the system.
>
> The first patch is generally applicable (not specific to EV7/Marvel).
> I'm unsure why this would never have caused problems on other systems
> (or why it would only be relevant for EV7/Marvel). That gives me some
> pause.
>
> The second patch applies only to EV7/Marvel, I believe. tl;dr: IPIs seem
> to be lost.
>
> The third patch adds some accounting to /proc/interrupts to report the
> number of lost interrupts, confirming the problem from patch 2.
>
> Please review.
>
> Matt
>
Hi Matt,
Thanks for working on this. This is very impressive work, and it looks like
you're close to nailing down some long-standing bugs and making the Marvel
platform a lot more usable with SMP kernels. The lost-edge IPI diagnosis looks
plausible, but I hit a few issues while reviewing/testing the series.
First, after applying the series I hit a build failure. Patch 1 adds:
extern spinlock_t alpha_smp_ipi_lock;
to arch/alpha/include/asm/smp.h, but that header can be included before
spinlock_t is defined, e.g. while building kernel/sched/rq-offsets.s:
arch/alpha/include/asm/smp.h:60:8: error: unknown type name 'spinlock_t'
Including <linux/spinlock_types.h> from asm/smp.h, or avoiding exposing
spinlock_t from that early header, fixes that part.
Patch 2 also appears not to be buildable independently: it updates
cpu_data[].rescued_{reschedule,call_func,cpu_stop}_count, but those fields are
only introduced in patch 3. Please either move the struct additions into patch
2, move the accounting into patch 3, or squash those patches.
I also wonder if alpha_drain_ipi() should disable interrupts before looking at
the per-CPU IPI word. That would avoid reading ipi_data[smp_processor_id()].bits
before local IRQs are disabled, and would keep the CPU lookup and pending-bit
check in the same IRQ-disabled section:
local_irq_save(flags);
cpu = smp_processor_id();
if (READ_ONCE(ipi_data[cpu].bits))
handle_ipi(NULL);
local_irq_restore(flags);
That looks safer than reading ipi_data[smp_processor_id()].bits before
local_irq_save().
On the design side, patch 1 says it serializes all synchronous IPI operations,
but it seems to only wrap the Alpha arch TLB/icache/IMB users. Either the commit
message should narrow that claim, or the serialization needs to live lower in
the IPI/call-function path. The patch seems to do: "serialize a subset of Alpha
arch synchronous IPI users, mainly TLB/cache/IMB flushes"
Also, the series does not apply cleanly to current v7.1-rc1 directly.
It appears to
depend on the Alpha GENERIC_ENTRY series:
Link: https://lore.kernel.org/linux-alpha/20260529142322.1362438-1-linmag7@xxxxxxxxx/T/#t
which is still under review and not in mainline yet. Please mention
that dependency
in the cover letter and include the base commit and/or a lore link to
the prerequisite
series.
Finally, this adds a global spin_trylock()/spin_unlock() around hot paths such
as migrate_flush_tlb_page(). That has no impact on non-Alpha architectures, but
it serializes these operations for all Alpha SMP systems, while the bug
description is EV7/Marvel/IO7-specific. Can this be justified for non-EV7
systems, or gated to the affected platform?
Thanks,
Magnus