Re: [PATCH v4 0/4] arm64: cross-CPU NMI via SDEI
From: YinFengwei
Date: Fri Jun 26 2026 - 04:31:40 EST
Hi Kirill,
On Mon, Jun 22, 2026 at 02:56:16PM +0100, Kiryl Shutsemau wrote:
> On Fri, Jun 19, 2026 at 03:26:21PM +0100, Marc Zyngier wrote:
> > > Does your firmware set ICC_CTLR_EL1.PMHE? I'd be curious to see the
> > > numbers if the DSB was omitted on the enable path.
> >
> > I certainly don't observe this sort of overhead on the HW I have
> > access to, and would like to understand where this is coming from with
> > actual profiling data.
>
> Full disclosure: the ~66% figures come from internal testing about a year ago.
> I no longer have the details of the machine it ran on and can't confirm whether
> ICC_CTLR_EL1.PMHE was set there -- it may well have been. I shouldn't have
> carried those numbers forward without being able to stand behind them, so
> please disregard them.
>
> Here are fresh numbers from NVIDIA Grace (Neoverse V2). Importantly, this
> box reports:
>
> GICv3: Pseudo-NMIs enabled using relaxed ICC_PMR_EL1 synchronisation
>
> i.e. PMHE == 0, so the synchronising DSB on the unmask path is already
> patched to a NOP (ARM64_HAS_GIC_PRIO_RELAXED_SYNC). What's left is the
> floor cost of PMR-based masking itself plus the PMR save/restore on
> exception entry/exit -- not the DSB. So this is the case Catalin asked
> about (DSB omitted), and there is still a measurable cost.
>
> A trivial single-threaded gettid() loop (1e6 calls, median of 5,
> performance governor, ASLR off):
>
>   pseudo_nmi=0 (DAIF):  178.4 ns/call
>   pseudo_nmi=1 (PMR):   252.5 ns/call
>   delta:                +74.1 ns/call (~230-250 cycles)
>                         +41.5% wall time / 0.706 throughput
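I tested u-bench.c on a Neoverse N2 based arm64 server. The results
are as follows:

  pseudo_nmi=0 (DAIF):   96.3 ns/call
  pseudo_nmi=1 (PMR):   169.8 ns/call
  delta:                +73.5 ns/call
                        +76.3% wall time / 0.567 throughput

The absolute delta is close to your Grace number (+74.1 ns/call), even
though the baseline here is lower.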
>
> --- u-bench.c ---
> #include <unistd.h>
> #include <sys/syscall.h>
> #include <time.h>
> #include <stdio.h>
>
> int main(void) {
>         struct timespec a, b;
>
>         clock_gettime(CLOCK_MONOTONIC, &a);
>         for (long i = 0; i < 1000000; i++)
>                 syscall(SYS_gettid);
>         clock_gettime(CLOCK_MONOTONIC, &b);
>         printf("%f ns\n", (b.tv_sec-a.tv_sec)*1e9 + (b.tv_nsec-a.tv_nsec));
>         return 0;
> }
>
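For anyone reproducing this: the loop above prints the total elapsed
time for 1e6 calls, so the ns/call figures come from dividing by the
call count. A minimal sketch of how the division and the median-of-N
step could be folded into the program follows. The helper names
(ns_per_call, median, median_ns_per_call) and MAX_RUNS are made up for
illustration; they are not part of the posted u-bench.c, and this is
not necessarily how the numbers in this thread were produced.

```c
/* Sketch only: all helper names below are hypothetical. */
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define MAX_RUNS 16

/* Time `calls` back-to-back gettid() syscalls; return the mean ns/call. */
double ns_per_call(long calls)
{
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long i = 0; i < calls; i++)
                syscall(SYS_gettid);
        clock_gettime(CLOCK_MONOTONIC, &b);

        double total = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        return total / (double)calls;
}

/* qsort() comparator for doubles, ascending. */
static int cmp_double(const void *x, const void *y)
{
        double a = *(const double *)x, b = *(const double *)y;

        return (a > b) - (a < b);
}

/* Median of v[0..n-1]; sorts the array in place. */
double median(double *v, int n)
{
        assert(n > 0);
        qsort(v, (size_t)n, sizeof(*v), cmp_double);
        return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* Repeat the timing loop `runs` times (clamped to 1..MAX_RUNS), return the median. */
double median_ns_per_call(int runs, long calls)
{
        double v[MAX_RUNS];

        if (runs < 1)
                runs = 1;
        if (runs > MAX_RUNS)
                runs = MAX_RUNS;
        for (int r = 0; r < runs; r++)
                v[r] = ns_per_call(calls);
        return median(v, runs);
}
```

Calling median_ns_per_call(5, 1000000) from main() and printing the
result with "%.1f ns/call" would mirror the "1e6 calls, median of 5"
setup described above. The governor and ASLR settings still have to be
applied outside the program.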
> will-it-scale agrees independently. sched_yield (ops/s, median of 5):
>
>                    1 task      72 tasks
>   pseudo_nmi=0   3,195,656   230,824,534
>   pseudo_nmi=1   2,253,753   163,914,837
>   ratio              0.705         0.710
>
> The ratio is flat across the whole 1-to-72 sweep, so -- relevant to the
> scalability question -- it's a constant per-syscall tax, not a contention
> effect. The impact tracks syscall/exception density: page_fault1, a more
> realistic workload, stays within ~5%.
>
> > The direction of travel is to deprecate SDEI. I wouldn't add more stuff
> > on top of this interface.
>
> I understand FEAT_NMI is the long-term answer, and I'm not arguing against
> deprecating SDEI. My concern is the gap in between. By our estimate it's
> 10+ years before the last non-FEAT_NMI machine retires from the fleet --
> for scale, we're still running Skylake today. So there's roughly a
> decade where a large installed base has neither FEAT_NMI nor affordable
> pseudo-NMI, and no way to reach a DAIF-masked CPU for an all-CPU
> backtrace or to capture a wedged CPU in a crash dump. That's the
> functional gap this series tries to cover.
>
> Given the deprecation direction, I deliberately kept the SDEI footprint as
> small as I could. The series adds no new firmware interface and no vendor
> SMC -- it uses only the standard software-signalled event (event 0) via
> SDEI_EVENT_SIGNAL, which is already present on these systems for
> firmware-first RAS (APEI/GHES). And SDEI is only ever invoked in a "bad
> state": to deliver a backtrace signal to a CPU that a normal IPI can't
> reach, or to stop a CPU that ignored the stop IPIs. Nothing on any hot or
> steady-state path touches it.
>
> If even that minimal use is unacceptable on a deprecated interface, I'd
> rather know now and redirect the effort -- but I'd appreciate a pointer to
> what should cover this gap for existing silicon in the meantime.
I couldn't agree more: we need a solution for existing systems, and I'd
like to see this patchset merged. Thanks.
Regards
Yin, Fengwei
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
>