Re: [PATCH] arm64: Make CONFIG_ARM64_PSEUDO_NMI macro wrap all the pseudo-NMI code

From: Marc Zyngier
Date: Sat Jan 08 2022 - 07:51:32 EST


On Fri, 07 Jan 2022 08:55:36 +0000,
He Ying <heying24@xxxxxxxxxx> wrote:
>
> Our product has been updating its kernel from 4.4 to 5.10 recently and
> found a performance issue. We run a business test called ARP test, which
> measures the latency of ping-pong packet traffic with a certain payload.
> The results are as follows.
>
> - 4.4 kernel: avg = ~20s
> - 5.10 kernel (CONFIG_ARM64_PSEUDO_NMI is not set): avg = ~40s
>
> I have just been learning the arm64 pseudo-NMI code and have a question:
> why is the related code not wrapped by CONFIG_ARM64_PSEUDO_NMI?
> I wonder if this brings some performance regression.
>
> First, I made this patch and then ran the test again. Here are the results.
>
> - 5.10 kernel with this patch not applied: avg = ~40s
> - 5.10 kernel with this patch applied: avg = ~23s
>
> Amazing! Note that all kernels are built with CONFIG_ARM64_PSEUDO_NMI not
> set. It seems the pseudo-NMI feature actually brings some performance
> overhead even if CONFIG_ARM64_PSEUDO_NMI is not set.
>
> Furthermore, I found the feature also adds some overhead to the vmlinux
> size. I built the 5.10 kernel with and without this patch applied while
> CONFIG_ARM64_PSEUDO_NMI is not set.
>
> - 5.10 kernel with this patch not applied: vmlinux size is 384060600 bytes.
> - 5.10 kernel with this patch applied: vmlinux size is 383842936 bytes.
>
> That means the arm64 pseudo-NMI feature may add roughly 200 KB of overhead
> to the vmlinux size.
>
> To sum up, the arm64 pseudo-NMI feature brings some overhead to vmlinux size
> and performance even if the config is not set. To avoid this, wrap all the
> related code with the CONFIG_ARM64_PSEUDO_NMI macro.

This obviously attracted my attention, and I took this patch for a
ride on 5.16-rc8 on a machine that doesn't support GICv3 NMIs to make
sure that any extra code would only result in pure overhead.

There was no measurable difference with or without this patch applied,
and with CONFIG_ARM64_PSEUDO_NMI selected or not, for the workloads I tried
(I/O-heavy virtual machines, hackbench).

Mark already asked a number of questions (test case, implementation,
test on a modern kernel). Please provide as much detail as you
possibly can, because such a regression really isn't expected, and
doesn't show up on the systems I have at hand. Some profiling numbers
could also be interesting, in case this is a result of a particular
resource being thrashed (TLB, cache...).

Thanks,

M.

--
Without deviation from the norm, progress is not possible.