Re: [PATCH] arm64: Make CONFIG_ARM64_PSEUDO_NMI macro wrap all the pseudo-NMI code

From: He Ying
Date: Sun Jan 09 2022 - 22:21:35 EST


Hi Marc,

I'm just back from the weekend and sorry for the delayed reply.


On 2022/1/8 20:51, Marc Zyngier wrote:
> On Fri, 07 Jan 2022 08:55:36 +0000,
> He Ying <heying24@xxxxxxxxxx> wrote:
>> Our product has been updating its kernel from 4.4 to 5.10 recently and
>> found a performance issue. We run a business test called the ARP test,
>> which measures the latency of ping-pong packet traffic with a certain
>> payload. The results are as follows.
>>
>> - 4.4 kernel: avg = ~20s
>> - 5.10 kernel (CONFIG_ARM64_PSEUDO_NMI is not set): avg = ~40s
>>
>> I have just been learning the arm64 pseudo-NMI code and have a question:
>> why is the related code not wrapped in CONFIG_ARM64_PSEUDO_NMI?
>> I wonder if this brings some performance regression.
>>
>> First, I made this patch and then ran the test again. Here are the results.
>>
>> - 5.10 kernel with this patch not applied: avg = ~40s
>> - 5.10 kernel with this patch applied: avg = ~23s
>>
>> Amazing! Note that all kernels are built with CONFIG_ARM64_PSEUDO_NMI
>> not set. It seems the pseudo-NMI feature actually brings some performance
>> overhead even if CONFIG_ARM64_PSEUDO_NMI is not set.
>>
>> Furthermore, I find the feature also adds some overhead to the vmlinux
>> size. I built the 5.10 kernel with and without this patch applied while
>> CONFIG_ARM64_PSEUDO_NMI is not set.
>>
>> - 5.10 kernel with this patch not applied: vmlinux size is 384060600 bytes.
>> - 5.10 kernel with this patch applied: vmlinux size is 383842936 bytes.
>>
>> That means the arm64 pseudo-NMI feature may add ~200KB to the vmlinux
>> size.
>>
>> In summary, the arm64 pseudo-NMI feature adds some overhead to vmlinux
>> size and performance even if the config is not set. To avoid this, wrap
>> all the related code with the config macro.
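For what it's worth, the exact delta between the two vmlinux sizes quoted above works out to about 212 KiB:

```shell
# Difference between the two vmlinux sizes reported above
delta=$((384060600 - 383842936))
echo "$delta bytes"           # 217664 bytes
echo "$((delta / 1024)) KiB"  # ~212 KiB
```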
> This obviously attracted my attention, and I took this patch for a
> ride on 5.16-rc8 on a machine that doesn't support GICv3 NMIs to make
> sure that any extra code would only result in pure overhead.
>
> There was no measurable difference with this patch applied or not,
> with CONFIG_ARM64_PSEUDO_NMI selected or not for the workloads I tried
> (I/O heavy virtual machines, hackbench).
Our test is some kind of network test.

> Mark already asked a number of questions (test case, implementation,
> test on a modern kernel). Please provide as much detail as you
> possibly can, because such a regression really isn't expected, and
> doesn't show up on the systems I have at hand. Some profiling numbers
> could also be interesting, in case this is a result of a particular
> resource being thrashed (TLB, cache...).

I replied to Mark a few moments ago and provided as many details as I can.

You mentioned that the TLB and cache could be thrashed. How can we check
for this? By using perf tools?
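As a starting point, and assuming perf is available on the target and its PMU exposes the generic cache/TLB events (names vary by platform; `perf list` shows what is supported), something like the following could be run while the ARP test reproduces the regression:

```shell
# Sketch: count cache and TLB misses system-wide for 10 seconds
# while the test traffic is running (event availability depends on
# the CPU's PMU; check `perf list` first)
perf stat -e cache-misses,cache-references,dTLB-load-misses,iTLB-load-misses \
    -a -- sleep 10

# Or profile the hot code paths directly with call graphs:
perf record -a -g -- sleep 10
perf report --sort symbol
```

Comparing these counters between kernels with and without the patch (both with CONFIG_ARM64_PSEUDO_NMI unset) should show whether a particular resource is being thrashed.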


> Thanks,
>
> M.