On Mon, Jan 10, 2022 at 11:00:43AM +0800, He Ying wrote:
Hi Mark,
I'm just back from the weekend and sorry for the delayed reply.
On 2022/1/7 21:19, Mark Rutland wrote:
On Fri, Jan 07, 2022 at 03:55:36AM -0500, He Ying wrote:
Our product has been updating its kernel from 4.4 to 5.10 recently and
we found a performance issue. We run a business test called the ARP test,
which measures the latency of ping-pong packet traffic with a certain payload.
The results are as follows.
- 4.4 kernel: avg = ~20s
- 5.10 kernel (CONFIG_ARM64_PSEUDO_NMI is not set): avg = ~40s
Have you tested with a recent mainline kernel, e.g. v5.15?

Actually no, that's because this test is only available for the product
environment and
we don't have an available 5.15 kernel for it yet.
Ok; do you see anything comparable with any tests available to upstream
developers? e.g. hackbench or `perf bench sched` ?
Is this test publicly available, and can you say which hardware (e.g. which
CPU implementation) you're testing with?

Actually no. The test is only available for our product environment now. We
are testing with hisilicon 1213 (4 ARM Cortex-A72 cores).
Thanks for the CPU info; there are a number of other Cortex-A72 platforms out
there, so it might be possible to reproduce the behaviour elsewhere.
I have been just learning arm64 pseudo-NMI code and have a question,
why is the related code not wrapped by CONFIG_ARM64_PSEUDO_NMI?
The code in question is all patched via alternatives, and when
CONFIG_ARM64_PSEUDO_NMI is not selected, the code was expected to only have the
overhead of the regular DAIF manipulation.
I don't understand alternatives very well and I'd appreciate it if you could
explain it a bit more.
Code using alternatives is patched at runtime, so for:

	ALTERNATIVE(x, y, cap)
... the `x` instructions will be inline in the code, and the `y` instructions
will be placed in a separate linker section. If the `cap` capability is
detected, the `y` instructions will be copied over the `x` instructions.
So for:
| asm volatile(ALTERNATIVE(
| "msr daifclr, #3 // arch_local_irq_enable",
| __msr_s(SYS_ICC_PMR_EL1, "%0"),
| ARM64_HAS_IRQ_PRIO_MASKING)
| :
| : "r" ((unsigned long) GIC_PRIO_IRQON)
| : "memory");
... where ARM64_HAS_IRQ_PRIO_MASKING is not detected, the inline instructions
will be:
| msr daifclr, #3 // arch_local_irq_enable
... and a separate linker section will contain:
| __msr_s(SYS_ICC_PMR_EL1, "%0")
... and some metadata will be recorded in a `struct alt_instr`, which will be
unused and will take some space, but should have no runtime overhead -- the
runtime cost should only be the `msr daifclr, #3` instruction.
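For reference, the metadata recorded for each alternative looks roughly like
this (a sketch of arm64's struct alt_instr as I remember it around v5.10, so
the exact field layout may differ):

| struct alt_instr {
| 	s32 orig_offset;	/* offset to original instruction(s) */
| 	s32 alt_offset;		/* offset to replacement instruction(s) */
| 	u16 cpufeature;		/* capability the replacement depends on */
| 	u8  orig_len;		/* size of original instruction(s) */
| 	u8  alt_len;		/* size of replacement instruction(s) */
| };

Each entry is only a handful of bytes in a dedicated section, which is why
unused alternatives cost image size but not runtime.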
I wonder if this brings some performance regression.
First, I made this patch and then ran the test again. Here's the result.
- 5.10 kernel with this patch not applied: avg = ~40s
- 5.10 kernel with this patch applied: avg = ~23s
Amazing! Note that all kernels are built with CONFIG_ARM64_PSEUDO_NMI not
set. It seems the pseudo-NMI feature actually brings some performance
overhead even if CONFIG_ARM64_PSEUDO_NMI is not set.
I'm surprised the overhead is so significant; as above this is all patched in
and so the overhead when this is disabled is expected to be *extremely* small.
For example, when CONFIG_ARM64_PSEUDO_NMI is not selected, in arch_local_irq_enable():
* The portion under the system_has_prio_mask_debugging() test will be removed
entirely by the compiler, as this internally checks
IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI).
* The assembly will be left as a write to DAIFClr. The only additional cost
should be that of generating GIC_PRIO_IRQON into a register.
* The pmr_sync() will be removed entirely by the compiler, as it is defined
conditionally depending on CONFIG_ARM64_PSEUDO_NMI.
I can't spot an obvious issue with that or the other cases. In the common case
those add no new instructions, and in the worst case they only add NOPs.
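To make that concrete, here is roughly what the compiler sees for a
!CONFIG_ARM64_PSEUDO_NMI build (the v5.10 arch_local_irq_enable() quoted from
memory, so treat it as a sketch rather than the exact source):

| static inline void arch_local_irq_enable(void)
| {
| 	/* Compiled out: system_has_prio_mask_debugging() checks
| 	 * IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) internally. */
| 	if (system_has_prio_mask_debugging()) {
| 		u32 pmr = read_sysreg_s(SYS_ICC_PMR_EL1);
|
| 		WARN_ON_ONCE(pmr != GIC_PRIO_IRQON && pmr != GIC_PRIO_IRQOFF);
| 	}
|
| 	/* Only the "msr daifclr, #3" side of the alternative is emitted
| 	 * inline; the PMR write lands in the separate linker section. */
| 	asm volatile(ALTERNATIVE(
| 		"msr	daifclr, #3		// arch_local_irq_enable",
| 		__msr_s(SYS_ICC_PMR_EL1, "%0"),
| 		ARM64_HAS_IRQ_PRIO_MASKING)
| 		:
| 		: "r" ((unsigned long) GIC_PRIO_IRQON)
| 		: "memory");
|
| 	/* Compiled out: pmr_sync() is a no-op without CONFIG_ARM64_PSEUDO_NMI. */
| 	pmr_sync();
| }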
Thanks for your detailed explanation! Actually I can't understand the
result exactly.
I built two 5.10 kernel images, with and without this patch, and objdumped
them. Indeed, the disassembly of 'arch_local_irq_restore' is the same. Do
you have any ideas how we can find the root cause of why this patch improves
the performance so much?
However, the test result is trustworthy because we ran it many times and the
result is always repeatable.
Due to the large numbers, I suspect this must be due to a specific fast-path,
and it's possible that this is due to secondary factors (e.g. alignment of
code) rather than the pseudo-NMI code itself.
We need to narrow down *where* time is being spent. Since it appears that this
is related to the local IRQ state management, it isn't likely that we can
determine that reliably with the PMU. Given that, I think the first step is to
reproduce the result elsewhere, for which we will need some publicly available
test-case.
I got it. Thanks.
Furthermore, I find the feature also brings some overhead to vmlinux size.
I build 5.10 kernel with this patch applied or not while
CONFIG_ARM64_PSEUDO_NMI is not set.
- 5.10 kernel with this patch not applied: vmlinux size is 384060600 Bytes.
- 5.10 kernel with this patch applied: vmlinux size is 383842936 Bytes.
That means the arm64 pseudo-NMI feature may bring ~200KB of overhead to
vmlinux size.
I suspect that's just the (unused) alternatives, and we could improve that by
passing the config into the alternative blocks.
Do you mean the sections generated by the alternatives? I don't understand
alternatives very well and I'd appreciate it if you could explain it a bit
more.
Yes; I meant the sections generated to hold the alternatives code and the
alternatives metadata.
Above all, the arm64 pseudo-NMI feature brings some overhead to vmlinux size
and performance even if the config is not set. To avoid this, add macro
control all around the related code.
Signed-off-by: He Ying <heying24@xxxxxxxxxx>
---
arch/arm64/include/asm/irqflags.h | 38 +++++++++++++++++++++++++++++--
arch/arm64/kernel/entry.S | 4 ++++
2 files changed, 40 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/irqflags.h b/arch/arm64/include/asm/irqflags.h
index b57b9b1e4344..82f771b41cf5 100644
--- a/arch/arm64/include/asm/irqflags.h
+++ b/arch/arm64/include/asm/irqflags.h
@@ -26,6 +26,7 @@
*/
static inline void arch_local_irq_enable(void)
{
+#ifdef CONFIG_ARM64_PSEUDO_NMI
if (system_has_prio_mask_debugging()) {
u32 pmr = read_sysreg_s(SYS_ICC_PMR_EL1);
@@ -41,10 +42,18 @@ static inline void arch_local_irq_enable(void)
: "memory");
pmr_sync();
+#else
+ asm volatile(
+ "msr daifclr, #3 // arch_local_irq_enable"
+ :
+ :
+ : "memory");
+#endif
I'm happy to rework this to improve matters, but I am very much not happy with
duplicating the logic for the !PSEUDO_NMI case. Adding more ifdeffery and
copies of that is not acceptable.
I agree. Adding this ifdeffery is a bit ugly. Let's see if there are some
better ways.
Instead, can you please try changing the alternative to also take the config,
e.g. here have:
| asm volatile(ALTERNATIVE(
| "msr daifclr, #3 // arch_local_irq_enable",
| __msr_s(SYS_ICC_PMR_EL1, "%0"),
| ARM64_HAS_IRQ_PRIO_MASKING,
| CONFIG_ARM64_PSEUDO_NMI)
| :
| : "r" ((unsigned long) GIC_PRIO_IRQON)
| : "memory");
... and see if that makes a significant difference?
Likewise for the other cases.
OK, I'll try it. But I have some questions. Here's the comment of
ALTERNATIVE:
/*
* Usage: asm(ALTERNATIVE(oldinstr, newinstr, feature));
*
* Usage: asm(ALTERNATIVE(oldinstr, newinstr, feature, CONFIG_FOO));
* N.B. If CONFIG_FOO is specified, but not selected, the whole block
* will be omitted, including oldinstr.
*/
If CONFIG_FOO is not selected, the whole block will be omitted including
oldinstr.
But we still want the oldinstr in this situation. Do I misunderstand
something?
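To make that concrete (this is just my reading of the comment above, so I may
be wrong): with the config argument added and CONFIG_ARM64_PSEUDO_NMI=n, the
whole asm statement would be omitted, so arch_local_irq_enable() would
effectively become:

static inline void arch_local_irq_enable(void)
{
	/*
	 * Neither the PMR write nor the "msr daifclr, #3" fallback is
	 * emitted, so IRQs would never actually be unmasked.
	 */
}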
Sorry; you are right, and my suggestion was broken.

I'll need to have a think about this; we might be able to rework this to use a
static key instead, but IIRC last time I tried there were issues with that
approach. I have some (old) work-in-progress patches at:
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/daif-rework
I suspect I won't have time to renew that in the near future, but an approach
like that might be worthwhile.
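For illustration only, a static-key shape might look something like the sketch
below; this is not taken from that branch, and the key name here is made up:

| /* Hypothetical sketch, not the actual implementation. */
| DECLARE_STATIC_KEY_FALSE(gic_prio_masking_key);
|
| static inline void arch_local_irq_enable(void)
| {
| 	if (IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) &&
| 	    static_branch_unlikely(&gic_prio_masking_key)) {
| 		/* Unmask by raising the GIC priority mask. */
| 		write_sysreg_s(GIC_PRIO_IRQON, SYS_ICC_PMR_EL1);
| 		pmr_sync();
| 	} else {
| 		/* Plain DAIF manipulation, as on !PSEUDO_NMI kernels. */
| 		asm volatile("msr daifclr, #3" : : : "memory");
| 	}
| }

With something like that, a !CONFIG_ARM64_PSEUDO_NMI build would compile down
to just the DAIF write, with no alternatives metadata left behind.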
Thanks,
Mark.