On Fri, Jan 07, 2022 at 03:55:36AM -0500, He Ying wrote:
> > > Our product has been updating its kernel from 4.4 to 5.10 recently and
> > > found a performance issue. We run a business test called the ARP test,
> > > which measures the latency of ping-pong packet traffic with a certain
> > > payload. The results are as follows.
> > > - 4.4 kernel: avg = ~20s
> > > - 5.10 kernel (CONFIG_ARM64_PSEUDO_NMI is not set): avg = ~40s
> >
> > Have you tested with a recent mainline kernel, e.g. v5.15?
> >
> > Is this test publicly available, and can you say which hardware (e.g. which
> > CPU implementation) you're testing with?
> > > I have just been learning the arm64 pseudo-NMI code and have a question:
> > > why is the related code not wrapped by CONFIG_ARM64_PSEUDO_NMI?
> >
> > The code in question is all patched via alternatives, and when
> > CONFIG_ARM64_PSEUDO_NMI is not selected, the code was expected to only have
> > the overhead of the regular DAIF manipulation.
>
> I don't understand alternatives very well and I'd appreciate it if you could
> explain them a bit more.
>
> > > I wonder if this brings some performance regression.
> > > First, I made this patch and then ran the test again. Here are the results.
> > > - 5.10 kernel with this patch not applied: avg = ~40s
> > > - 5.10 kernel with this patch applied: avg = ~23s
> > > Amazing! Note that all kernels are built with CONFIG_ARM64_PSEUDO_NMI not
> > > set. It seems the pseudo-NMI feature actually brings some overhead to
> > > performance even if CONFIG_ARM64_PSEUDO_NMI is not set.
> >
> > I'm surprised the overhead is so significant; as above, this is all patched
> > in, and so the overhead when this is disabled is expected to be *extremely*
> > small.
For example, when CONFIG_ARM64_PSEUDO_NMI is not selected, in
arch_local_irq_enable():

* The portion under the system_has_prio_mask_debugging() test will be removed
  entirely by the compiler, as this internally checks
  IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI).

* The assembly will be left as a write to DAIFClr. The only additional cost
  should be that of generating GIC_PRIO_IRQON into a register.

* The pmr_sync() will be removed entirely by the compiler, as it is defined
  conditionally dependent on CONFIG_ARM64_PSEUDO_NMI.

I can't spot an obvious issue with that or the other cases. In the common case
those add no new instructions, and in the worst case they only add NOPs.
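
To make that concrete, here is the function in question, abridged from
arch/arm64/include/asm/irqflags.h (the comments are mine, annotating what
remains when CONFIG_ARM64_PSEUDO_NMI is not set):

| static inline void arch_local_irq_enable(void)
| {
| 	/*
| 	 * Dead code when CONFIG_ARM64_PSEUDO_NMI is not set:
| 	 * system_has_prio_mask_debugging() internally checks
| 	 * IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI), so the compiler drops
| 	 * this whole block.
| 	 */
| 	if (system_has_prio_mask_debugging()) {
| 		u32 pmr = read_sysreg_s(SYS_ICC_PMR_EL1);
|
| 		WARN_ON_ONCE(pmr != GIC_PRIO_IRQON && pmr != GIC_PRIO_IRQOFF);
| 	}
|
| 	/*
| 	 * The DAIFClr write is the default instruction; the PMR write is
| 	 * only patched in at boot on systems where ARM64_HAS_IRQ_PRIO_MASKING
| 	 * is detected. The residual cost is generating GIC_PRIO_IRQON for
| 	 * the "r" operand.
| 	 */
| 	asm volatile(ALTERNATIVE(
| 		"msr	daifclr, #3		// arch_local_irq_enable",
| 		__msr_s(SYS_ICC_PMR_EL1, "%0"),
| 		ARM64_HAS_IRQ_PRIO_MASKING)
| 		:
| 		: "r" ((unsigned long) GIC_PRIO_IRQON)
| 		: "memory");
|
| 	/* Expands to nothing when CONFIG_ARM64_PSEUDO_NMI is not set. */
| 	pmr_sync();
| }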
> > > Furthermore, I find the feature also brings some overhead to vmlinux size.
> > > I built the 5.10 kernel with and without this patch applied, while
> > > CONFIG_ARM64_PSEUDO_NMI is not set.
> > > - 5.10 kernel with this patch not applied: vmlinux size is 384060600 bytes.
> > > - 5.10 kernel with this patch applied: vmlinux size is 383842936 bytes.
> > > That means the arm64 pseudo-NMI feature may bring ~200KB of overhead to
> > > the vmlinux size.
> >
> > I suspect that's just the (unused) alternatives, and we could improve that by
> > passing the config into the alternative blocks.
>
> I agree. Adding this ifdeffery is a bit ugly. Let's see if there are some
> better ways.
> > > In summary, the arm64 pseudo-NMI feature brings some overhead to vmlinux
> > > size and performance even if the config is not set. To avoid that, add
> > > macro control all around the related code.
> > >
> > > Signed-off-by: He Ying <heying24@xxxxxxxxxx>
> > > ---
> > >  arch/arm64/include/asm/irqflags.h | 38 +++++++++++++++++++++++++++++--
> > >  arch/arm64/kernel/entry.S         |  4 ++++
> > >  2 files changed, 40 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/irqflags.h b/arch/arm64/include/asm/irqflags.h
> > > index b57b9b1e4344..82f771b41cf5 100644
> > > --- a/arch/arm64/include/asm/irqflags.h
> > > +++ b/arch/arm64/include/asm/irqflags.h
> > > @@ -26,6 +26,7 @@
> > >   */
> > >  static inline void arch_local_irq_enable(void)
> > >  {
> > > +#ifdef CONFIG_ARM64_PSEUDO_NMI
> > >  	if (system_has_prio_mask_debugging()) {
> > >  		u32 pmr = read_sysreg_s(SYS_ICC_PMR_EL1);
> > >
> > > @@ -41,10 +42,18 @@ static inline void arch_local_irq_enable(void)
> > >  		: "memory");
> > >
> > >  	pmr_sync();
> > > +#else
> > > +	asm volatile(
> > > +		"msr	daifclr, #3		// arch_local_irq_enable"
> > > +		:
> > > +		:
> > > +		: "memory");
> > > +#endif
> >
> > I'm happy to rework this to improve matters, but I am very much not happy
> > with duplicating the logic for the !PSEUDO_NMI case. Adding more ifdeffery
> > and copies of that is not acceptable.
> >
> > Instead, can you please try changing the alternative to also take the config,
> > e.g. here have:
> >
> > | 	asm volatile(ALTERNATIVE(
> > | 		"msr	daifclr, #3		// arch_local_irq_enable",
> > | 		__msr_s(SYS_ICC_PMR_EL1, "%0"),
> > | 		ARM64_HAS_IRQ_PRIO_MASKING,
> > | 		CONFIG_ARM64_PSEUDO_NMI)
> > | 		:
> > | 		: "r" ((unsigned long) GIC_PRIO_IRQON)
> > | 		: "memory");
> >
> > ... and see if that makes a significant difference?
> >
> > Likewise for the other cases.
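
To illustrate, the same transformation applied to arch_local_irq_disable()
would look something like the below (an untested sketch: the extra
CONFIG_ARM64_PSEUDO_NMI argument is folded through IS_ENABLED(), so when the
config is off only the default DAIF instruction is emitted, with no
replacement text or .altinstructions entry left in vmlinux):

| static inline void arch_local_irq_disable(void)
| {
| 	if (system_has_prio_mask_debugging()) {
| 		u32 pmr = read_sysreg_s(SYS_ICC_PMR_EL1);
|
| 		WARN_ON_ONCE(pmr != GIC_PRIO_IRQON && pmr != GIC_PRIO_IRQOFF);
| 	}
|
| 	asm volatile(ALTERNATIVE(
| 		"msr	daifset, #3		// arch_local_irq_disable",
| 		__msr_s(SYS_ICC_PMR_EL1, "%0"),
| 		ARM64_HAS_IRQ_PRIO_MASKING,
| 		CONFIG_ARM64_PSEUDO_NMI)
| 		:
| 		: "r" ((unsigned long) GIC_PRIO_IRQOFF)
| 		: "memory");
| }

The system_has_prio_mask_debugging() test can stay as it is, since it already
compiles away via IS_ENABLED().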
> > >  #endif /* __ASM_IRQFLAGS_H */
> > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > index 2f69ae43941d..ffc32d3d909a 100644
> > > --- a/arch/arm64/kernel/entry.S
> > > +++ b/arch/arm64/kernel/entry.S
> > > @@ -300,6 +300,7 @@ alternative_else_nop_endif
> > >  	str	w21, [sp, #S_SYSCALLNO]
> > >  	.endif
> > >
> > > +#ifdef CONFIG_ARM64_PSEUDO_NMI
> > >  	/* Save pmr */
> > >  alternative_if ARM64_HAS_IRQ_PRIO_MASKING
> > >  	mrs_s	x20, SYS_ICC_PMR_EL1
> > > @@ -307,6 +308,7 @@ alternative_if ARM64_HAS_IRQ_PRIO_MASKING
> > >  	mov	x20, #GIC_PRIO_IRQON | GIC_PRIO_PSR_I_SET
> > >  	msr_s	SYS_ICC_PMR_EL1, x20
> > >  alternative_else_nop_endif
> > > +#endif
> > >
> > >  	/* Re-enable tag checking (TCO set on exception entry) */
> > >  #ifdef CONFIG_ARM64_MTE
> > > @@ -330,6 +332,7 @@ alternative_else_nop_endif
> > >  	disable_daif
> > >  	.endif
> > >
> > > +#ifdef CONFIG_ARM64_PSEUDO_NMI
> > >  	/* Restore pmr */
> > >  alternative_if ARM64_HAS_IRQ_PRIO_MASKING
> > >  	ldr	x20, [sp, #S_PMR_SAVE]
> > > @@ -339,6 +342,7 @@ alternative_if ARM64_HAS_IRQ_PRIO_MASKING
> > >  	dsb	sy	// Ensure priority change is seen by redistributor
> > >  .L__skip_pmr_sync\@:
> > >  alternative_else_nop_endif
> > > +#endif
> >
> > For these two I think the ifdeffery is fine, but I'm surprised this has a
> > measurable impact, as the alternatives should be initialized to NOPs (and
> > never modified).
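
To spell out what I mean (a simplified sketch, not the literal expansion of
the macros in arch/arm64/include/asm/alternative.h), the quoted "Save pmr"
style block assembles to roughly:

| 	// In .text: one NOP per replacement instruction. Without
| 	// CONFIG_ARM64_PSEUDO_NMI the capability is never set, so these
| 	// NOPs are never rewritten at boot:
| 	nop
| 	nop
| 	nop
|
| 	// The feature-dependent instructions sit out of line:
| 	.pushsection .altinstr_replacement, "ax"
| 	mrs_s	x20, SYS_ICC_PMR_EL1
| 	mov	x20, #GIC_PRIO_IRQON | GIC_PRIO_PSR_I_SET
| 	msr_s	SYS_ICC_PMR_EL1, x20
| 	.popsection
|
| 	// ... plus an .altinstructions entry recording where to patch.

So with the config off, the runtime cost should be a few never-patched NOPs
per entry/exit, while the out-of-line replacement text and the patching
metadata still account for the vmlinux size delta you measured.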
Thanks,
Mark.