Re: [PATCH] arm64: entry: Improve the performance of system calls

From: Punit Agrawal
Date: Mon Sep 13 2021 - 19:02:19 EST


Hi Zhen Lei,

Zhen Lei <thunder.leizhen@xxxxxxxxxx> writes:

> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
> of functions from assembly to C, this greatly improves readability. But
> el0_svc()/el0_svc_compat() is in response to system call requests from
> user mode and may be in the hot path.
>
> Although the SVC is in the first case of the switch statement in C, the
> compiler optimizes the switch statement as a whole, and does not give SVC
> a small boost.
>
> Use "likely()" to help SVC directly invoke its handler after a simple
> judgment to avoid entering the switch table lookup process.
>
> After:
> 0000000000000ff0 <el0t_64_sync_handler>:
> ff0: d503245f bti c
> ff4: d503233f paciasp
> ff8: a9bf7bfd stp x29, x30, [sp, #-16]!
> ffc: 910003fd mov x29, sp
> 1000: d5385201 mrs x1, esr_el1
> 1004: 531a7c22 lsr w2, w1, #26
> 1008: f100545f cmp x2, #0x15
> 100c: 540000a1 b.ne 1020 <el0t_64_sync_handler+0x30>
> 1010: 97fffe14 bl 860 <el0_svc>
> 1014: a8c17bfd ldp x29, x30, [sp], #16
> 1018: d50323bf autiasp
> 101c: d65f03c0 ret
> 1020: f100705f cmp x2, #0x1c
>
> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
> about 10ns.
>
> Before:
> Simple syscall: 0.2365 microseconds
> Simple syscall: 0.2354 microseconds
> Simple syscall: 0.2339 microseconds
>
> After:
> Simple syscall: 0.2255 microseconds
> Simple syscall: 0.2254 microseconds
> Simple syscall: 0.2256 microseconds

I was curious about the impact of the patch on other
micro-architectures. Following are the results from using the patch
applied to v5.14 on A72 and A53 on a RK399.

For the A72 -
Before:
Simple syscall: 0.4311 microseconds
Simple syscall: 0.4311 microseconds
Simple syscall: 0.4313 microseconds

After:
Simple syscall: 0.4249 microseconds
Simple syscall: 0.4248 microseconds
Simple syscall: 0.4248 microseconds

For the A53 -
Before:
Simple syscall: 0.4130 microseconds
Simple syscall: 0.4128 microseconds
Simple syscall: 0.4124 microseconds

After:
Simple syscall: 0.4031 microseconds
Simple syscall: 0.4078 microseconds
Simple syscall: 0.4030 microseconds

Although there is a small benefit, they are not as big as on your board
/ micro-architecture.

Were you able to see any impact on real workloads?

I imagine that other code paths in the sync handler would also benefit
from similar special casing - did you try any others. Page fault
handling came to mind.

Overall, I feel a little uneasy about the special casing introduced here
but at the same time see that it does benefit certain workloads.

One more comment below.

>
> Signed-off-by: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
> ---
> arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
> asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
> {
> unsigned long esr = read_sysreg(esr_el1);
> + unsigned long ec = ESR_ELx_EC(esr);
>
> - switch (ESR_ELx_EC(esr)) {
> - case ESR_ELx_EC_SVC64:
> + if (likely(ec == ESR_ELx_EC_SVC64)) {
> el0_svc(regs);
> - break;
> + return;
> + }
> +
> + switch (ec) {
> case ESR_ELx_EC_DABT_LOW:
> el0_da(regs, esr);
> break;

Please include a big fat comment on why SVC (or any other patch) is
being separated out of the switch case - both here and below.

Thanks,
Punit

> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
> asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
> {
> unsigned long esr = read_sysreg(esr_el1);
> + unsigned long ec = ESR_ELx_EC(esr);
>
> - switch (ESR_ELx_EC(esr)) {
> - case ESR_ELx_EC_SVC32:
> + if (likely(ec == ESR_ELx_EC_SVC32)) {
> el0_svc_compat(regs);
> - break;
> + return;
> + }
> +
> + switch (ec) {
> case ESR_ELx_EC_DABT_LOW:
> el0_da(regs, esr);
> break;