Re: [PATCH] arm64: entry: Improve the performance of system calls

From: Joey Gouly
Date: Tue Sep 14 2021 - 11:18:08 EST


Hi,

On Tue, Sep 14, 2021 at 10:55:16AM +0100, Mark Rutland wrote:
> Hi,
>
> At a high-level, I'm not too keen on special-casing things unless
> necessary.
>
> I wonder if we could get similar results without special-casing by using
> a static const array of handlers indexed by the EC, since (with GCC
> 11.1.0 from the kernel.org crosstool page) that can result in code like:
>
> 0000000000001010 <el0t_64_sync_handler>:
> 1010: d503245f bti c
> 1014: d503233f paciasp
> 1018: a9bf7bfd stp x29, x30, [sp, #-16]!
> 101c: 910003fd mov x29, sp
> 1020: d5385201 mrs x1, esr_el1
> 1024: 90000002 adrp x2, 0 <el0t_64_sync_handlers>
> 1028: 531a7c23 lsr w3, w1, #26
> 102c: 91000042 add x2, x2, #:lo12:<el0t_64_sync_handlers>
> 1030: f8637842 ldr x2, [x2, x3, lsl #3]
> 1034: d63f0040 blr x2
> 1038: a8c17bfd ldp x29, x30, [sp], #16
> 103c: d50323bf autiasp
> 1040: d65f03c0 ret
>
> ... which might do better by virtue of reducing a chain of potential
> mispredicts down to a single potential mispredict, and dynamic branch
> prediction hopefully does a good job of predicting the common case at
> runtime. That said, the resulting tables will be pretty big...

I tested Mark's branch which implements this (found at
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry/switch-table)

I also took lmbench from https://github.com/intel/lmbench.git and built
`lat_syscall` with:

gcc lat_syscall.c lib_*.c -l m -o lat_syscall -static

These are the results I got from benchmarking on my MacBook Air M1, with
the following command:

./lat_syscall null &> /dev/null ; uname -a ; for i in 0 1 2 3 4 ; do ./lat_syscall null ; done

The kernel was based on arm64_defconfig that was then stripped of as much as possible.
GCC 11.1.0 from kernel.org crosstool page.
Clang build fom git b041b613e6fff713fc9ad6dbc73024286fb2fc93.

gcc:
master: 0.14300
switch-table: 0.14350
likely: 0.13962

clang:
master: 0.14354
switch-table: 0.14642
likely: 0.14256


The generated code looks similar to what Leizhen has posted, so I didn't
post it again.

So it seems the table approach actually performs worse in my testing,
and Leizhen's approach is slightly better than master (d0ee23f9d78be5531c4b055ea424ed0b489dfe9b).

Thanks,
Joey