Re: [PATCH v2 00/36] x86: Rewrite all syscall entries except native 64-bit
From: Ingo Molnar
Date: Fri Oct 09 2015 - 09:07:07 EST
* Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> The first two patches are optimizations that I'm surprised we didn't
> already have. I noticed them when I was looking at the generated
> asm.
>
> The next two patches are tests and some old stuff. There's a test
> that validates the vDSO AT_SYSINFO annotations. There's also a test
> that exercises some assumptions that signal handling and ptracers
> make about syscalls that currently do *not* hold on 64-bit AMD using
> 32-bit AT_SYSINFO.
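
To illustrate the kind of assumption that second test pokes at: a ptracer
expects to be able to stop a tracee at syscall entry, rewrite the syscall
number and arguments with PTRACE_SETREGS, and have the rewritten call be
what actually runs, with nothing clobbered behind its back. A minimal
sketch of that pattern (not the actual selftest; error handling omitted):

#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t child = fork();

        if (child == 0) {
                ptrace(PTRACE_TRACEME, 0, 0, 0);
                raise(SIGSTOP);                 /* wait for the tracer */
                syscall(SYS_getpid);            /* the tracer rewrites this */
                _exit(0);
        }

        int status;
        waitpid(child, &status, 0);             /* child stopped itself */

        ptrace(PTRACE_SYSCALL, child, 0, 0);    /* run to the next syscall entry */
        waitpid(child, &status, 0);

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, 0, &regs);
#ifdef __x86_64__
        regs.orig_rax = SYS_getppid;            /* change the syscall number */
#else
        regs.orig_eax = SYS_getppid;
#endif
        ptrace(PTRACE_SETREGS, child, 0, &regs);

        ptrace(PTRACE_CONT, child, 0, 0);       /* getppid runs instead of getpid */
        waitpid(child, &status, 0);
        return 0;
}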
>
> The next three patches are NT cleanups and a lockdep cleanup.
>
> It may pay to apply the beginning of the series (at most through
> "x86/entry/64/compat: After SYSENTER, move STI after the NT fixup")
> without waiting for everyone to wrap their heads around the rest.
>
> The rest is basically a rewrite of syscalls for all cases except
> 64-bit native. With these patches applied, there is a single 32-bit
> vDSO and it uses SYSCALL, SYSENTER, and INT80 almost interchangeably
> via alternatives. The semantics of SYSENTER and SYSCALL are defined
> as:
>
> 1. If SYSCALL, ESP = ECX
> 2. ECX = *ESP
> 3. IP = INT80 landing pad
> 4. Opportunistic SYSRET/SYSEXIT is enabled on return
>
> The vDSO is rearranged so that these semantics work. Anything that
> backs IP up by 2 ends up pointing at a bona fide int $0x80
> instruction with the expected regs.
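
On the kernel side, steps 2 and 3 amount to only a few lines of C. A rough
sketch (not the code from the series: the function names and the
landing-pad lookup are placeholders, and step 1 is a SYSCALL-only detail
that I assume the asm prologue takes care of before this point):

/* Shared C body for the SYSENTER and SYSCALL32 entries; a sketch only. */
static bool fast_syscall_32(struct pt_regs *regs)
{
        /* Placeholder for however the vDSO int $0x80 address gets looked up. */
        unsigned long landing_pad = vdso32_int80_landing_pad(current);

        /* 3. IP = INT80 landing pad: backing IP up by 2 hits a real int $0x80. */
        regs->ip = landing_pad;

        /* 2. ECX = *ESP: the vDSO stub presumably saved ECX on the user stack. */
        if (get_user(*(u32 *)&regs->cx,
                     (u32 __user *)(unsigned long)(u32)regs->sp)) {
                regs->ax = -EFAULT;     /* bogus user stack pointer */
                return false;           /* exit via IRET, no fast return */
        }

        do_syscall_32(regs);            /* common 32-bit dispatch, sketched below */

        /* 4. Tell the asm whether an opportunistic SYSRET/SYSEXIT is allowed. */
        return fast_exit_allowed(regs, landing_pad);
}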
>
> In the process, the vDSO CFI annotations (which are actually used)
> get rewritten using normal CFI directives.
>
> Opportunistic SYSRET/SYSEXIT only happens on return when CS and SS
> are as expected, IP points to the INT80 landing pad, and flags are
> in good shape. (There is no longer any assumption that full
> fast-path 32-bit syscalls don't muck with the registers that matter
> for fast exits -- I played with maintaining an optimization like
> that with poor results. I may try again if it saves a few cycles.)
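
In C terms that eligibility check presumably boils down to something like
this (the CS/SS/IP conditions are straight from the description above; the
exact set of flag bits tested is my guess):

/* Sketch of the opportunistic-fast-exit test; used by the sketch above. */
static bool fast_exit_allowed(struct pt_regs *regs, unsigned long landing_pad)
{
        return regs->cs == __USER32_CS && regs->ss == __USER_DS &&
               regs->ip == landing_pad &&
               (regs->flags & (X86_EFLAGS_TF | X86_EFLAGS_RF)) == 0;
}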
>
> Other than that, the system call entries are simplified to the bare
> minimum prologue and a call to a C function. Amusingly, SYSENTER
> and SYSCALL32 use the same C function.
>
> To make that work, I had to remove all the 32-bit syscall stubs
> except the clone argument hack. This is because, for C code to call
> through the system call table, the system call table entries need to
> be real function pointers with C-compatible ABIs.
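
Concretely, that means the 32-bit table can be declared with a real C type
and indexed like an ordinary array of function pointers. Roughly (a sketch:
tracing/audit hooks and the clone argument hack are left out, and the body
of do_syscall_32 below is my guess at its shape, not the patch itself):

typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long,
                                          unsigned long, unsigned long,
                                          unsigned long, unsigned long);
extern const sys_call_ptr_t ia32_sys_call_table[];

/* The common 32-bit dispatch, called once pt_regs has been set up. */
static void do_syscall_32(struct pt_regs *regs)
{
        unsigned int nr = regs->orig_ax;

        if (nr < IA32_NR_syscalls) {
                /* 32-bit syscall argument order: ebx, ecx, edx, esi, edi, ebp. */
                regs->ax = ia32_sys_call_table[nr](
                        (unsigned int)regs->bx, (unsigned int)regs->cx,
                        (unsigned int)regs->dx, (unsigned int)regs->si,
                        (unsigned int)regs->di, (unsigned int)regs->bp);
        } else {
                regs->ax = -ENOSYS;
        }
}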
>
> There is nothing at all anymore that requires that x86_32 syscalls
> be asmlinkage. That could be removed in a subsequent patch.
>
> The upshot appears to be a ~16 cycle performance hit on 32-bit fast
> path syscalls. (On my system, my little prctl test takes 172 cycles
> before and 188 cycles with these patches applied.)
>
> The slow path is probably faster under most circumstances and, if
> the exit slow path gets hit, it'll be much faster because (as we
> already do in the 64-bit native case) we can still use
> SYSEXIT/SYSRET.
>
> The patchset is structured as a removal of the old fast syscall
> code, then the change that makes syscalls into real functions, then
> a clean re-implementation of fast syscalls.
>
> If we want some of those cycles back, we could consider open-coding
> a new C fast path.
>
> Changes from v1:
> - The unwind_vdso_32 test now warns on broken Debian installations
>   instead of failing. The problem is now fully understood; it will
>   be fixed by Debian and possibly also by upstream glibc.
> - execve was rather broken in v1.
> - It's quite a bit faster now (the optimizations at the end are mostly new).
> - int80 on 64-bit no longer clobbers extra regs (thanks Denys!).
> - The uaccess stuff is new.
> - Lots of other things that I forgot, I'm sure.
>
> Andy Lutomirski (36):
> x86/uaccess: Tell the compiler that uaccess is unlikely to fault
> x86/uaccess: __chk_range_not_ok is unlikely to return true
> selftests/x86: Add a test for vDSO unwinding
> selftests/x86: Add a test for syscall restart and arg modification
> x86/entry/64/compat: Fix SYSENTER's NT flag before user memory access
> x86/entry: Move lockdep_sys_exit to prepare_exit_to_usermode
> x86/entry/64/compat: After SYSENTER, move STI after the NT fixup
> x86/vdso: Remove runtime 32-bit vDSO selection
> x86/asm: Re-add manual CFI infrastructure
> x86/vdso: Define BUILD_VDSO while building and emit .eh_frame in asm
> x86/vdso: Replace hex int80 CFI annotations with gas directives
> x86/elf/64: Clear more registers in elf_common_init
> x86/vdso/32: Save extra registers in the INT80 vsyscall path
> x86/entry/64/compat: Disable SYSENTER and SYSCALL32 entries
> x86/entry/64/compat: Remove audit optimizations
> x86/entry/64/compat: Remove most of the fast system call machinery
> x86/entry/64/compat: Set up full pt_regs for all compat syscalls
> x86/entry/syscalls: Move syscall table declarations into asm/syscalls.h
> x86/syscalls: Give sys_call_ptr_t a useful type
> x86/entry: Add do_syscall_32, a C function to do 32-bit syscalls
> x86/entry/64/compat: Migrate the body of the syscall entry to C
> x86/entry: Add C code for fast system call entries
> x86/vdso/compat: Wire up SYSENTER and SYSCALL for compat userspace
> x86/entry/compat: Implement opportunistic SYSRETL for compat syscalls
> x86/entry/32: Open-code return tracking from fork and kthreads
> x86/entry/32: Switch INT80 to the new C syscall path
> x86/entry/32: Re-implement SYSENTER using the new C path
> x86/asm: Remove thread_info.sysenter_return
> x86/entry: Remove unnecessary IRQ twiddling in fast 32-bit syscalls
> x86/entry: Make irqs_disabled checks in exit code depend on lockdep
> x86/entry: Force inlining of 32-bit syscall code
> x86/entry: Micro-optimize compat fast syscall arg fetch
> x86/entry: Hide two syscall entry assertions behind CONFIG_DEBUG_ENTRY
> x86/entry: Use pt_regs_to_thread_info() in syscall entry tracing
> x86/entry: Split and inline prepare_exit_to_usermode
> x86/entry: Split and inline syscall_return_slowpath
>
> arch/x86/Makefile | 10 +-
> arch/x86/entry/common.c | 255 ++++++++--
> arch/x86/entry/entry_32.S | 184 +++----
> arch/x86/entry/entry_64.S | 9 +-
> arch/x86/entry/entry_64_compat.S | 541 +++++----------------
> arch/x86/entry/syscall_32.c | 9 +-
> arch/x86/entry/syscall_64.c | 4 +-
> arch/x86/entry/syscalls/syscall_32.tbl | 12 +-
> arch/x86/entry/vdso/Makefile | 39 +-
> arch/x86/entry/vdso/vdso2c.c | 2 +-
> arch/x86/entry/vdso/vdso32-setup.c | 28 +-
> arch/x86/entry/vdso/vdso32/int80.S | 56 ---
> arch/x86/entry/vdso/vdso32/syscall.S | 75 ---
> arch/x86/entry/vdso/vdso32/sysenter.S | 116 -----
> arch/x86/entry/vdso/vdso32/system_call.S | 57 +++
> arch/x86/entry/vdso/vma.c | 13 +-
> arch/x86/ia32/ia32_signal.c | 4 +-
> arch/x86/include/asm/dwarf2.h | 177 +++++++
> arch/x86/include/asm/elf.h | 10 +-
> arch/x86/include/asm/syscall.h | 14 +-
> arch/x86/include/asm/thread_info.h | 1 -
> arch/x86/include/asm/uaccess.h | 14 +-
> arch/x86/include/asm/vdso.h | 10 +-
> arch/x86/kernel/asm-offsets.c | 3 -
> arch/x86/kernel/signal.c | 4 +-
> arch/x86/um/sys_call_table_32.c | 7 +-
> arch/x86/um/sys_call_table_64.c | 7 +-
> arch/x86/xen/setup.c | 13 +-
> tools/testing/selftests/x86/Makefile | 5 +-
> tools/testing/selftests/x86/ptrace_syscall.c | 294 +++++++++++
> .../testing/selftests/x86/raw_syscall_helper_32.S | 46 ++
> tools/testing/selftests/x86/unwind_vdso.c | 209 ++++++++
> 32 files changed, 1258 insertions(+), 970 deletions(-)
> delete mode 100644 arch/x86/entry/vdso/vdso32/int80.S
> delete mode 100644 arch/x86/entry/vdso/vdso32/syscall.S
> delete mode 100644 arch/x86/entry/vdso/vdso32/sysenter.S
> create mode 100644 arch/x86/entry/vdso/vdso32/system_call.S
> create mode 100644 arch/x86/include/asm/dwarf2.h
> create mode 100644 tools/testing/selftests/x86/ptrace_syscall.c
> create mode 100644 tools/testing/selftests/x86/raw_syscall_helper_32.S
> create mode 100644 tools/testing/selftests/x86/unwind_vdso.c
Ok, so I applied all of them to tip:x86/asm, in two phases, with small
(stylistic) edits. It all seems to work fine for me so far, so I pushed it
all out to -tip and linux-next.
Thanks,
Ingo