Re: [RFC 00/30] x86: Rewrite all syscall entries except native 64-bit

From: Andy Lutomirski
Date: Thu Sep 03 2015 - 13:19:11 EST


On Wed, Sep 2, 2015 at 10:23 PM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
> On Tue, Sep 1, 2015 at 6:41 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> Here's a monster series that I'm working on. I think it's in decent
>> shape now.
>>
>> The first couple patches are tests and some old stuff. There's a
>> test that validates the vDSO AT_SYSINFO annotations (which fails on
>> 32-bit Debian systems for some reason that I can't yet fathom
>> because fast syscalls simply don't happen on my VM for unknown
>> reasons presumably related to glibc bugs or misconfiguration, and I
>> need to do something about the test). There's also a test that
>> exercises some assumptions that signal handling and ptracers make
>> about syscalls that currently do *not* hold on 64-bit AMD using
>> 32-bit AT_SYSINFO.
>>
>> The next few patches are the NT stuff. Ingo, feel free to pretend
>> you don't see it until the merge window closes :)
>>
>> The rest is basically a rewrite of syscalls for all cases except
>> 64-bit native. With these patches applied, there is a single 32-bit
>> vDSO and it uses SYSCALL, SYSENTER, and INT80 almost interchangeably
>> via alternatives. The semantics of SYSENTER and SYSCALL are defined
>> as:
>>
>> 1. If SYSCALL, ESP = ECX
>> 2. ECX = *ESP
>> 3. IP = INT80 landing pad
>> 4. Opportunistic SYSRET/SYSEXIT is enabled on return
>>
>> The vDSO is rearranged so that these semantics work. Anything that
>> backs IP up by 2 ends up pointing at a bona fide int $0x80
>> instruction with the expected regs.
>>
>> In the process, the vDSO CFI annotations (which are actually used)
>> get rewritten using normal CFI directives.
>>
>> Opportunistic SYSRET/SYSEXIT only happens on return when CS and SS
>> are as expected, IP points to the INT80 landing pad, and flags are
>> in good shape.
>
> I think the opportunistic exit code could be improved a bit more. The
> checks are only necessary if force_iret() was called, meaning
> registers were changed. One possibility is to add a ti->status flag
> TS_FASTSYSCALL. Then we could move the tests to force_iret(), which
> would clear the flag if the registers fail validation. The syscall
> exit path then would check the flag and exit via IRET if it's clear.
> That would reduce the impact of the tests on the fast path where no
> regs were changed.

Historically, it's not just force_iret() (which is quite new) but
anything that triggers the slow path. If we want to go that route,
I'd be more comfortable doing something more like:

	if (!(ti->flags & _TIF_SYSCALL_EXIT_WORK))
		return true;

i.e. just bypassing the slow path exit and the check. This might get
some more of those cycles back for the full fast path, albeit at the
cost of more complexity in the C code.
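
To make the shape of that check concrete, here is a rough user-space
sketch. Everything prefixed mock_/MOCK_ is a hypothetical stand-in
(the structs, the selector values, the forbidden-flags mask and the
landing pad argument are all assumptions for illustration), not the
real kernel definitions. It only shows the ordering: bail out early
when no exit work is pending, otherwise validate CS, SS, IP and flags
before allowing SYSRET/SYSEXIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real kernel flags and selectors differ. */
#define MOCK_TIF_SYSCALL_EXIT_WORK 0x1u
#define MOCK_USER_CS               0x23u  /* placeholder selector */
#define MOCK_USER_SS               0x2bu  /* placeholder selector */
#define MOCK_FLAGS_FORBIDDEN       0x100u /* placeholder "bad flags" mask */

struct mock_thread_info {
	uint32_t flags;
};

struct mock_regs {
	uint32_t ip, cs, ss, flags;
};

/*
 * Returns true if the fast return (SYSRET/SYSEXIT) may be used,
 * false if the exit must go through IRET.
 */
static inline bool mock_fast_exit_ok(const struct mock_thread_info *ti,
				     const struct mock_regs *regs,
				     uint32_t landing_pad)
{
	assert(ti && regs);

	/* No exit work pending: skip the slow path and the checks. */
	if (!(ti->flags & MOCK_TIF_SYSCALL_EXIT_WORK))
		return true;

	/* Something may have touched the regs: validate them. */
	return regs->cs == MOCK_USER_CS &&
	       regs->ss == MOCK_USER_SS &&
	       regs->ip == landing_pad &&
	       !(regs->flags & MOCK_FLAGS_FORBIDDEN);
}
```

The early return is the whole point: in the common case the cost is a
single flag test, and the register validation only runs when exit
work was pending.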

Maybe some day we should add better accessors for pt_regs that warn if
misused and set flags if used for write. For example, const struct
pt_regs *syscall_pt_regs_read() and struct pt_regs
*syscall_pt_regs_write(). The latter could set a flag.
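
A minimal user-space sketch of what such accessors might look like.
The mock_ types, the regs_dirty field and the function parameters are
assumptions made up for this example; only the read/write split comes
from the idea above, and the "warn if misused" part is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pt_regs. */
struct mock_pt_regs {
	unsigned long ip, sp, flags;
};

/* Hypothetical per-task state holding the saved regs and a dirty bit. */
struct mock_task {
	struct mock_pt_regs regs;
	bool regs_dirty; /* set once anyone asks for write access */
};

/* Read-only access: hands out a const pointer and leaves the flag alone. */
static inline const struct mock_pt_regs *
mock_syscall_pt_regs_read(const struct mock_task *t)
{
	assert(t != NULL);
	return &t->regs;
}

/* Write access: records that the regs may no longer be fast-path safe. */
static inline struct mock_pt_regs *
mock_syscall_pt_regs_write(struct mock_task *t)
{
	assert(t != NULL);
	t->regs_dirty = true;
	return &t->regs;
}
```

With something like this, the exit path would only need to look at
the dirty bit to decide whether the register checks are needed at
all.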

>
>> Other than that, the system call entries are simplified to the bare
>> minimum prologue and a call to a C function. Amusingly, SYSENTER
>> and SYSCALL32 use the same C function.
>>
>> To make that work, I had to remove all the 32-bit syscall stubs
>> except the clone argument hack. This is because, for C code to call
>> through the system call table, the system call table entries need to
>> be real function pointers with C-compatible ABIs.
>>
>> There is nothing at all anymore that requires that x86_32 syscalls
>> be asmlinkage. That could be removed in a subsequent patch.
>
> Other arches (at least IA-64) still need asmlinkage or something
> equivalent for their syscalls.

We should probably add a macro syscall_abi that expands to nothing on
x86 and to asmlinkage on IA-64. (Why asm "linkage"? It has nothing
to do with linkage.)
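
Roughly what I have in mind, as a user-space sketch. The
MOCK_ARCH_NEEDS_ASMLINKAGE switch, mock_asmlinkage and the mock_sys_*
functions are all invented for illustration; the point is only that
syscall_abi expands to nothing where plain C calling conventions are
enough, so the table entries are ordinary function pointers that C
code can call through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical: an arch that still needs a special calling convention
 * would define MOCK_ARCH_NEEDS_ASMLINKAGE and supply mock_asmlinkage.
 */
#ifdef MOCK_ARCH_NEEDS_ASMLINKAGE
#define syscall_abi mock_asmlinkage
#else
#define syscall_abi /* nothing: plain C ABI */
#endif

typedef long (*mock_sys_call_ptr_t)(long, long, long);

static syscall_abi long mock_sys_add(long a, long b, long c)
{
	return a + b + c;
}

static syscall_abi long mock_sys_first(long a, long b, long c)
{
	(void)b;
	(void)c;
	return a;
}

/* Table of real function pointers, callable directly from C. */
static const mock_sys_call_ptr_t mock_sys_call_table[] = {
	mock_sys_add,
	mock_sys_first,
};

static inline long mock_do_syscall(unsigned int nr, long a, long b, long c)
{
	size_t n = sizeof(mock_sys_call_table) / sizeof(mock_sys_call_table[0]);

	assert(n > 0);
	if (nr >= n)
		return -ENOSYS;
	return mock_sys_call_table[nr](a, b, c);
}
```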

>
> asmlinkage_protect() can also be removed.

Wow, that's gross. I'm a bit surprised that no new compiler has
been clever enough to break that hack.

>
>> The upshot appears to be a ~25 cycle performance hit on 32-bit fast
>> path syscalls. The slow path is probably faster under most
>> circumstances and, if the exit slow path gets hit, it'll be much
>> faster because (as we already do in the 64-bit native case) we can
>> still use SYSEXIT/SYSRET.
>>
>> The patchset is structured as a removal of the old fast syscall
>> code, then the change that makes syscalls into real functions, then
>> a clean re-implementation of fast syscalls.
>>
>> If we want some of the 25 cycles back, we could consider open-coding
>> a new C fast path.
>
> Is the 25 cycles for the compat or native case? I'd expect the native
> case to be hit harder because of register pressure.

Compat, which I find easier to benchmark because my 32-bit VM
steadfastly refuses to issue syscalls via AT_SYSINFO. (I can do it
manually, and static binaries built elsewhere work fine, but
something's wrong with its glibc.)

I'll benchmark native 32-bit soon.

--Andy