Re: Proposal for finishing the 64-bit x86 syscall cleanup

From: Andy Lutomirski
Date: Wed Aug 26 2015 - 23:38:47 EST


On Wed, Aug 26, 2015 at 8:13 PM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
> On Wed, Aug 26, 2015 at 1:10 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Tue, Aug 25, 2015 at 10:20 PM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
>>>>>> Thing 2: vdso compilation with binutils that doesn't support .cfi directives
>>>>>>
>>>>>> Userspace debuggers really like having the vdso properly
>>>>>> CFI-annotated, and the 32-bit fast syscall entries are annotatied
>>>>>> manually in hexidecimal. AFAIK Jan Beulich is the only person who
>>>>>> understands it.
>>>>>>
>>>>>> I want to be able to change the entries a little bit to clean them up
>>>>>> (and possibly rework the SYSCALL32 and SYSENTER register tricks, which
>>>>>> currently suck), but it's really, really messy right now because of
>>>>>> the hex CFI stuff. Could we just drop the CFI annotations if the
>>>>>> binutils version is too old or even just require new enough binutils
>>>>>> to build 32-bit and compat kernels?
>>>>>
>>>>> One thing I want to do is rework the 32-bit VDSO into a single image,
>>>>> using alternatives to handle the selection of entry method. The
>>>>> open-coded CFI crap has made that near impossible to do.
>>>>>
>>>>
>>>> Yes please!
>>>>
>>>> But please don't change the actual instruction ordering at all yet,
>>>> since the SYSCALL case seems to be buggy right now.
>>>>
>>>> (If you want to be really fancy, don't use alternatives. Instead
>>>> teach vdso2c to annotate the actual dynamic table function pointers so
>>>> we can rewrite the pointers at boot time. That will save a cycle or
>>>> two.)
>>>
>>> The easiest way to select the right entry code is by changing the ELF
>>> AUX vector. That gets the normal usage, but there are two additional
>>> cases that need addressing.
>>>
>>> 1) Some code could possibly lookup the __kernel_vsyscall symbol
>>> directly and call it, but that's non-standard. If there is code out
>>> there that does this, we could update the ELF symbol table to point
>>> __kernel_vsyscall to the chosen entry point, or just remove the symbol
>>> and let the caller fall back to INT80.
>>
>> Here's an alternate proposal, which is mostly copied from what I
>> posted here yesterday afternoon:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=101061
>>
>> I think we should consider doing this:
>>
>> __kernel_vsyscall:
>> push %ecx
>> push %edx
>> movl %esp, %edx
>>
>> ALTERNATIVE (Intel with SEP):
>> sysenter
>>
>> ALTERNATIVE (AMD with SYSCALL32 on 64-bit kernels):
>> syscall
>> hlt /* keep weird binary tracers from utterly screwing up */
>>
>> ALTERNATIVE (if neither of the other cases apply):
>> nops
>>
>> movl %edx, %esp
>> movl (%esp), %edx
>> movl 8(%esp), %ecx
>> int $0x80
>> vsyscall_after_int80:
>> popl %edx
>> popl %ecx
>> ret
>
> This could interfere with sigreturn/ptrace, since if EDX or ECX are
> changed those would get overwritten by the pops from the stack.
> That's a problem with the current code too.
>

Anyone who ptraces a fast 32-bit syscall, changes ecx or edx, and
doesn't change it back is asking for serious trouble. Glibc has a
bunch of asm that sticks ebx (which is preserved across function calls
per the C ABI) into ecx or edx across syscalls with few enough args,
and, if those change, then ebx is toast.

I don't see any clean way to fix it.

>> First, in the case where we have neither SEP nor SYSCALL32, I claim
>> that this Just Works. We push a couple regs, pointlessly shuffle esp,
>> restore the regs, do int $0x80 with the same regs we started with, and
>> then (again, pointlessly) pop the regs we pushed.
>
> If neither SYSENTER or SYSCALL are supported then it should just be
> "int $0x80; ret;", nothing more. You can't assume all int80 calls
> come from the VDSO.

I don't understand. I'm talking about having __kernel_vsyscall
contain the extra prologue and epilogue code on non-SEP 32-bit
systems. This won't have any effect on bare int80 instructions.

> In fact, fork/clone cannot use the VDSO (the
> saved values on the stack are not copied to the child) and always uses
> int80 directly.

I see why it's screwed up for threads, but why is fork a problem?
Isn't the entire stack copied to the child?

In any event, if we wanted to make the vsyscall work for shared-VM
clone, we have a bigger problem than ecx and edx: the actual return
address gets zapped. Fixing *that* would be an incredible mess, and,
if anyone actually wanted to try it, they'd probably want to just add
a new entry point for clone.

>
>> Now we make the semantics of *both* syscall32 and sysenter that they
>> load edx into regs->sp, fetch regs->dx and regs->cx from memory, and
>> set regs->ip to vsyscall_after_int80. (This is a wee bit slower than
>> the current sysenter code because it does two memory fetches instead
>> of one.) Then they proceed just like int80. In particular, anything
>> that does "ip -= 2" works exactly like int80 because it points at an
>> actual int80 instruction.
>>
>> Note that sysenter's slow path already sort of works like this.
>>
>> Now we've fixed the correctness issues but we've killed performance,
>> as we'll use IRET instead of SYSRET to get back out. We can fix that
>> using opportunstic sysret. If we're returning from a compat syscall
>> that entered via sysenter or syscall32, if regs->ip ==
>> vsyscall_after_int80, regs->r11 == regs->flags, regs->ss == __USER_DS,
>> regs->cs == __USER32_CS, and flags are sane, then return using
>> SYSRETL. (The r11 check is probably unnecessary.)
>>
>> This is not quite as elegant as 64-bit opportunistic sysret, since
>> we're zapping ecx. This should be unobservable except by debuggers,
>> since we already know that we're returning to a 'pop ecx' instruction.
>>
>> NB: I don't think we can enable SYSCALL on 32-bit kernels. IIRC
>> there's no MSR_SYSCALL_MASK support, which makes the whole thing
>> basically useless, since we can't mask TF and we don't control ESP.
>
> The original implementation of SYSCALL on the K6 had a bad side
> effect. You could only return to userspace with SYSRET. It set some
> internal state that caused IRET to fault, which meant you couldn't
> context switch to another task that had taken a page fault for
> example. That made it unusable without some ugly hacks. It may work
> differently on modern AMD processors in legacy mode, but it's probably
> not worth trying to support it.
>

I don't know about that, but even the modern implementation screws up
SYSRET. If you enter via an interrupt and exit via SYSRET, SS ends up
unusable even though the selector is correct. That's the
"sysret_ss_attrs" thing that we now hack around.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/