Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
From: Andy Lutomirski
Date: Wed Jun 17 2015 - 10:24:20 EST
On Wed, Jun 17, 2015 at 3:32 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
>> The main things that are missing are that I haven't done the 32-bit parts
>> (anyone want to help?) and therefore I haven't deleted the old C code. I also
>> think this may break UML for trivial reasons.
>
> So I'd suggest moving most of the SYSRET fast path to C too.
>
> This is how it looks now after your patches:
>
> testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> jnz tracesys
> entry_SYSCALL_64_fastpath:
> #if __SYSCALL_MASK == ~0
> cmpq $__NR_syscall_max, %rax
> #else
> andl $__SYSCALL_MASK, %eax
> cmpl $__NR_syscall_max, %eax
> #endif
> ja 1f /* return -ENOSYS (already in pt_regs->ax) */
> movq %r10, %rcx
> call *sys_call_table(, %rax, 8)
> movq %rax, RAX(%rsp)
> 1:
> /*
> * Syscall return path ending with SYSRET (fast path).
> * Has incompletely filled pt_regs.
> */
> LOCKDEP_SYS_EXIT
> /*
> * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
> * it is too small to ever cause noticeable irq latency.
> */
> DISABLE_INTERRUPTS(CLBR_NONE)
>
> /*
> * We must check ti flags with interrupts (or at least preemption)
> * off because we must *never* return to userspace without
> * processing exit work that is enqueued if we're preempted here.
> * In particular, returning to userspace with any of the one-shot
> * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
> * very bad.
> */
> testl $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
> jnz int_ret_from_sys_call_irqs_off /* Go to the slow path */
>
> Most of that can be done in C.
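
For illustration only, here is a rough user-space sketch of how the quoted fast path might read as C: bounds-check the syscall number, dispatch through a table, then report whether exit work is pending. Every name in it (regs_sketch, sys_fn_sketch, syscall_fastpath_sketch, ALLWORK_MASK_SKETCH, the demo table) is hypothetical and not taken from the kernel tree. It skips the __SYSCALL_MASK variant, and it assumes the caller has already stored -ENOSYS in the ax slot and sampled the thread flags with interrupts off.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the part of pt_regs the fast path touches. */
struct regs_sketch {
	uint64_t ax;       /* return value slot, preloaded with -ENOSYS */
	uint64_t orig_ax;  /* syscall number as passed by userspace */
	uint64_t di, si, dx, r10, r8, r9;
};

/* Hypothetical syscall handler type: six register-sized arguments. */
typedef int64_t (*sys_fn_sketch)(uint64_t, uint64_t, uint64_t,
				 uint64_t, uint64_t, uint64_t);

/* Placeholder mask; the real _TIF_ALLWORK_MASK value is not assumed here. */
#define ALLWORK_MASK_SKETCH 0x0fu

/* Demo handler and one-entry table so the sketch can be exercised. */
static int64_t sys_add_sketch(uint64_t a, uint64_t b, uint64_t c,
			      uint64_t d, uint64_t e, uint64_t f)
{
	(void)c; (void)d; (void)e; (void)f;
	return (int64_t)(a + b);
}

static const sys_fn_sketch demo_table_sketch[] = { sys_add_sketch };
#define DEMO_NR_MAX_SKETCH 0u

/*
 * Returns nonzero if the caller must take the slow exit path,
 * zero if a plain fast return is acceptable.
 */
static int syscall_fastpath_sketch(struct regs_sketch *regs,
				   const sys_fn_sketch *table,
				   uint64_t nr_max, uint32_t ti_flags)
{
	uint64_t nr = regs->orig_ax;

	assert(table != NULL);

	/* Unsigned compare, like "ja 1f": out-of-range numbers skip the call. */
	if (nr <= nr_max)
		regs->ax = (uint64_t)table[nr](regs->di, regs->si, regs->dx,
					       regs->r10, regs->r8, regs->r9);
	/* Otherwise regs->ax still holds the preloaded -ENOSYS. */

	/* Any pending work bit forces the slow path. */
	return (ti_flags & ALLWORK_MASK_SKETCH) != 0;
}
```

The asm stub would then only need to branch on the return value, in place of the testl/jnz pair above.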
>
> And I think we could convert the IRET syscall return slow path to C too:
>
> GLOBAL(int_ret_from_sys_call)
> SAVE_EXTRA_REGS
> movq %rsp, %rdi
> call syscall_return_slowpath /* returns with IRQs disabled */
> RESTORE_EXTRA_REGS
>
> /*
> * Try to use SYSRET instead of IRET if we're returning to
> * a completely clean 64-bit userspace context.
> */
> movq RCX(%rsp), %rcx
> movq RIP(%rsp), %r11
> cmpq %rcx, %r11 /* RCX == RIP */
> jne opportunistic_sysret_failed
>
> /*
> * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
> * in kernel space. This essentially lets the user take over
> * the kernel, since userspace controls RSP.
> *
> * If width of "canonical tail" ever becomes variable, this will need
> * to be updated to remain correct on both old and new CPUs.
> */
> .ifne __VIRTUAL_MASK_SHIFT - 47
> .error "virtual address width changed -- SYSRET checks need update"
> .endif
>
> /* Change top 16 bits to be the sign-extension of 47th bit */
> shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>
> /* If this changed %rcx, it was not canonical */
> cmpq %rcx, %r11
> jne opportunistic_sysret_failed
>
> cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
> jne opportunistic_sysret_failed
>
> movq R11(%rsp), %r11
> cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
> jne opportunistic_sysret_failed
>
> /*
> * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
> * restoring TF results in a trap from userspace immediately after
> * SYSRET. This would cause an infinite loop whenever #DB happens
> * with register state that satisfies the opportunistic SYSRET
> * conditions. For example, single-stepping this user code:
> *
> * movq $stuck_here, %rcx
> * pushfq
> * popq %r11
> * stuck_here:
> *
> * would never get past 'stuck_here'.
> */
> testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
> jnz opportunistic_sysret_failed
>
> /* nothing to check for RSP */
>
> cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
> jne opportunistic_sysret_failed
>
> /*
> * We win! This label is here just for ease of understanding
> * perf profiles. Nothing jumps here.
> */
> syscall_return_via_sysret:
> /* rcx and r11 are already restored (see code above) */
> RESTORE_C_REGS_EXCEPT_RCX_R11
> movq RSP(%rsp), %rsp
> USERGS_SYSRET64
>
> opportunistic_sysret_failed:
> SWAPGS
> jmp restore_c_regs_and_iret
> END(entry_SYSCALL_64)
>
>
> Basically there would be a single C function we'd call, which returns a condition
> (or fixes up its return address on the stack directly) to determine between the
> SYSRET and IRET return paths.
>
> Moving this to C too has immediate benefits: that way we could easily add
> instrumentation to see how efficient these various return methods are, etc.
>
> I.e. I don't think there's two ways about this: once the entry code moves to the
> domain of C code, we get the best benefits by moving as much of it as possible.

This is almost certainly true. There are a lot more cleanups possible here.
I want to nail down the 32-bit case first so we can delete the old code.

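As a rough idea of what the "single C function that returns a condition" could look like, here is a user-space sketch of the opportunistic SYSRET checks from the quoted asm. The struct, function names and *_SKETCH constants are hypothetical, chosen for illustration; the constant values are what I believe the 64-bit definitions to be, but treat them as assumptions. The canonical test compares the top 17 bits directly, which should give the same answer as the shl/sar pair for a 47-bit __VIRTUAL_MASK_SHIFT while avoiding signed shifts in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, named with a _SKETCH suffix to mark them as illustrative. */
#define VIRTUAL_MASK_SHIFT_SKETCH 47
#define EFLAGS_TF_SKETCH 0x00000100ULL  /* trap flag, bit 8 */
#define EFLAGS_RF_SKETCH 0x00010000ULL  /* resume flag, bit 16 */
#define USER_CS_SKETCH   0x33ULL
#define USER_DS_SKETCH   0x2bULL

/* Hypothetical view of the saved user register frame. */
struct frame_sketch {
	uint64_t r11, cx;
	uint64_t ip, cs, flags, sp, ss;
};

/* True if bits 63..47 are all equal, i.e. the address is canonical. */
static bool is_canonical_sketch(uint64_t addr)
{
	uint64_t top = addr >> VIRTUAL_MASK_SHIFT_SKETCH;   /* 17 bits */
	uint64_t all_ones = (1ULL << (64 - VIRTUAL_MASK_SHIFT_SKETCH)) - 1;

	return top == 0 || top == all_ones;
}

/* True if SYSRET would restore exactly the saved state, false for IRET. */
static bool can_sysret_sketch(const struct frame_sketch *f)
{
	if (f->cx != f->ip)                 /* SYSRET loads RIP from RCX */
		return false;
	if (!is_canonical_sketch(f->ip))    /* avoid #GP in kernel mode */
		return false;
	if (f->cs != USER_CS_SKETCH)        /* CS must match SYSRET */
		return false;
	if (f->r11 != f->flags)             /* SYSRET loads RFLAGS from R11 */
		return false;
	if (f->flags & (EFLAGS_RF_SKETCH | EFLAGS_TF_SKETCH))
		return false;               /* RF is lost, TF would loop */
	if (f->ss != USER_DS_SKETCH)        /* SS must match SYSRET */
		return false;
	return true;                        /* nothing to check for RSP */
}
```

Counting how often each early return is taken would also be one way to get the instrumentation Ingo mentions.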
>
> The only low level bits remaining in assembly will be low level hardware ABI
> details: saving registers and restoring registers to the expected format - no
> 'active' code whatsoever.

I think this is true for syscalls. Getting the weird special cases
(IRET and GS fault) for error_entry to work correctly in C could be
tricky.

--Andy