Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

From: Ingo Molnar
Date: Wed Jun 17 2015 - 06:32:45 EST

* Andy Lutomirski <luto@xxxxxxxxxx> wrote:

> The main things that are missing are that I haven't done the 32-bit parts
> (anyone want to help?) and therefore I haven't deleted the old C code. I also
> think this may break UML for trivial reasons.

So I'd suggest moving most of the SYSRET fast path to C too.

This is how it looks now, after your patches:

jnz tracesys
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx
call *sys_call_table(, %rax, 8)
movq %rax, RAX(%rsp)
/*
 * Syscall return path ending with SYSRET (fast path).
 * Has incompletely filled pt_regs.
 */
/*
 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
 * it is too small to ever cause noticeable irq latency.
 */

/*
 * We must check ti flags with interrupts (or at least preemption)
 * off because we must *never* return to userspace without
 * processing exit work that is enqueued if we're preempted here.
 * In particular, returning to userspace with any of the one-shot
 * very bad.
 */
jnz int_ret_from_sys_call_irqs_off /* Go to the slow path */

Most of that can be done in C.
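
As a rough illustration of what "done in C" could mean for the fast path above, here is a minimal user-space sketch. It is not kernel code: the struct, the two-entry table, the flag mask and every name ending in _sketch are hypothetical stand-ins invented for this example. It mirrors only the three steps visible in the assembly: bounds-check the syscall number, call through the table, then test for pending exit work.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the kernel's definitions. */
#define NR_SYSCALL_MAX_SKETCH	1u	/* highest valid syscall number */
#define WORK_PENDING_SKETCH	0x1u	/* stand-in for the exit-work flags */

struct fast_regs_sketch {
	uint64_t orig_ax;	/* syscall number */
	uint64_t ax;		/* return value, preset to -ENOSYS */
	uint64_t di, si, dx, r10, r8, r9;	/* the six arguments */
};

typedef int64_t (*syscall_fn_sketch)(uint64_t, uint64_t, uint64_t,
				     uint64_t, uint64_t, uint64_t);

static int64_t sys_zero_sketch(uint64_t a, uint64_t b, uint64_t c,
			       uint64_t d, uint64_t e, uint64_t f)
{
	(void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
	return 0;
}

static int64_t sys_add_sketch(uint64_t a, uint64_t b, uint64_t c,
			      uint64_t d, uint64_t e, uint64_t f)
{
	(void)c; (void)d; (void)e; (void)f;
	return (int64_t)(a + b);
}

static const syscall_fn_sketch call_table_sketch[NR_SYSCALL_MAX_SKETCH + 1] = {
	sys_zero_sketch,
	sys_add_sketch,
};

/*
 * Returns 1 if the fast return path may be taken, 0 if the caller has
 * to go to the slow path. In the kernel the flag test must run with
 * interrupts off; this mock does not model that.
 */
static int syscall_fast_path_sketch(struct fast_regs_sketch *regs,
				    uint32_t thread_flags)
{
	uint64_t nr = regs->orig_ax;

	/* Out-of-range numbers keep the preset -ENOSYS in regs->ax. */
	if (nr <= NR_SYSCALL_MAX_SKETCH)
		regs->ax = (uint64_t)call_table_sketch[nr](regs->di, regs->si,
							   regs->dx, regs->r10,
							   regs->r8, regs->r9);

	return (thread_flags & WORK_PENDING_SKETCH) == 0;
}
```

The fourth argument is read straight from r10 here, so the "movq %r10, %rcx" shuffle needed for the C calling convention in the assembly has no counterpart in the sketch.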

And I think we could convert the IRET syscall return slow path to C too:

movq %rsp, %rdi
call syscall_return_slowpath /* returns with IRQs disabled */

/*
 * Try to use SYSRET instead of IRET if we're returning to
 * a completely clean 64-bit userspace context.
 */
movq RCX(%rsp), %rcx
movq RIP(%rsp), %r11
cmpq %rcx, %r11 /* RCX == RIP */
jne opportunistic_sysret_failed

/*
 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
 * in kernel space. This essentially lets the user take over
 * the kernel, since userspace controls RSP.
 * If width of "canonical tail" ever becomes variable, this will need
 * to be updated to remain correct on both old and new CPUs.
 */
.ifne __VIRTUAL_MASK_SHIFT - 47
.error "virtual address width changed -- SYSRET checks need update"
.endif

/* Change top 16 bits to be the sign-extension of 47th bit */
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx

/* If this changed %rcx, it was not canonical */
cmpq %rcx, %r11
jne opportunistic_sysret_failed

cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
jne opportunistic_sysret_failed

movq R11(%rsp), %r11
cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
jne opportunistic_sysret_failed

/*
 * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET,
 * restoring TF results in a trap from userspace immediately after
 * SYSRET. This would cause an infinite loop whenever #DB happens
 * with register state that satisfies the opportunistic SYSRET
 * conditions. For example, single-stepping this user code:
 *
 *           movq $stuck_here, %rcx
 *           pushfq
 *           popq %r11
 *   stuck_here:
 *
 * would never get past 'stuck_here'.
 */
testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
jnz opportunistic_sysret_failed

/* nothing to check for RSP */

cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
jne opportunistic_sysret_failed

/*
 * We win! This label is here just for ease of understanding
 * perf profiles. Nothing jumps here.
 */
/* rcx and r11 are already restored (see code above) */
movq RSP(%rsp), %rsp

jmp restore_c_regs_and_iret

Basically there would be a single C function we'd call, which returns a condition
(or fixes up its return address on the stack directly) to determine between the
SYSRET and IRET return paths.
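
To make that concrete, here is a minimal user-space sketch of such a function, assuming it returns a boolean rather than patching the return address. Every name in it (ret_regs_sketch, can_use_sysret_sketch, the _SKETCH constants) is hypothetical, and the struct is a cut-down stand-in for pt_regs. The selector values are the usual x86-64 ones but are only illustrative here. The checks follow the assembly above one for one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's constants. */
#define VIRTUAL_MASK_SHIFT_SKETCH	47
#define USER_CS_SKETCH			0x33u
#define USER_DS_SKETCH			0x2bu
#define EFLAGS_TF_SKETCH		0x00000100ull	/* trap flag, bit 8 */
#define EFLAGS_RF_SKETCH		0x00010000ull	/* resume flag, bit 16 */

/* Cut-down stand-in for the saved user register frame. */
struct ret_regs_sketch {
	uint64_t r11, cx, ip, cs, flags, sp, ss;
};

/*
 * An address is canonical when bits 63..47 are all equal. This gives
 * the same answer as the shl/sar sign-extension pair in the assembly,
 * without relying on signed right shifts in C.
 */
static bool is_canonical_sketch(uint64_t addr)
{
	uint64_t top = addr >> VIRTUAL_MASK_SHIFT_SKETCH;	/* 17 bits */

	return top == 0 || top == 0x1ffffull;
}

/* Returns true for the SYSRET path, false for the IRET path. */
static bool can_use_sysret_sketch(const struct ret_regs_sketch *regs)
{
	if (regs->cx != regs->ip)		/* RCX == RIP */
		return false;
	if (!is_canonical_sketch(regs->ip))	/* SYSRET would #GP otherwise */
		return false;
	if (regs->cs != USER_CS_SKETCH)		/* CS must match SYSRET */
		return false;
	if (regs->r11 != regs->flags)		/* R11 == RFLAGS */
		return false;
	if (regs->flags & (EFLAGS_RF_SKETCH | EFLAGS_TF_SKETCH))
		return false;			/* RF/TF need IRET */
	/* nothing to check for RSP */
	if (regs->ss != USER_DS_SKETCH)		/* SS must match SYSRET */
		return false;
	return true;
}
```

The assembly stub would then only need to test the return value and branch to either the SYSRET or the IRET register-restore sequence.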

Moving this to C too has immediate benefits: that way we could easily add
instrumentation to see how efficient these various return methods are, etc.
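
As an example of how cheap such instrumentation becomes once the decision lives in C, here is a small user-space sketch that counts which return method was chosen. The names are hypothetical, and a plain static array stands in for whatever per-CPU counters a real implementation would presumably use.

```c
#include <stdint.h>

/* Hypothetical return-method identifiers for this sketch. */
enum ret_method_sketch {
	RET_SYSRET_SKETCH,
	RET_IRET_SKETCH,
	RET_METHOD_COUNT_SKETCH
};

/* Plain counters; a kernel version would likely be per-CPU. */
static uint64_t ret_counts_sketch[RET_METHOD_COUNT_SKETCH];

/* Pick the return method from the eligibility result and count it. */
static enum ret_method_sketch choose_return_sketch(int sysret_ok)
{
	enum ret_method_sketch m =
		sysret_ok ? RET_SYSRET_SKETCH : RET_IRET_SKETCH;

	ret_counts_sketch[m]++;
	return m;
}

/* Share of returns that used SYSRET, in whole percent. */
static unsigned int sysret_percent_sketch(void)
{
	uint64_t total = ret_counts_sketch[RET_SYSRET_SKETCH] +
			 ret_counts_sketch[RET_IRET_SKETCH];

	if (total == 0)
		return 0;
	return (unsigned int)(ret_counts_sketch[RET_SYSRET_SKETCH] * 100 / total);
}
```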

I.e. I don't think there are two ways about this: once the entry code moves to the
domain of C code, we get the best benefits by moving as much of it as possible.

The only low level bits remaining in assembly will be low level hardware ABI
details: saving registers and restoring registers to the expected format - no
'active' code whatsoever.

