Re: [RFC 06/30] x86/sched/64: Don't save flags on context switch (reinstated)

From: Andy Lutomirski
Date: Thu Sep 24 2015 - 13:11:43 EST


On Tue, Sep 1, 2015 at 3:41 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> This reinstates 2c7577a75837 ("sched/x86_64: Don't save flags on
> context switch"), which was reverted in 512255a2ad2c.

Hi Ingo and Thomas-

I just realized that there's no good reason that this patch belongs
with the rest of the entry series -- it's totally independent. Should
I resend it by itself, or would you rather just apply it as is?

I'll send an updated entry series soon.

--Andy

>
> Historically, Linux has always saved and restored EFLAGS across
> context switches. As far as I know, the only reason to do this is
> because of the NT flag. In particular, if something calls switch_to
> with the NT flag set, then we don't want to leak the NT flag into a
> different task that might try to IRET and fail because NT is set.
>
> Before 8c7aa698baca ("x86_64, entry: Filter RFLAGS.NT on entry from
> userspace"), we could run system call bodies with NT set. This
> would be a DoS or possibly privilege escalation hole if scheduling
> in such a system call would leak NT into a different task.
>
> Importantly, we don't need to worry about NT being set while
> preemptible or across page faults. The only way we can schedule due
> to preemption or a page fault is in an interrupt entry that nests
> inside the SYSENTER prologue. The CPU will clear NT when entering
> through an interrupt gate, so we won't schedule with NT set.
>
> The only other interesting flags are IOPL and AC. Allowing
> switch_to to change IOPL has no effect, as the value loaded during
> kernel execution doesn't matter at all except between a SYSENTER
> entry and the subsequent PUSHF, and anything that interrupts in that
> window will restore IOPL on return.
>
> If we call __switch_to with AC set, we have bigger problems.
>
> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
> ---
> arch/x86/include/asm/switch_to.h | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index d7f3b3b78ac3..751bf4b7bf11 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -79,12 +79,12 @@ do { \
> #else /* CONFIG_X86_32 */
>
> /* frame pointer must be last for get_wchan */
> -#define SAVE_CONTEXT "pushf ; pushq %%rbp ; movq %%rsi,%%rbp\n\t"
> -#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp ; popf\t"
> +#define SAVE_CONTEXT "pushq %%rbp ; movq %%rsi,%%rbp\n\t"
> +#define RESTORE_CONTEXT "movq %%rbp,%%rsi ; popq %%rbp\t"
>
> #define __EXTRA_CLOBBER \
> , "rcx", "rbx", "rdx", "r8", "r9", "r10", "r11", \
> - "r12", "r13", "r14", "r15"
> + "r12", "r13", "r14", "r15", "flags"
>
> #ifdef CONFIG_CC_STACKPROTECTOR
> #define __switch_canary \
> @@ -100,7 +100,11 @@ do { \
> #define __switch_canary_iparam
> #endif /* CC_STACKPROTECTOR */
>
> -/* Save restore flags to clear handle leaking NT */
> +/*
> + * There is no need to save or restore flags, because flags are always
> + * clean in kernel mode, with the possible exception of IOPL. Kernel IOPL
> + * has no effect.
> + */
> #define switch_to(prev, next, last) \
> asm volatile(SAVE_CONTEXT \
> "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \
> --
> 2.4.3
>
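
For anyone following the flag discussion in the changelog above, here is
a minimal user-space sketch of the invariant it relies on. The bit
positions are the architectural EFLAGS/RFLAGS ones (IOPL in bits 12-13,
NT in bit 14, AC in bit 18); the macro and function names are made up
for illustration and are not the kernel's own definitions or helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural EFLAGS/RFLAGS bit positions (illustrative names). */
#define EFLAGS_IOPL (UINT64_C(3) << 12) /* I/O privilege level, bits 12-13 */
#define EFLAGS_NT   (UINT64_C(1) << 14) /* nested task */
#define EFLAGS_AC   (UINT64_C(1) << 18) /* alignment check / SMAP override */

/*
 * Hypothetical helper: returns nonzero if a flags value is safe to carry
 * across a context switch without pushf/popf. Only NT and AC are treated
 * as a problem; IOPL is deliberately ignored, following the changelog's
 * argument that the kernel's IOPL value has no effect.
 */
static int flags_clean_for_switch(uint64_t flags)
{
	return (flags & (EFLAGS_NT | EFLAGS_AC)) == 0;
}

/* Hypothetical debug check a caller could run before switching. */
static void debug_check_switch_flags(uint64_t flags)
{
	assert(flags_clean_for_switch(flags));
}
```

With this sketch, a typical kernel-mode value such as 0x202 (IF plus the
always-set bit 1) passes, as does the same value with IOPL set, while
anything with NT or AC set is rejected.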



--
Andy Lutomirski
AMA Capital Management, LLC