Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

From: Denys Vlasenko
Date: Fri Apr 24 2015 - 06:00:27 EST


On 04/24/2015 04:15 AM, Andy Lutomirski wrote:
> AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET
> with SS == 0 results in an invalid usermode state in which SS is
> apparently equal to __USER_DS but causes #SS if used.
>
> Work around the issue by replacing NULL SS values with __KERNEL_DS
> in __switch_to, thus ensuring that SYSRET never happens with SS set
> to NULL.
>
> This was exposed by a recent vDSO cleanup.
>
> Fixes: e7d6eefaaa44 x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
> ---
>
> Tested only on Intel, which isn't very interesting. I'll tidy up
> and send a test case, too, once Borislav confirms that it works.
>
> Please don't actually apply this until we're sure we understand the
> scope of the issue. If this doesn't affect SYSRETQ, then we might
> to fix it on before SYSRETL to avoid impacting 64-bit processes
> at all.
>
> arch/x86/ia32/ia32entry.S | 5 +++++
> arch/x86/include/asm/cpufeature.h | 1 +
> arch/x86/kernel/cpu/amd.c | 3 +++
> arch/x86/kernel/entry_64.S | 6 ++++++
> arch/x86/kernel/process_64.c | 25 +++++++++++++++++++++++++
> 5 files changed, 40 insertions(+)
>
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 3c9fadf8b95c..ff519ea8b054 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -421,6 +421,11 @@ sysretl_from_sys_call:
> * cs and ss are loaded from MSRs.
> * (Note: 32bit->32bit SYSRET is different: since r11
> * does not exist, it merely sets eflags.IF=1).
> + *
> + * NB: On AMD CPUs with the X86_BUG_SYSRET_SS_ATTRS bug, the ss
> + * descriptor is not reinitialized. This means that we must
> + * avoid SYSRET with SS == NULL. We prevent that from happening
> + * by reloading SS in __switch_to.
> */
> USERGS_SYSRET32
>
> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index 854c04b3c9c2..7e244f626301 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -257,6 +257,7 @@
> #define X86_BUG_11AP X86_BUG(5) /* Bad local APIC aka 11AP */
> #define X86_BUG_FXSAVE_LEAK X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
> #define X86_BUG_CLFLUSH_MONITOR X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
> +#define X86_BUG_SYSRET_SS_ATTRS X86_BUG(8) /* SYSRET doesn't fix up SS attrs */
>
> #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
>
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index fd470ebf924e..e4cf63301ff4 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -720,6 +720,9 @@ static void init_amd(struct cpuinfo_x86 *c)
> if (!cpu_has(c, X86_FEATURE_3DNOWPREFETCH))
> if (cpu_has(c, X86_FEATURE_3DNOW) || cpu_has(c, X86_FEATURE_LM))
> set_cpu_cap(c, X86_FEATURE_3DNOWPREFETCH);
> +
> + /* AMD CPUs don't reset SS attributes on SYSRET */
> + set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
> }
>
> #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 0034012d04ea..ffeaed0534d8 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -295,6 +295,12 @@ system_call_fastpath:
> * rflags from r11 (but RF and VM bits are forced to 0),
> * cs and ss are loaded from MSRs.
> * Restoration of rflags re-enables interrupts.
> + *
> + * NB: On AMD CPUs with the X86_BUG_SYSRET_SS_ATTRS bug, the ss
> + * descriptor is not reinitialized. This means that we should
> + * avoid SYSRET with SS == NULL. We prevent that from happening
> + * by reloading SS in __switch_to. (Actually detecting the failure
> + * in 64-bit userspace is tricky but can be done.)
> */
> USERGS_SYSRET64
>
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 4baaa972f52a..919d6c2abded 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -419,6 +419,31 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
> __switch_to_xtra(prev_p, next_p, tss);
>
> + if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) {
> + /*
> + * AMD CPUs have a misfeature: SYSRET sets the SS selector
> + * but does not update the cached descriptor. As a result,
> + * if we do SYSRET while SS is NULL, we'll end up in user
> + * mode with SS apparently equal to __USER_DS but actually
> + * unusable.
> + *
> + * The straightforward workaround would be to fix it up
> + * just before SYSRET, but that would slow down the system
> + * call fast paths. Instead, we ensure that SS is never NULL
> + * in system call context. We do this by replacing NULL
> + * SS selectors at every context switch. SYSCALL sets up
> + * a valid SS, so the only way to get NULL is to re-enter
> + * the kernel from CPL 3 through an interrupt. Since that
> + * can't happen in the same task as a running syscall, we
> + * are guaranteed to context switch between every interrupt
> + * vector entry and a subsequent SYSRET.
> + */
> + unsigned short ss_sel;
> + savesegment(ss, ss_sel);
> + if (ss_sel == 0)
> + loadsegment(ss, __KERNEL_DS);

I propose a more conservative check:

if (ss_sel != __KERNEL_DS)
loadsegment(ss, __KERNEL_DS);

I would propose this even if I would see no real case where it matters...
but I even do see such a case.

24593_APM_v21.pdf, page 110:

"""
Figure 4-35 on page 110 shows a long-mode stack switch. In long mode, all call gates must reference
64-bit code-segment descriptors, so a long-mode stack switch uses a 64-bit stack. The process of
switching stacks in long mode is similar to switching in legacy mode when no parameters are passed.
The process is as follows:

1. The target code-segment DPL is read by the processor and used as an index into the 64-bit TSS
for selecting the new stack pointer (RSP).

2. The RSP register is loaded with the new RSP value read from the TSS. The SS register is loaded
with a null selector (SS=0). Setting the new SS selector to null allows proper handling of nested
control transfers in 64-bit mode. See “Nested Returns to 64-Bit Mode Procedures” on page 112
for additional information.

As in legacy mode, it is desirable to keep the stack-segment requestor privilege-level (SS.RPL)
equal to the current privilege-level (CPL). When using a call gate to change privilege levels, the
SS.RPL is updated to reflect the new CPL. The SS.RPL is restored from the return-target CS.RPL
on the subsequent privilege-level-changing far return.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ THIS ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3. The old values of the SS and RSP registers are pushed onto the stack pointed to by the new RSP.
...
...
"""

Thus, the NULL selector in SS may actually be not all zeros. Its RPL may be nonzero,
if we are not in ring 0.

And in some paravirt setups kernel does run in a nonzero ring:

arch/x86/include/asm/paravirt.h:

#define get_kernel_rpl() (pv_info.kernel_rpl)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/