Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base

From: Andy Lutomirski
Date: Fri Apr 10 2015 - 15:13:19 EST


On Fri, Apr 10, 2015 at 8:50 AM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> From: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>
> Introduction:
>
> IvyBridge added four new instructions to directly write the fs and gs
> 64bit base registers. Previously this had to be done with a system
> call to write to MSRs. The main use case is fast user space threading
> and switching the fs/gs registers quickly there. Another use
> case is having (relatively) cheap access to a new address
> register per thread.
>
> The instructions are opt-in and have to be explicitely enabled
> by the OS.
>
> For more details on how to use the instructions see
> Documentation/x86/fsgs.txt added in a followon patch.
>
> Paranoid exception path changes:
> ===============================
>
> The paranoid entry/exit code is used for any NMI like
> exception.
>
> Previously Linux couldn't support the new instructions
> because the paranoid entry code relied on the gs base never being
> negative outside the kernel to decide when to use swaps. It would
> check the gs MSR value and assume it was already running in
> kernel if negative.
>
> To get rid of this assumption we have to revamp the paranoid exception
> path to not rely on this. We can use the new instructions
> to get (relatively) quick access to the GS value, and use
> it directly.
>
> This is also significantly faster than a MSR read, so will speed
> NMIs (critical for profiling)
>
> The kernel gs for the paranoid path is now stored at the
> bottom of the IST stack (so that it can be derived from RSP).
> For this we need to know the size of the IST stack
> (4K or 8K), which is now passed in as a stack parameter
> to save_paranoid.
>
> The original patch compared the gs with the kernel gs and
> assumed that if it was identical, swapgs was not needed
> (and no user space processing was needed). This
> was nice and simple and didn't need a lot of changes.
>
> But this had the side effect that if a user process set its
> GS to the same as the kernel it may lose rescheduling
> checks (so a racing reschedule IPI would have been
> only acted upon the next non paranoid interrupt)
>
> This version now switches to full save/restore of the GS.
> This requires quite some changes in the paranoid path.
> Unfortunately I didn't come up with a simpler scheme.
>
> Previously we had a flag in EBX that indicated whether
> SWAPGS needs to be called later or not. In the new scheme
> this turns into a tristate, with a new "restore from R15"
> mode that is used when the FSGSBASE instructions are available.
> In this case the GS base is saved and restored.
> The exit paths are all adjusted to handle this correctly.
>
> So the three possible states for the paranoid exit path are:
>
> - Do nothing (pre FSGSBASE), if we interrupted the kernel
> as determined by the existing negative GS
> - Do swapgs, if we interrupted user space with positive GS
> (pre FSGSBASE), or we saved gs, but it was overwritten
> later by a context switch (with FSGSBASE)

We never run paranoid_exit if we interrupted userspace, and we can't
context switch on the IST stack, so I don't see how this is possible.

> - Restore from R15 (with FSGSBASE), if the gs base was saved
> in R15

What about case 4: we interrupted the kernel with usergs? (The code
seems more correct in this regard, but this description above is
confusing to me.)

>
> Context switch changes:
> ======================
>
> Then after these changes we need to also use the new instructions
> to save/restore fs and gs, so that the new values set by the
> users won't disappear. This is also significantly
> faster for the case when the 64bit base has to be switched
> (that is when GS is larger than 4GB), as we can replace
> the slow MSR write with a faster wr[fg]sbase execution.
>
> This is in term enables fast switching when there are
> enough threads that their TLS segment does not fit below 4GB
> (or with some newer systems which don't properly hint the
> stack limit), or alternatively programs that use fs as an additional base
> register will not get a sigificant context switch penalty.
>
> It is all done in a single patch because there was no
> simple way to do it in pieces without having crash
> holes inbetween.
>
> v2: Change to save/restore GS instead of using swapgs
> based on the value. Large scale changes.
> v3: Fix wrong flag initialization in fallback path.
> Thanks 0day!
> v4: Make swapgs code paths kprobes safe.
> Port to new base line code which now switches indexes.
> v5: Port to new kernel which avoids paranoid entry for ring 3.
> Removed some code that handled this previously.
> Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> ---
> arch/x86/kernel/cpu/common.c | 6 +++
> arch/x86/kernel/entry_64.S | 114 ++++++++++++++++++++++++++++++++++---------
> arch/x86/kernel/process_64.c | 42 ++++++++++++++--
> 3 files changed, 134 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 33a0293..7df88a3 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -957,6 +957,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
> #ifdef CONFIG_NUMA
> numa_add_cpu(smp_processor_id());
> #endif
> +
> + if (cpu_has(c, X86_FEATURE_FSGSBASE))
> + cr4_set_bits(X86_CR4_FSGSBASE);
> }
>
> #ifdef CONFIG_X86_64
> @@ -1351,10 +1354,13 @@ void cpu_init(void)
> */
> if (!oist->ist[0]) {
> char *estacks = per_cpu(exception_stacks, cpu);
> + void *gs = per_cpu(irq_stack_union.gs_base, cpu);
>
> for (v = 0; v < N_EXCEPTION_STACKS; v++) {
> if (v == DEBUG_STACK - 1)
> estacks = per_cpu(debug_stack, cpu);
> + /* Store GS at bottom of stack for bootstrap access */
> + *(void **)estacks = gs;
> estacks += exception_stack_sizes[v];
> oist->ist[v] = t->x86_tss.ist[v] =
> (unsigned long)estacks;

Seems reasonable to me.

You could possibly simplify some things if you wrote the address to
the bottom of *each* debug stack. Then you wouldn't need the extra
alignment stuff.

> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index f0095a7..0b74ab0 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -58,6 +58,8 @@
> #include <asm/context_tracking.h>
> #include <asm/smap.h>
> #include <asm/pgtable_types.h>
> +#include <asm/alternative-asm.h>
> +#include <asm/fsgs.h>
> #include <linux/err.h>
>
> /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
> @@ -218,32 +220,74 @@ ENDPROC(native_usergs_sysret64)
> CFI_REL_OFFSET r15, R15+\offset
> .endm
>
> +/* Values of the ebx flag: */
> +#define DO_RESTORE_R15 2 /* Restore GS at end */
> +#define DO_SWAPGS 1 /* Use SWAPGS at end */
> +#define DO_NOTHING 0 /* Back to ring 0 with same gs */
> +
> +/*
> + * Stack layout:
> + * +16 pt_regs
> + * +8 stack mask for ist or 0

What does that mean?

Oh, I get it. It's the size of the IST stack we're on. Let's please
make all IST stacks the same size as suggested above and get rid of
this. After all, they really are all the same size -- there's just
more than one debug stack.

> + * +0 return address.
> + */
> +#define OFF 16
> +#define STKMSK 8
> +
> ENTRY(save_paranoid)
> - XCPT_FRAME 1 RDI+8
> + XCPT_FRAME 1 RDI+OFF
> cld
> - movq %rdi, RDI+8(%rsp)
> - movq %rsi, RSI+8(%rsp)
> - movq_cfi rdx, RDX+8
> - movq_cfi rcx, RCX+8
> - movq_cfi rax, RAX+8
> - movq %r8, R8+8(%rsp)
> - movq %r9, R9+8(%rsp)
> - movq %r10, R10+8(%rsp)
> - movq %r11, R11+8(%rsp)
> - movq_cfi rbx, RBX+8
> - movq %rbp, RBP+8(%rsp)
> - movq %r12, R12+8(%rsp)
> - movq %r13, R13+8(%rsp)
> - movq %r14, R14+8(%rsp)
> - movq %r15, R15+8(%rsp)
> - movl $1,%ebx
> + movq %rdi, RDI+OFF(%rsp)
> + movq %rsi, RSI+OFF(%rsp)
> + movq_cfi rdx, RDX+OFF
> + movq_cfi rcx, RCX+OFF
> + movq_cfi rax, RAX+OFF
> + movq %r8, R8+OFF(%rsp)
> + movq %r9, R9+OFF(%rsp)
> + movq %r10, R10+OFF(%rsp)
> + movq %r11, R11+OFF(%rsp)
> + movq_cfi rbx, RBX+OFF
> + movq %rbp, RBP+OFF(%rsp)
> + movq %r12, R12+OFF(%rsp)
> + movq %r13, R13+OFF(%rsp)
> + movq %r14, R14+OFF(%rsp)
> + movq %r15, R15+OFF(%rsp)
> + movq $-1,ORIG_RAX+OFF(%rsp) /* no syscall to restart */
> +33:
> + ASM_NOP5 /* May be replaced with jump to paranoid_save_gs */

Is there some reason that the normal ALTERNATIVE macro can't be used here?

> +34:
> + movl $DO_NOTHING,%ebx
> movl $MSR_GS_BASE,%ecx
> rdmsr
> testl %edx,%edx
> js 1f /* negative -> in kernel */
> SWAPGS
> - xorl %ebx,%ebx
> + movl $DO_SWAPGS,%ebx
> 1: ret
> +
> + /* Patch in jump to paranoid_save_gs for X86_FEATURE_FSGSBASE */
> + .section .altinstr_replacement,"ax"
> +35: .byte 0xe9 /* 32bit near jump */
> + .long paranoid_save_gs-34b
> + .previous
> + .section .altinstructions,"a"
> + altinstruction_entry 33b,35b,X86_FEATURE_FSGSBASE,5,5
> + .previous
> +
> + /* Faster version not using RDMSR, and also not assuming
> + * anything about the previous GS value.
> + * This allows the user to arbitarily change GS using
> + * WRGSBASE.
> + */
> +paranoid_save_gs:
> + RDGSBASE_R15 # read previous gs
> + movq STKMSK(%rsp),%rax # get ist stack mask
> + andq %rsp,%rax # get bottom of stack
> + movq (%rax),%rdi # get expected GS
> + WRGSBASE_RDI # set gs for kernel
> + mov $DO_RESTORE_R15,%ebx # flag for exit code

I think this code isn't needed. There's a CPU feature that tells us
that we're in this case, so why do we need a marker in ebx?

> + ret
> +
> CFI_ENDPROC
> END(save_paranoid)
>
> @@ -1026,7 +1070,8 @@ apicinterrupt IRQ_WORK_VECTOR \
> */
> #define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8)
>
> -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
> +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 \
> + stack_mask=-EXCEPTION_STKSZ

This can be removed as well, I think.

> ENTRY(\sym)
> /* Sanity check */
> .if \shift_ist != -1 && \paranoid == 0
> @@ -1055,7 +1100,10 @@ ENTRY(\sym)
> testl $3, CS(%rsp) /* If coming from userspace, switch */
> jnz 1f /* stacks. */
> .endif
> + pushq_cfi $\stack_mask /* ist stack size */
> call save_paranoid
> + /* r15: previous gs */
> + popq_cfi %rax /* Drop ist stack size */

Ditto.

> .else
> call error_entry
> .endif
> @@ -1090,7 +1138,7 @@ ENTRY(\sym)
> .endif
>
> .if \paranoid
> - jmp paranoid_exit /* %ebx: no swapgs flag */
> + jmp paranoid_exit /* %ebx: no swapgs flag, r15: old gs */
> .else
> jmp error_exit /* %ebx: no swapgs flag */
> .endif
> @@ -1311,9 +1359,12 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
> hyperv_callback_vector hyperv_vector_handler
> #endif /* CONFIG_HYPERV */
>
> -idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
> -idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
> +idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
> + stack_mask=-DEBUG_STKSZ
> +idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
> + stack_mask=-DEBUG_STKSZ
> idtentry stack_segment do_stack_segment has_error_code=1
> +

Ditto.

> #ifdef CONFIG_XEN
> idtentry xen_debug do_debug has_error_code=0
> idtentry xen_int3 do_int3 has_error_code=0
> @@ -1339,17 +1390,25 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
> * to try to handle preemption here.
> */
>
> - /* ebx: no swapgs flag */
> + /* ebx: no swapgs flag, r15: gs if ebx == DO_RESTORE_R15 */
> ENTRY(paranoid_exit)
> DEFAULT_FRAME
> DISABLE_INTERRUPTS(CLBR_NONE)
> TRACE_IRQS_OFF_DEBUG
> testl %ebx,%ebx /* swapgs needed? */
> jnz paranoid_restore
> + cmpl $DO_NOTHING,%ebx /* swapgs needed? */
> + je paranoid_restore

Something's wrong here, I think. DO_NOTHING == 0, so this is just
duplicate code.

> TRACE_IRQS_IRETQ 0
> + cmpl $DO_RESTORE_R15,%ebx /* restore gs? */
> + je paranoid_restore_gs

Can't we use alternatives here?

> SWAPGS_UNSAFE_STACK
> RESTORE_ALL 8
> +paranoid_restore_gs:
> + WRGSBASE_R15

I'd rather jump back a couple instructions and avoid lots of
duplication below, especially since...

> + RESTORE_ALL 8
> INTERRUPT_RETURN
> + jmp irq_return

... this jmp irq_return is dead code.

> paranoid_restore:
> TRACE_IRQS_IRETQ_DEBUG 0
> RESTORE_ALL 8
> @@ -1650,6 +1709,7 @@ end_repeat_nmi:
> pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */
> subq $ORIG_RAX-R15, %rsp
> CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
> + pushq_cfi $-EXCEPTION_STKSZ /* ist stack size */

This needs Steven's blessing, I think. This stuff is already very twisted.

> /*
> * Use save_paranoid to handle SWAPGS, but no need to use paranoid_exit
> * as we should not be calling schedule in NMI context.
> @@ -1658,6 +1718,8 @@ end_repeat_nmi:
> * exceptions might do.
> */
> call save_paranoid
> + /* r15: previous gs */
> + popq_cfi %rax /* pop ist size */
> DEFAULT_FRAME 0
>
> /*
> @@ -1682,11 +1744,15 @@ end_repeat_nmi:
> je 1f
> movq %r12, %cr2
> 1:
> -
> + cmpl $DO_RESTORE_R15,%ebx
> + je nmi_gs_restore
> testl %ebx,%ebx /* swapgs needed? */
> jnz nmi_restore
> nmi_swapgs:
> SWAPGS_UNSAFE_STACK
> + jmp nmi_restore
> +nmi_gs_restore:
> + WRGSBASE_R15 /* restore gs */
> nmi_restore:
> /* Pop the extra iret frame at once */
> RESTORE_ALL 6*8
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 67fcc43..3019c51 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -49,6 +49,7 @@
> #include <asm/syscalls.h>
> #include <asm/debugreg.h>
> #include <asm/switch_to.h>
> +#include <asm/fsgs.h>
>
> asmlinkage extern void ret_from_fork(void);
>
> @@ -261,6 +262,27 @@ void start_thread_ia32(struct pt_regs *regs, u32 new_ip, u32 new_sp)
> }
> #endif
>
> +/* Out of line to be protected from kprobes. */
> +
> +/* Interrupts are disabled here. */
> +static noinline __kprobes void switch_gs(unsigned long gs)
> +{
> + swapgs();
> + wrgsbase(gs);
> + swapgs();
> +}

NOKPROBE_SYMBOL is preferred these days, I think. Also, let's call
this something more helpful like write_user_gsbase.

> +
> +/* Interrupts are disabled here. */
> +static noinline __kprobes unsigned long read_user_gs(void)
> +{
> + unsigned long gs;
> +
> + swapgs();
> + gs = rdgsbase();
> + swapgs();
> + return gs;
> +}

How about read_user_gsbase?

> +
> /*
> * switch_to(x,y) should switch tasks from x to y.
> *
> @@ -293,6 +315,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> */
> savesegment(fs, fsindex);
> savesegment(gs, gsindex);
> + if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
> + prev->fs = rdfsbase();
> + prev->gs = read_user_gs();
> + }
>
> /*
> * Load TLS before restoring any segments so that segment loads
> @@ -381,8 +407,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> if (fsindex)
> prev->fs = 0;

Unless I misunderstood the docs, the assumption that fsindex != 0
implies no fsbase override isn't really accurate any more. (It never
was really accurate, but it used to be a lot closer to the truth.)

> }
> - if (next->fs)
> - wrmsrl(MSR_FS_BASE, next->fs);
> + if (next->fs) {
> + if (static_cpu_has(X86_FEATURE_FSGSBASE))
> + wrfsbase(next->fs);
> + else
> + wrmsrl(MSR_FS_BASE, next->fs);
> + }
> prev->fsindex = fsindex;
>
> if (unlikely(gsindex | next->gsindex | prev->gs)) {
> @@ -392,8 +422,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> if (gsindex)
> prev->gs = 0;
> }
> - if (next->gs)
> - wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
> + if (next->gs) {
> + if (static_cpu_has(X86_FEATURE_FSGSBASE))
> + switch_gs(next->gs);
> + else
> + wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
> + }
> prev->gsindex = gsindex;
>
> switch_fpu_finish(next_p, fpu);
> --
> 1.9.3
>

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/