Re: [PATCH] x86/fpu: Correct pkru/xstate inconsistency

From: Guenter Roeck
Date: Tue Feb 15 2022 - 10:57:35 EST


Hi Brian,

On Tue, Feb 15, 2022 at 7:37 AM Brian Geffon <bgeffon@xxxxxxxxxx> wrote:
>
> There are two issues with PKRU handling prior to 5.13. The first is that

Nice catch and work. One question, though: From the above, it seems
like this patch only applies to kernels earlier than v5.13 or, more
specifically, to v5.4.y and v5.10.y. Is this correct, or should it be
applied to the upstream kernel and to all applicable stable releases ?

Thanks,
Guenter

> when eagerly switching PKRU we check that current is not a kernel
> thread as kernel threads will never use PKRU. It's possible that
> this_cpu_read_stable() on current_task (ie. get_current()) is returning
> an old cached value. By forcing the read with this_cpu_read() the
> correct task is used. Without this it's possible when switching from
> a kernel thread to a userspace thread that we'll still observe the
> PF_KTHREAD flag and never restore the PKRU. And as a result this
> issue only occurs when switching from a kernel thread to a userspace
> thread, switching from a non kernel thread works perfectly fine because
> all we consider in that situation is the flags from some other non
> kernel task and the next fpu is passed in to switch_fpu_finish().
>
> Without reloading the value finish_fpu_load() after being inlined into
> __switch_to() uses a stale value of current:
>
> ba1: 8b 35 00 00 00 00 mov 0x0(%rip),%esi
> ba7: f0 41 80 4d 01 40 lock orb $0x40,0x1(%r13)
> bad: e9 00 00 00 00 jmp bb2 <__switch_to+0x1eb>
> bb2: 41 f6 45 3e 20 testb $0x20,0x3e(%r13)
> bb7: 75 1c jne bd5 <__switch_to+0x20e>
>
> By using this_cpu_read() and avoiding the cached value the compiler does
> insert an additional load instruction and observes the correct value now:
>
> ba1: 8b 35 00 00 00 00 mov 0x0(%rip),%esi
> ba7: f0 41 80 4d 01 40 lock orb $0x40,0x1(%r13)
> bad: e9 00 00 00 00 jmp bb2 <__switch_to+0x1eb>
> bb2: 65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax
> bb9: 00
> bba: f6 40 3e 20 testb $0x20,0x3e(%rax)
> bbe: 75 1c jne bdc <__switch_to+0x215>
>
> The second issue is when using write_pkru() we only write to the
> xstate when the feature bit is set because get_xsave_addr() returns
> NULL when the feature bit is not set. This is problematic as the CPU
> is free to clear the feature bit when it observes the xstate in the
> init state, this behavior seems to be documented a few places throughout
> the kernel. If the bit was cleared then in write_pkru() we would happily
> write to PKRU without ever updating the xstate, and the FPU restore on
> return to userspace would load the old value agian.
>
> Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state")
> Signed-off-by: Brian Geffon <bgeffon@xxxxxxxxxx>
> Signed-off-by: Willis Kung <williskung@xxxxxxxxxx>
> Tested-by: Willis Kung <williskung@xxxxxxxxxx>
> ---
> arch/x86/include/asm/fpu/internal.h | 2 +-
> arch/x86/include/asm/pgtable.h | 14 ++++++++++----
> 2 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 03b3de491b5e..540bda5bdd28 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
> * PKRU state is switched eagerly because it needs to be valid before we
> * return to userland e.g. for a copy_to_user() operation.
> */
> - if (!(current->flags & PF_KTHREAD)) {
> + if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) {
> /*
> * If the PKRU bit in xsave.header.xfeatures is not set,
> * then the PKRU component was in init state, which means
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 9e71bf86d8d0..aa381b530de0 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru)
> if (!boot_cpu_has(X86_FEATURE_OSPKE))
> return;
>
> - pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> -
> /*
> * The PKRU value in xstate needs to be in sync with the value that is
> * written to the CPU. The FPU restore on return to userland would
> * otherwise load the previous value again.
> */
> fpregs_lock();
> - if (pk)
> - pk->pkru = pkru;
> + /*
> + * The CPU is free to clear the feature bit when the xstate is in the
> + * init state. For this reason, we need to make sure the feature bit is
> + * reset when we're explicitly writing to pkru. If we did not then we
> + * would write to pkru and it would not be saved on a context switch.
> + */
> + current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;
> + pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> + BUG_ON(!pk);
> + pk->pkru = pkru;
> __write_pkru(pkru);
> fpregs_unlock();
> }
> --
> 2.35.1.265.g69c8d7142f-goog
>