Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace
From: Andy Lutomirski
Date: Sat Oct 01 2016 - 19:44:51 EST
On Sat, Oct 1, 2016 at 1:31 PM, <riel@xxxxxxxxxx> wrote:
> From: Rik van Riel <riel@xxxxxxxxxx>
>
> Delay the loading of FPU registers until a process switches back to
> userspace. This allows us to skip FPU saving & restoring for kernel
> threads, the idle task, and tasks that are spinning in kernel space.
>
> It also allows us to not repeatedly save & restore the userspace FPU
> context on repeated invocations of kernel_fpu_start & kernel_fpu_end.
>
> Not overwriting the FPU state of a task unless we need to also allows
> us to be be lazier about restoring it, in a future patch.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
> ---
> arch/x86/entry/common.c | 4 ++++
> arch/x86/include/asm/fpu/api.h | 5 +++++
> arch/x86/include/asm/fpu/internal.h | 44 +++++++++----------------------------
> arch/x86/include/asm/thread_info.h | 4 +++-
> arch/x86/kernel/fpu/core.c | 17 ++++++++------
> arch/x86/kernel/process.c | 35 +++++++++++++++++++++++++++++
> arch/x86/kernel/process_32.c | 5 ++---
> arch/x86/kernel/process_64.c | 5 ++---
> 8 files changed, 71 insertions(+), 48 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 1433f6b4607d..a69bbefa3408 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -27,6 +27,7 @@
> #include <asm/vdso.h>
> #include <asm/uaccess.h>
> #include <asm/cpufeature.h>
> +#include <asm/fpu/api.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/syscalls.h>
> @@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
> if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
> exit_to_usermode_loop(regs, cached_flags);
>
> + if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> + switch_fpu_return();
> +
How about:
if (unlikely(...)) {
exit_to_usermode_loop(regs, cached_flags);
cached_flags = READ_ONCE(ti->flags);
}
if (ti->flags & _TIF_LOAD_FPU) {
clear_thread_flag(TIF_LOAD_FPU);
switch_fpu_return();
}
or something along those lines. The issue is that
test_and_clear_thread_flag is unconditionally atomic, which means that
it will slow down the common fast case by quite a bit.
Alternatively, you could try to jam it into thread_struct::status,
which would let you avoid an atomic operation when you clear it, but
you'd still need an atomic op to set it, so this might not be a win.
> +static inline void switch_fpu_finish(void)
> {
> - bool preload;
> - /*
> - * If the task has used the math, pre-load the FPU on xsave processors
> - * or if the past 5 consecutive context-switches used math.
> - */
> - preload = static_cpu_has(X86_FEATURE_FPU) &&
> - new_fpu->fpstate_active &&
> - (use_eager_fpu() || new_fpu->counter > 5);
> -
> - if (preload) {
> - prefetch(&new_fpu->state);
> - new_fpu->counter++;
> - __fpregs_activate(new_fpu);
> - trace_x86_fpu_regs_activated(new_fpu);
> -
> - /* Don't change CR0.TS if we just switch! */
> - if (!__this_cpu_read(fpu_active)) {
> - __fpregs_activate_hw();
> - __this_cpu_write(fpu_active, true);
> - }
> -
> - copy_kernel_to_fpregs(&new_fpu->state);
> - } else if (__this_cpu_read(fpu_active)) {
> - __this_cpu_write(fpu_active, false);
> - __fpregs_deactivate_hw();
> - }
> + set_thread_flag(TIF_LOAD_FPU);
> }
I can imagine this causing problems with kernel code that accesses
current's FPU state, e.g. get_xsave_field_ptr(). I wonder if it would
make sense to make your changes deeper into the FPU core innards so
that, for example, we'd have explicit functions that cause the
in-memory state for current to be up-to-date and readable, to cause
the in-memory state to be up-to-date and writable (which is the same
thing + TIF_FPU_LOAD + whatever other bookkeeping), and causing the
in-CPU state to be up-to-date (possibly readable and writable).
TIF_LOAD_FPU would trigger the latter.
I've often found it confusing that fpu__save by itself has somewhat
ill-defined effects.
> +/*
> + * Set up the userspace FPU context before returning to userspace.
> + */
> +void switch_fpu_return(void)
> +{
> + struct fpu *fpu = ¤t->thread.fpu;
> + bool preload;
> + /*
> + * If the task has used the math, pre-load the FPU on xsave processors
> + * or if the past 5 consecutive context-switches used math.
> + */
> + preload = static_cpu_has(X86_FEATURE_FPU) &&
> + fpu->fpstate_active &&
> + (use_eager_fpu() || fpu->counter > 5);
> +
> + if (preload) {
> + prefetch(&fpu->state);
> + fpu->counter++;
> + __fpregs_activate(fpu);
> + trace_x86_fpu_regs_activated(fpu);
> +
> + /* Don't change CR0.TS if we just switch! */
> + if (!__this_cpu_read(fpu_active)) {
> + __fpregs_activate_hw();
> + __this_cpu_write(fpu_active, true);
> + }
We should just finish getting rid of all TS uses.