Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

From: Andy Lutomirski
Date: Mon Oct 03 2016 - 16:54:31 EST


On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel <riel@xxxxxxxxxx> wrote:
> On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 1:31 PM, <riel@xxxxxxxxxx> wrote:
>> >
>> > #define CREATE_TRACE_POINTS
>> > #include <trace/events/syscalls.h>
>> > @@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
>> >         if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>> >                 exit_to_usermode_loop(regs, cached_flags);
>> >
>> > +       if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > +               switch_fpu_return();
>> > +
>>
>> How about:
>>
>> if (unlikely(...)) {
>>         exit_to_usermode_loop(regs, cached_flags);
>>         cached_flags = READ_ONCE(ti->flags);
>> }
>>
>> if (ti->flags & _TIF_LOAD_FPU) {
>>         clear_thread_flag(TIF_LOAD_FPU);
>>         switch_fpu_return();
>> }
>
> It looks like test_thread_flag should be fine for avoiding
> atomics, too?
>
> if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>         clear_thread_flag(TIF_LOAD_FPU);
>         switch_fpu_return();
> }

True, but we've got that cached_flags variable already...
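
Something like this is what I had in mind (just a sketch against the
prepare_exit_to_usermode() in this series, untested):

        cached_flags = READ_ONCE(ti->flags);

        if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS)) {
                exit_to_usermode_loop(regs, cached_flags);
                /* the loop may have changed the work bits, so re-read */
                cached_flags = READ_ONCE(ti->flags);
        }

        if (cached_flags & _TIF_LOAD_FPU) {
                clear_thread_flag(TIF_LOAD_FPU);
                switch_fpu_return();
        }

That keeps the FPU work out of the slow-path loop and avoids the atomic
test_and_clear on every exit.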

>
>
>> > +static inline void switch_fpu_finish(void)
>> >  {
>> > +       set_thread_flag(TIF_LOAD_FPU);
>> >  }
>>
>> I can imagine this causing problems with kernel code that accesses
>> current's FPU state, e.g. get_xsave_field_ptr(). I wonder if it
>> would
>> make sense to make your changes deeper into the FPU core innards so
>> that, for example, we'd have explicit functions that cause the
>
> Whereabouts do you have in mind?
>
> I like your general idea, but am not sure quite where
> to put it.

Somewhere in arch/x86/kernel/fpu/ perhaps?

One of my objections to the current code structure is that I find it
quite hard to keep track of all of the possible states of the fpu that
we have. Off the top of my head, we have:

1. In-cpu state is valid and in-memory state is invalid. (This is the
case when we're running user code, for example.)
2. In-memory state is valid.
2a. In-cpu state is *also* valid for some particular task. In this
state, no one may write to either FPU state copy for that task lest
they get out of sync.
2b. In-cpu state is not valid.
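
If it helps to see that spelled out, purely as an illustration (not
something I'm proposing we actually add):

        /* Illustration only -- the names here are made up for this email. */
        enum fpu_state_location {
                FPU_CPU_VALID_ONLY,     /* 1:  registers valid, memory stale           */
                FPU_BOTH_VALID,         /* 2a: memory and registers valid (read-only)  */
                FPU_MEMORY_VALID_ONLY,  /* 2b: memory valid, registers stale           */
        };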

Rik, your patches further complicate this by making it possible to
have the in-cpu state be valid for a non-current task even when
non-idle. This is why I'm thinking that we should maybe revisit the
API a bit to make this clearer and to avoid scattering around all the
state manipulation quite so much. I propose, as a straw-man:

user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for current.
After calling this, we're in state 1 or 2a.

user_fpregs_move_to_cpu() -- Makes the CPU state be valid and writable
for current. Invalidates any in-memory copy. After calling this,
we're in state 1.

user_fpregs_copy_to_memory() -- Makes the in-memory state be valid for
current. After calling this, we're in state 2a or 2b.

user_fpregs_move_to_memory() -- Makes the in-memory state be valid and
writable for current. After calling this, we're in state 2b.

The xxx_to_cpu() functions will make sure that, if the cpu state was
valid for a different task, it stops being valid for that different
task.

Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
and PKRU memory state readers. MPX and PKRU will do
user_fpregs_move_to_memory() before writing to memory.
kernel_fpu_begin() does user_fpregs_move_to_memory().
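
To pin the straw-man down a little more, the declarations might look
something like this (names as above, everything else is just a sketch):

        /* CPU copy becomes valid for current; afterwards we're in state 1 or 2a. */
        void user_fpregs_copy_to_cpu(void);

        /* CPU copy valid and writable, in-memory copy invalidated; state 1. */
        void user_fpregs_move_to_cpu(void);

        /* In-memory copy becomes valid for current; afterwards state 2a or 2b. */
        void user_fpregs_copy_to_memory(void);

        /* In-memory copy valid and writable, CPU copy invalidated; state 2b. */
        void user_fpregs_move_to_memory(void);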

There are a couple ways we could deal with TIF_LOAD_FPU in this
scheme. My instinct would be to rename it TIF_REFRESH_FPU, set it
unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
and make the handler do user_fpregs_move_to_cpu(). This avoids
needing to deal with complications in which we're in state 2a and we
land in actual user code and then forget that the in-memory copy is
out of date. It also means we get TIF_REFRESH_FPU for free for
newly-forked threads.
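
In code, that would be roughly (again just a sketch, using the renamed
bit):

        /* Schedule-in and the user_fpregs_*_to_memory() helpers both do: */
        set_thread_flag(TIF_REFRESH_FPU);

        /* ...and the exit-to-usermode path does: */
        if (cached_flags & _TIF_REFRESH_FPU) {
                clear_thread_flag(TIF_REFRESH_FPU);
                user_fpregs_move_to_cpu();      /* back to state 1 */
        }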

>> in-memory state for current to be up-to-date and readable, to cause
>> the in-memory state to be up-to-date and writable (which is the same
>> thing + TIF_LOAD_FPU + whatever other bookkeeping), and causing the
>> in-CPU state to be up-to-date (possibly readable and writable).
>> TIF_LOAD_FPU would trigger the latter.
>>
>> I've often found it confusing that fpu__save by itself has somewhat
>> ill-defined effects.
>
> Fully agreed on both. I guess that means we want a few
> different functions:
>
> 1) initialize FPU state, both in memory and in registers
>
> 2) cause FPU state in registers to be updated from memory,
> if necessary
>
> 3) cause FPU state in memory to be updated from registers,
> if necessary
>
> The latter functions could use fpu->fpregs_active to see
> whether any action is required, and be called from various
> places in the code.
>
> The signal code in particular is doing some strange things
> that should probably be hidden away in FPU specific functions.
>
>>
>> >
>> > +/*
>> > + * Set up the userspace FPU context before returning to userspace.
>> > + */
>> > +void switch_fpu_return(void)
>> > +{
>> > +       struct fpu *fpu = &current->thread.fpu;
>> > +       bool preload;
>> > +       /*
>> > +        * If the task has used the math, pre-load the FPU on xsave processors
>> > +        * or if the past 5 consecutive context-switches used math.
>> > +        */
>> > +       preload = static_cpu_has(X86_FEATURE_FPU) &&
>> > +                 fpu->fpstate_active &&
>> > +                 (use_eager_fpu() || fpu->counter > 5);
>> > +
>> > +       if (preload) {
>> > +               prefetch(&fpu->state);
>> > +               fpu->counter++;
>> > +               __fpregs_activate(fpu);
>> > +               trace_x86_fpu_regs_activated(fpu);
>> > +
>> > +               /* Don't change CR0.TS if we just switch! */
>> > +               if (!__this_cpu_read(fpu_active)) {
>> > +                       __fpregs_activate_hw();
>> > +                       __this_cpu_write(fpu_active, true);
>> > +               }
>>
>> We should just finish getting rid of all TS uses.
>
> Agreed. I will rebase on top of your FPU changes, that
> will make things easier for everybody.
>
> --
> All Rights Reversed.



--
Andy Lutomirski
AMA Capital Management, LLC