Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

From: Andy Lutomirski
Date: Mon Oct 03 2016 - 17:36:41 EST


On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel <riel@xxxxxxxxxx> wrote:
> On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel <riel@xxxxxxxxxx> wrote:
>> >
>> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> > >
>> > > On Sat, Oct 1, 2016 at 1:31 PM, <riel@xxxxxxxxxx> wrote:
>> > > >
>> > > >
>> > > > #define CREATE_TRACE_POINTS
>> > > > #include <trace/events/syscalls.h>
>> > > > @@ -197,6 +198,9 @@ __visible inline void
>> > > > prepare_exit_to_usermode(struct pt_regs *regs)
>> > > > if (unlikely(cached_flags &
>> > > > EXIT_TO_USERMODE_LOOP_FLAGS))
>> > > > exit_to_usermode_loop(regs, cached_flags);
>> > > >
>> > > > + if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > > > + switch_fpu_return();
>> > > > +
>> > > How about:
>> > >
>> > > if (unlikely(...)) {
>> > > exit_to_usermode_loop(regs, cached_flags);
>> > > cached_flags = READ_ONCE(ti->flags);

Like this?

>> > > }
>> > >
>> > > if (ti->flags & _TIF_LOAD_FPU) {
>> > > clear_thread_flag(TIF_LOAD_FPU);
>> > > switch_fpu_return();
>> > > }
>> > It looks like test_thread_flag should be fine for avoiding
>> > atomics, too?
>> >
>> > if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>> > clear_thread_flag(TIF_LOAD_FPU);
>> > switch_fpu_return();
>> > }
>> True, but we've got that cached_flags variable already...
>
> The cached flag may no longer be valid after
> exit_to_usermode_loop() returns, because that
> code is preemptible.
>
> We have to reload a fresh ti->flags after
> preemption is disabled and irqs are off.

See above...

>
>> One of my objections to the current code structure is that I find it
>> quite hard to keep track of all of the possible states of the fpu
>> that
>> we have. Off the top of my head, we have:
>>
>> 1. In-cpu state is valid and in-memory state is invalid. (This is
>> the
>> case when we're running user code, for example.)
>> 2. In-memory state is valid.
>> 2a. In-cpu state is *also* valid for some particular task. In this
>> state, no one may write to either FPU state copy for that task lest
>> they get out of sync.
>> 2b. In-cpu state is not valid.
>
> It gets more fun with ptrace. Look at xfpregs_get
> and xfpregs_set, which call fpu__activate_fpstate_read
> and fpu__activate_fpstate_write respectively.
>
> It looks like those functions already deal with the
> cases you outline above.

Hmm, maybe they could be used. The names are IMO atrocious. I had to
read the docs and the code to figure out WTF "activating" the
"fpstate" even meant.

>
>> Rik, your patches further complicate this by making it possible to
>> have the in-cpu state be valid for a non-current task even when
>> non-idle.
>
> However, we never need to examine the in-cpu state for
> a task that is not currently running, or being scheduled
> to in the context switch code.
>
> We always save the FPU register state when a task is
> context switched out, so for a non-current task, we
> still only need to consider whether the in-memory state
> is valid or not.
>
> The in-cpu state does not matter.
>
> fpu__activate_fpstate_write will disable lazy restore,
> and ensure a full restore from memory.

Something has to fix up owner_ctx so that the previous thread doesn't
think its CPU state is still valid. I don't think
activate_fpstate_write does it, because nothing should (I think) be
calling it as a mattor of course on context switch.

>
>> This is why I'm thinking that we should maybe revisit the
>> API a bit to make this clearer and to avoid scattering around all the
>> state manipulation quite so much. I propose, as a straw-man:
>>
>> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
>> current.
>> After calling this, we're in state 1 or 2a.
>>
>> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
>> writable
>> for current. Invalidates any in-memory copy. After calling this,
>> we're in state 1.
>
> How can we distinguish between these two?
>
> If the task is running, it can change its FPU
> registers at any time, without any obvious thing
> to mark the transition from state 2a to state 1.

That's true now, but I don't think it's really true any more with your
patches. If a task is running in *kernel* mode, then I think that
pretty much all of the combinations are possible. When the task
actually enters user mode, we have to make sure we're in state 1 so
the CPU regs belong to the task and are writable.

>
> It sounds like the first function you name would
> only be useful if we prevented a switch to user
> space, which could immediately do whatever it wants
> with the registers.
>
> Is there a use case for user_fpregs_copy_to_cpu()?

With PKRU, any user memory access would need
user_fpregs_copy_to_cpu(). I think we should get rid of this and just
switch to using RDPKRU and WRPKRU to eagerly update PKRU in every
context switch to avoid this. Other than that, I can't think of any
use cases off the top of my head unless the signal code wanted to play
games with this stuff.

>
>> user_fpregs_copy_to_memory() -- Makes the in-memory state be valid
>> for
>> current. After calling this, we're in state 2a or 2b.
>>
>> user_fpregs_move_to_memory() -- Makes the in-memory state be valid
>> and
>> writable for current. After calling this, we're in state 2b.
>
> We can only ensure that the in memory copy stays
> valid, by stopping the task from writing to the
> in-CPU copy.
>
> This sounds a lot like fpu__activate_fpstate_read
>
> What we need after copying the contents to memory is
> a way to invalidate the in-CPU copy, if changes were
> made to the in-memory copy.

My proposal is that we invalidate the CPU state *before* writing the
in-memory state, not after.

>
> This is already done by fpu__activate_fpstate_write
>
>> Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
>> and PKRU memory state readers. MPX and PKRU will do
>> user_fpregs_move_to_memory() before writing to memory.
>> kernel_fpu_begin() does user_fpregs_move_to_memory().
>
> This is done by switch_fpu_prepare.
>
> Can you think of any other place in the kernel where we
> need a 1 -> 2a transition, besides context switching and
> ptrace/xfpregs_get?

Anything else that tries to read task xstate from memory, i.e. MPX and
PKRU. (Although if we switch to eager-switched PKRU, then PKRU stops
mattering for this purpose.)

Actually, I don't see any way your patches can be compatible with PKRU
without switching to eager-switched PKRU.

>
>> There are a couple ways we could deal with TIF_LOAD_FPU in this
>> scheme. My instinct would be to rename it TIF_REFRESH_FPU, set it
>> unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
>> and make the handler do user_fpregs_move_to_cpu().
>
> I can see where you are coming from, but given that
> there is no way to run a task besides context switching
> to it, I am not sure why we would need to do anything
> besides disabling the lazy restore (indicating that the
> in-CPU state is stale) anywhere else?
>
> If user_fpregs_move_to_cpu determines that the in-register
> state is still valid, it can skip the restore.
>
> This is what switch_fpu_finish does after my first patch,
> and switch_fpu_prepare later in my series.

I think my main objection is that there are too many functions like
this outside the fpu/ directory that all interact in complicated ways.
I think that activate_fpststate_read/write are along the right lines,
but we should start using them and their friends everywhere.