Re: [PATCH v9 14/26] x86/arch_prctl: Create ARCH_SET_STATE_ENABLE/ARCH_GET_STATE_ENABLE

From: Bae, Chang Seok
Date: Mon Aug 09 2021 - 18:08:29 EST


On Aug 6, 2021, at 09:46, Macieira, Thiago <thiago.macieira@xxxxxxxxx> wrote:
> On Friday, 30 July 2021 07:59:45 PDT Chang S. Bae wrote:
>> + for_each_thread(tsk, t) {
>> + t->thread.fpu.dynamic_state_perm |= req_dynstate_perm;
>> + nr_threads++;
>> + }
>> +
>> + if (nr_threads != tsk->signal->nr_threads) {
>> + for_each_thread(tsk, t)
>> + t->thread.fpu.dynamic_state_perm =
>> old_dynstate_perm;
>> + pr_err("x86/fpu: ARCH_XSTATE_PERM failed
>> as thread number mismatched.\n");
>> + return -EBUSY;
>> + }
>> + return 0;
>> +}

<snip>

> First the simpler one: that EBUSY. It must go and you can do that with a lock.
> Library code cannot ensure that it is running in single-threaded state and
> that no other threads are started or exit while they make the system call.
> There's nothing the library in question can do if it got an EBUSY. Do you want
> me to try again? What if it fails again? What's the state of the dynamically
> permitted states after an EBUSY? It's probably inconsistent. Moreover, there's
> an ABA problem there: what happens if a thread starts and another exits while
> this system call is running? And what happens if two threads are making this
> system call?
> (also, shouldn't tsk->signal->nr_threads be an atomic read?)

I suspect the EBUSY situation is somewhat imaginative. In theory, the
situation might be possible one thread calls this syscall at some point when a
new task is being created -- after task data duplication [1] and before
enlisted [2].

As stated in the changelog, the alternative is possible:
> An alternative implementation would not save the permission bitmap in
> every task. But instead would extend the per-process signal data, and
> that would not be subject to this race.
But it involves quite a bit of code complexity and this is pretty much
backend. I think it is possible to follow up and update when the case ever
turns out to be real. At least, I'm not aware of any report against the
PR_SET_FP_MODE prctl(2) [3] which took the same way -- walk and update the
task list.

Perhaps, the hunk above can be improved to be atomic.

<snip>

> So I have to insist that the XGETBV instruction's result match exactly what is
> permitted to run. That means we either enable AMX unconditionally with no need
> for system calls (with or without XFD trapping to dynamically allocate more
> state), or that the XCR0 register be set without the AMX bits by default,
> until the system call is issued.

XCR0 provokes VMEXIT which will impact the performance hardly. At least the
opt-in model is a consensus out of the long debate [4]. Let alone the question
on how well advertise this new syscall though.

Thanks,
Chang

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n2128
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n2320
[3]: https://man7.org/linux/man-pages/man2/prctl.2.html
[4]: https://lore.kernel.org/lkml/CALCETrW2QHa2TLvnUuVxAAheqcbSZ-5_WRXtDSAGcbG8N+gtdQ@xxxxxxxxxxxxxx/