Re: [PATCH v5 15/28] x86/arch_prctl: Create ARCH_GET_XSTATE/ARCH_PUT_XSTATE

From: Len Brown
Date: Tue May 25 2021 - 20:38:32 EST


After today's discussion, I believe we are close to consensus on this plan:

1. Kernel sets XCR0.AMX=1 at boot, and leaves it set, always.

2. Kernel arms XFD for all tasks, by default.

3. New prctl() system call allows any task in a process to ask for AMX
permission for all tasks in that process. Permission is granted for
the lifetime of that process, and there is no interface for a process
to "un-request" permission.

4. If a task touches AMX without permission, #NM/signal kills the process.

5. If a task touches AMX with permission, #NM handler will
transparently allocate a context switch buffer, and disarm XFD for
that task. (MSR_XFD is context switched per task)

6. If the #NM handler cannot allocate the 8KB buffer, the task will
receive a signal at the instruction that took the #NM fault, likely
resulting in program exit.

7. In addition, a 2nd system call to request that buffers be
pre-allocated is available. This is a per task system call. This
synchronous allocate system call will return an error code if it
fails, which will also likely result in program exit.

8. NEW TODAY: Linux will exclude the AMX 8KB region from the XSTATE on
the signal stack for tasks in processes that do not have AMX permission.

9. For tasks in processes that have requested and received AMX
permission, Linux will XSAVE/XRSTOR directly to/from the signal
stack, and the stack will always include the 8KB space for AMX. (We do
have a manual optimization in place to skip writing zeros to the
stack frame if AMX=INIT.)

10. Linux reserves the right to plumb the new permission syscall into
cgroup administrative interface in the future.
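
For illustration, a permission request (item 3) from user space might
look like the sketch below. It assumes the arch_prctl() option
ARCH_REQ_XCOMP_PERM and the feature number XFEATURE_XTILEDATA as they
later appeared in mainline Linux; the interface was not settled at the
time of this mail, and request_amx_permission() is a hypothetical
helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values, taken from the interface that later landed upstream. */
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif
#ifndef XFEATURE_XTILEDATA
#define XFEATURE_XTILEDATA 18
#endif

/*
 * Hypothetical helper: ask for AMX permission on behalf of the whole
 * process. Returns 0 on success or a negative errno value on failure
 * (old kernel, no AMX hardware, or not an x86 build).
 */
static int request_amx_permission(void)
{
#ifdef SYS_arch_prctl
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) == 0)
        return 0;
    return -errno;
#else
    return -ENOSYS;
#endif
}
```

A program would call this once, before any thread executes an AMX
instruction, and fall back to a non-AMX code path on a negative return.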

Comments:

Legacy software will not see signal stack growth on AMX hardware.

New AMX software will see AMX state on the signal stack.

If new AMX software uses an alternative signal stack, it should be
built using the signal.h ABI in glibc 2.34 or later, so that it can
calculate the appropriate size for the current hardware. Note that
non-AMX software that is newly built will get the same answer from the
ABI, which also covers the case where it later does use AMX.
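
A minimal sketch of sizing an alternate signal stack at run time
rather than from a compile-time constant. It assumes glibc's
sysconf(_SC_SIGSTKSZ), which is available from glibc 2.34, and falls
back to SIGSTKSZ on older libraries; altstack_size() and
install_altstack() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Pick an alternate-stack size suitable for the running machine. */
static size_t altstack_size(void)
{
    long sz = -1;
#ifdef _SC_SIGSTKSZ
    sz = sysconf(_SC_SIGSTKSZ);   /* glibc 2.34+: value computed at run time */
#endif
    if (sz <= 0)
        sz = SIGSTKSZ;            /* fallback for older libraries */
    return (size_t)sz;
}

/* Allocate and register the alternate stack; 0 on success, -1 on error. */
static int install_altstack(void)
{
    stack_t ss;

    ss.ss_size = altstack_size();
    ss.ss_sp = malloc(ss.ss_size);
    if (ss.ss_sp == NULL)
        return -1;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }
    return 0;
}
```

The point of the design is that the size is queried on the machine
where the program runs, so the same binary gets a larger stack on
hardware with larger XSTATE.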

Today it is possible for an application to calculate the uncompressed
XSTATE size from XCR0 and CPUID, allocate buffers of that size, and
use XSAVE and XRSTOR on those buffers in user-space. Applications
can also XRSTOR from (and XSAVE back to) the signal stack, if they
choose. Now, this capability can break for non-AMX programs, because
their XSAVE will be 8KB larger than the buffer that they XRSTOR.
Andy L questions whether such applications actually exist, and Thomas
states that even if they do, that is a much smaller problem than 8KB
signal stack growth would be for legacy applications.
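
The size calculation described above can be sketched with CPUID leaf
0xD: sub-leaf 0 reports in EBX the uncompressed save-area size for the
features currently enabled in XCR0, and sub-leaf 18 reports in EAX the
size of the AMX tile data component. The helper names are
hypothetical, and the code returns 0 when the leaf is unavailable or
the build is not x86.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

#define XFEATURE_XTILEDATA 18   /* AMX tile data state component */

/* Uncompressed XSAVE area size for the features enabled in XCR0. */
static uint32_t xstate_size_for_xcr0(void)
{
#ifdef HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx))
        return ebx;
#endif
    return 0;
}

/* Size of the AMX tile data component, or 0 if the CPU lacks it. */
static uint32_t xstate_tiledata_size(void)
{
#ifdef HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_count(0xD, XFEATURE_XTILEDATA, &eax, &ebx, &ecx, &edx))
        return eax;
#endif
    return 0;
}
```

Because XCR0.AMX is always set under this plan, the first value
includes the 8KB tile data even in a process without AMX permission,
while that process's signal frame does not; that difference is the
mismatch discussed above.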

Unclear if we have consensus on the need for a synchronous allocation
system call (#7 above). Observe that this system call does not
improve the likelihood of failure or the timing of failure. An
#NM-based allocation can be done at exactly the same spot by simply
touching a TMM register. The benefit of this system call is that it
returns an error code to the caller, versus the program being
delivered a SIGSEGV at the offending instruction pointer. Both will
likely result in the program exiting, and at the same point in
execution.
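
The difference between the two paths can be modeled in a few lines of
user-space C. This is a toy model of the plan as described in this
mail, not kernel code: all names are hypothetical, the allocator is
passed in so that failure can be simulated, and the -EPERM return for
a missing permission is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define AMX_BUF_SIZE 8192

enum nm_outcome { NM_RESUME, NM_SIGNAL };

struct task_model {
    bool  perm;       /* process has AMX permission (item 3) */
    bool  xfd_armed;  /* XFD armed by default (item 2) */
    void *buf;        /* context switch buffer, NULL until allocated */
};

typedef void *(*alloc_fn)(size_t);

/* Item 7: synchronous allocation, failure reported as an error code. */
static int model_prealloc(struct task_model *t, alloc_fn alloc)
{
    if (!t->perm)
        return -EPERM;              /* assumed error value */
    if (t->buf == NULL) {
        t->buf = alloc(AMX_BUF_SIZE);
        if (t->buf == NULL)
            return -ENOMEM;
    }
    return 0;
}

/* Items 4-6: first AMX touch, failure reported as a signal. */
static enum nm_outcome model_nm_fault(struct task_model *t, alloc_fn alloc)
{
    if (!t->perm)
        return NM_SIGNAL;           /* item 4 */
    if (t->buf == NULL) {
        t->buf = alloc(AMX_BUF_SIZE);
        if (t->buf == NULL)
            return NM_SIGNAL;       /* item 6 */
    }
    t->xfd_armed = false;           /* item 5 */
    return NM_RESUME;
}

/* Allocator that always fails, to simulate memory exhaustion. */
static void *failing_alloc(size_t n)
{
    (void)n;
    return NULL;
}
```

With failing_alloc, both paths fail at the same point in execution;
only the reporting differs (-ENOMEM versus NM_SIGNAL).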

A future mechanism to lazily harvest not-recently-used context switch
buffers has been discussed. E.g., the kernel under low memory could
re-arm XFD for all AMX tasks, and if their buffers are clean, free
them. If that mechanism is implemented, and we also implement the
synchronous allocation system call, that mechanism must respect the
guarantee made by that system call and not harvest
system-call-allocated buffers.
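
A toy model of that constraint, with hypothetical names throughout. As
a simplification it re-arms XFD only for the tasks whose buffers it
actually frees, and it treats "clean" as a flag rather than checking
for AMX=INIT.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct amx_buf_model {
    void *buf;        /* context switch buffer, NULL if none */
    bool  dirty;      /* buffer holds live (non-INIT) AMX state */
    bool  pinned;     /* allocated by the synchronous system call */
    bool  xfd_armed;  /* next AMX touch will take #NM */
};

/* Free clean, unpinned buffers; return how many were freed. */
static size_t model_harvest(struct amx_buf_model *tasks, size_t n)
{
    size_t freed = 0;

    for (size_t i = 0; i < n; i++) {
        struct amx_buf_model *t = &tasks[i];

        if (t->buf == NULL || t->pinned || t->dirty)
            continue;           /* nothing to free, or must be kept */
        free(t->buf);
        t->buf = NULL;
        t->xfd_armed = true;    /* reallocate on the next #NM */
        freed++;
    }
    return freed;
}
```

Pinned buffers are skipped before any other check, which is what
preserves the guarantee made by the allocation system call.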

Len Brown, Intel Open Source Technology Center