RE: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

From: Brown, Len
Date: Tue Oct 13 2020 - 18:34:48 EST


> From: Andy Lutomirski <luto@xxxxxxxxxx>

> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
..
> > +bool xfirstuse_event_handler(struct fpu *fpu)
> > +{
> > + bool handled = false;
> > + u64 event_mask;
> > +
> > + /* Check whether the first-use detection is running. */
> > + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > + return handled;
> > +

> MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
> some helper called farther down the stack

xfirstuse_event_handler() is called directly from the IDTENTRY exc_device_not_available():

> > @@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
> > {
> > unsigned long cr0 = read_cr0();
> >
> > + if (xfirstuse_event_handler(&current->thread.fpu))
> > + return;

Are you suggesting we should instead open code it inside that routine?

> But this raises an interesting point -- what happens if allocation
> fails? I think that, from kernel code, we simply cannot support this
> exception mechanism. If kernel code wants to use AMX (and that would
> be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
> handle errors, not rely on exceptions. From user code, I assume we
> send a signal if allocation fails.

The XFD feature allows us to transparently expand the kernel context switch buffer
for a user task, when that task first touches this associated hardware.
It allows applications to operate as if the kernel had allocated the backing AMX
context switch buffer at initialization time. However, since we expect only
a sub-set of tasks to actually use AMX, we instead defer allocation until
we actually see first use for that task, rather than allocating for all tasks.

While we currently don't propose AMX use inside the kernel, it is conceivable
that could be done in the future, just like AVX is used by the RAID code;
and it would be done the same way -- kernel_fpu_begin()/kernel_fpu_end().
Such future Kernel AMX use would _not_ arm XFD, and would _not_ provoke this fault.
(note that we context switch the XFD-armed state per task)

vmalloc() does not fail, and does not return an error, and so there is no concept
of returning a signal. If we got to the point where vmalloc() sleeps, then the system
has bigger OOM issues, and the OOM killer would be on the prowl.

If we were concerned about using vmalloc for a couple of pages in the task structure,
Then we could implement a routine to harvest unused buffers and free them --
but that didn't seem worth the complexity. Note that this feature is 64-bit only.

Thanks,
-Len