Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

From: Andy Lutomirski
Date: Sat Mar 20 2021 - 18:28:18 EST


On Sat, Mar 20, 2021 at 3:13 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> > +
> > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
> > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> > +{
> > +	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > +		return;
> > +
> > +	if (unlikely(prev->state_mask != next->state_mask))
> > +		xdisable_setbits(xfirstuse_not_detected(next));
> > +}
>
> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
> when it does not match. The spec document says:
>
> "System software may disable use of Intel AMX by clearing XCR0[18:17], by
> clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
> system software initialize AMX state (e.g., by executing TILERELEASE)
> before doing so. This is because maintaining AMX state in a
> non-initialized state may have negative power and performance
> implications."
>
> I'm not seeing anything related to this. Is this a recommendation
> which can be ignored or is that going to be duct taped into the code
> base once the first user complains about slowdowns of their non AMX
> workloads on that machine?

I have an obnoxious question: do we really want to use the XFD mechanism?

Right now, glibc, and hence most user space code, blindly uses
whatever random CPU features are present for no particularly good
reason, which means that all these features get stuck in the XINUSE=1
state, even if there is no code whatsoever in the process that
benefits. AVX512 is bad enough as we're seeing right now. AMX will
be much worse if this happens.

We *could* instead use XCR0 and require an actual syscall to enable
it. We could even then play games like requiring whoever enables the
feature to allocate memory for the state save area for signals, and
signal delivery could save the state and disable the feature, thus
preventing the signal frame from blowing up to 8 or 12 or who knows
how many kB.
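
To make the idea concrete, here is a rough user-space model of that
opt-in scheme. It is only a sketch under stated assumptions: struct
task_model, model_enable_amx() and model_deliver_signal() are
hypothetical names invented for illustration, not existing kernel
interfaces, and nothing here touches the real XCR0. The only detail
taken from the spec text quoted above is that AMX corresponds to
XCR0[18:17].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AMX state is governed by XCR0 bits 17 and 18 (per the quoted spec text). */
#define XFEATURE_MASK_AMX ((1ULL << 17) | (1ULL << 18))

/* Hypothetical per-task bookkeeping; not a real kernel structure. */
struct task_model {
	uint64_t xcr0;		/* modeled set of features the task may use */
	void *sig_save_area;	/* caller-provided buffer for signal-time state */
	size_t sig_save_size;
};

/*
 * Hypothetical opt-in call: the feature is only enabled if the caller
 * hands over a buffer of at least 'needed' bytes for the save area.
 * Returns 0 on success, -1 if no adequate buffer was supplied.
 */
static int model_enable_amx(struct task_model *t, void *buf, size_t size,
			    size_t needed)
{
	assert(t != NULL);

	if (buf == NULL || size < needed)
		return -1;

	t->sig_save_area = buf;
	t->sig_save_size = size;
	t->xcr0 |= XFEATURE_MASK_AMX;
	return 0;
}

/*
 * Modeled signal delivery: the large state would be saved into the side
 * buffer (not done here) and the feature disabled for the handler, so
 * the signal frame keeps its base size. Returns the resulting frame size.
 */
static size_t model_deliver_signal(struct task_model *t, size_t base_frame)
{
	assert(t != NULL);

	if (t->xcr0 & XFEATURE_MASK_AMX) {
		/* A real implementation would save state to sig_save_area. */
		t->xcr0 &= ~XFEATURE_MASK_AMX;
	}
	return base_frame;
}
```

In this model a task that never calls the enable function never has
the AMX bits set, and a task that did enable it still gets a signal
frame of the base size, because the state lives in the buffer it
supplied up front.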

--Andy