Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

From: Len Brown
Date: Mon Mar 29 2021 - 18:39:03 EST


On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
>
> > On Mar 29, 2021, at 8:47 AM, Len Brown <lenb@xxxxxxxxxx> wrote:
> >
> > On Sat, Mar 27, 2021 at 5:58 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> >>> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote:
> >>> Hi Andy,
> >>> Say a mainline links with a math library that uses AMX without the
> >>> knowledge of the mainline.
> >
> > sorry for the confusion.
> >
> > mainline = main().
> >
> > i.e., the part of the program written by you, and not the library you linked with.
> >
> > In particular, the library may use instructions that main() doesn't know exist.
>
> If we pretend for a bit that AMX were a separate device instead of a part of the CPU, this would be a no brainer: something would be responsible for opening a device node or otherwise requesting access to the device.
>
> Real AMX isn’t so different. Programs acquire access either by syscall or by a fault, they use it, and (hopefully) they release it again using TILERELEASE. The only thing special about it is that, supposedly, acquiring and releasing access (at least after the first time) is quite fast. But holding access is *not* free — despite all your assertions to the contrary, the kernel *will* correctly context switch it to avoid blowing up power consumption, and this will have overhead.
>
> We’ve seen the pattern of programs thinking that, just because something is a CPU insn, it’s free and no thought is needed before using it. This happened with AVX and AVX512, and it will happen again with AMX. We *still* have a significant performance regression in the kernel due to screwing up the AVX state machine, and the only way I know about any of the details is that I wrote silly test programs to try to reverse engineer the nonsensical behavior of the CPUs.
>
> I might believe that Intel has figured out how to make a well behaved XSTATE feature after Intel demonstrates at least once that it’s possible. That means full documentation of all the weird issues, no new special cases, and the feature actually making sense in the context of XSTATE. This has not happened. Let’s list all of them:
>
> - SSE. Look for all the MXCSR special cases in the pseudocode and tell me with a straight face that this one works sensibly.
>
> - AVX. Also has special cases in the pseudocode. And has transition issues that are still problems and still not fully documented.
>
> - AVX2. Horrible undocumented performance issues. Otherwise maybe okay?
>
> - MPX: maybe the best example, but the compat mode part got flubbed and it’s MPX.
>
> - PKRU: Should never have been in XSTATE. (Also, having WRPKRU in the ISA was a major mistake, now unfixable, that seriously limits the usefulness of the whole feature. I suppose Intel could release PKRU2 with a better ISA and deprecate the original PKRU, but I’m not holding my breath.)
>
> - AVX512: Yet more uarch-dependent horrible performance issues, and Intel has still not responded about documentation. The web is full of people speculating differently about when, exactly, using AVX512 breaks performance. This is NAKked in kernel until docs arrive. Also, it broke old user programs. If we had noticed a few years ago, AVX512 enablement would have been reverted.
>
> - AMX: This mess.
>
> The current system of automatic user enablement does not work. We need something better.

Hi Andy,

Can you provide a concise definition of the exact problem(s) this thread
is attempting to address?

Thanks ahead of time for excluding "blow up power consumption",
since that paranoia is not grounded in fact.

thanks,
-Len