Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

From: Dave Hansen
Date: Mon May 03 2021 - 10:14:52 EST


On 5/3/21 6:47 AM, Florian Weimer wrote:
> * Dave Hansen:
>
>> On 5/2/21 10:18 PM, Florian Weimer wrote:
>>>> 5. If the feature is enabled in XCR0, the user happily uses it.
>>>>
>>>> For AMX, Linux implements "transparent first use"
>>>> so that it doesn't have to allocate 8KB context switch
>>>> buffers for tasks that don't actually use AMX.
>>>> It does this by arming XFD for all tasks, and taking a #NM
>>>> to allocate a context switch buffer only for those tasks
>>>> that actually execute AMX instructions.
>>> What happens if the kernel cannot allocate that additional context
>>> switch buffer?
>> Well, it's vmalloc()'d and currently smaller than the kernel stack,
>> which is also vmalloc()'d. While it can theoretically fail, if it
>> happens you have bigger problems on your hands.
> Not sure if I understand.
>
> Is your position that the kernel should terminate processes if it runs
> out of memory instead of reporting proper errors, even if memory overcommit
> is disabled?

I assume you mean sysctl vm.overcommit_memory=2 by "overcommit is disabled"?
The kernel's sysctl documentation describes that setting as follows:

> When this flag is 2, the kernel uses a "never overcommit"
> policy that attempts to prevent any overcommit of memory.
> Note that user_reserve_kbytes affects this policy.

Note the "attempts".

So, no, the kernel should not be terminating processes when it runs out
of memory. It *attempts* not to do that. What you are seeing here with
a demand-based XSAVE buffer allocation driven by a #NM fault is the
*addition* of a case where those attempts can fail, not the creation of
the first one.

The addition of this case doesn't bother me because I don't think it
will ultimately be visible to end users.

If I'm wrong, and our HPC friends who are so enamored with
"vm.overcommit_memory=2" end up seeing lots of SIGSEGVs where they would
rather see syscall failures, there's an easy solution: disable first-use
detection. Stop dynamically allocating XSAVE buffers on faults.
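
To make the difference between the two policies concrete, here is a rough
userspace model. It is only a sketch, not kernel code: every name in it
(model_task, model_task_create, model_first_use, the allocator hooks) is
made up for illustration, and the only number taken from this thread is the
8KB buffer size. The point it tries to show is where an allocation failure
can be reported. With eager allocation the failure happens at task creation
and can come back as -ENOMEM. With first-use detection the failure happens
in the fault path, where there is no syscall to return an error from, so
the only thing left is a signal.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define MODEL_AMX_STATE_SIZE 8192 /* the "8KB context switch buffer" */

enum model_result {
    MODEL_OK = 0,        /* the AMX instruction can proceed */
    MODEL_FATAL_SIGNAL   /* nothing to return an error to; task gets a signal */
};

struct model_task {
    void *amx_buf;       /* NULL until a buffer has been allocated */
};

/* Allocator hook so a caller can simulate memory exhaustion. */
typedef void *(*model_alloc_fn)(size_t size);

static void *model_alloc_ok(size_t size)   { return calloc(1, size); }
static void *model_alloc_fail(size_t size) { (void)size; return NULL; }

/*
 * Task creation. With eager == true the buffer is allocated up front,
 * so running out of memory is reported as an ordinary error code.
 * With eager == false nothing is allocated here (first-use detection).
 */
static int model_task_create(struct model_task *t, bool eager,
                             model_alloc_fn alloc)
{
    t->amx_buf = NULL;
    if (eager) {
        t->amx_buf = alloc(MODEL_AMX_STATE_SIZE);
        if (!t->amx_buf)
            return -ENOMEM;
    }
    return 0;
}

/*
 * First AMX instruction. If the buffer already exists there is nothing
 * to do. Otherwise this stands in for the #NM handler: it tries to
 * allocate, and on failure can only report a fatal signal.
 */
static enum model_result model_first_use(struct model_task *t,
                                         model_alloc_fn alloc)
{
    if (t->amx_buf)
        return MODEL_OK;
    t->amx_buf = alloc(MODEL_AMX_STATE_SIZE);
    return t->amx_buf ? MODEL_OK : MODEL_FATAL_SIGNAL;
}

static void model_task_destroy(struct model_task *t)
{
    free(t->amx_buf);
    t->amx_buf = NULL;
}
```

Calling model_task_create() with eager set and a failing allocator returns
-ENOMEM, which is the "syscall failure" behavior. Calling it with eager
clear always succeeds, and the failure only shows up later in
model_first_use() as MODEL_FATAL_SIGNAL. The cost of the eager policy is
the one first-use detection was meant to avoid: every task pays for the
buffer whether or not it ever touches AMX.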

Actually, if we don't have a tunable or boot parameter for that now, we
should add one.