[RFC PATCH v2 1/2] x86/fpu: detect AVX task

From: Dave Hansen
Date: Mon Nov 12 2018 - 10:46:29 EST


On 11/11/18 9:38 PM, Li, Aubrey wrote:
> If there is a valid state in the AVX registers, we can say the tasks contains
> AVX instructions, can't we?

XRSTOR, for instance, can take XSAVE state out of the init state, but it
is not necessarily an AVX instruction. In fact, we had a kernel bug
along the way that was accidentally doing this on systems without AVX.

>>> +#define AVX_STATE_DECAY_COUNT 3
>>
>> How was this number chosen? What does this mean?
>>
>> It appears that this is saying that after 3 non-AVX-512-using context
>> switches, the task is not considered to be using AVX512 any more. That
>> seems a bit goofy because the context switch rate is highly dependent on
>> HZ, and on how often the task yields.
>>
>> Do we want this, or do we want something more time-based?
>>
> This counter is introduced here to solve the race of context switch and
> VZEROUPPER. 3 context switches mean the same thread is on-off CPU 3 times.
> Due to scheduling latency, 3 jiffies could only happen AVX task on-off just
> 1 time. So IMHO the context switches number is better here.

Imagine we have a HZ=1000 system where AVX_STATE_DECAY_COUNT=3. That
means that a task can be marked as a non-AVX-512-user after not using it
for ~3 ms. But, with HZ=250, that's ~12ms.

Also, don't forget that we have context switches from the timer
interrupt, but also from normal old operations that sleep.

Let's say our AVX-512 app was doing:

while (foo) {
do_avx_512();
read(pipe, buf, len);
read(pipe, buf, len);
read(pipe, buf, len);
}

And all three pipe reads context-switched the task. That loop could
finish in way under 3HZ, but still end up in do_avx_512() each time with
fpu...avx->state=0.

BTW, I don't have a great solution for this. I was just pointing out
one of the pitfalls from using context switch counts so strictly.

>> I'd really like if this looked like this:
>>
>> if (!have_avx_state_detect()) {
>> avx->state = 0;
>> return;
>> }
>>
>> Then, in have_avx_state_detect(), explain what XGETBV1 does. BTW, I
>> don't think we *totally* need to duplicate the SDM definitions in kernel
>> code for each instruction. It's fine to just say that it set 1 for
>> features not in the init state.
>>
>
> Thomas suggested open this inline since this is only usage of xgetbv1. So I'll
> use static_cpu_has() here directly. Does it sound good?

Yeah, that's OK. But, I'd still limit the comments to what it is doing
instead of defining the instruction.

>>> + /*
>>> + * XINUSE is dynamic to track component state because VZEROUPPER
>>> + * happens on every function end and reset the bitmap to the
>>> + * initial configuration.
>>
>> This is confusing to me because VZEROUPPER is not apparently involved
>> here. What is this trying to say?
>>
> VZERROUPPER instruction in the task resets the bitmap.

I think this is too much detail for a kernel comment. Let's just say
that the task itself can get back to the 'init state'.

>>> + * State decay is introduced to solve the race condition between
>>> + * context switch and a function end. State is aggressively set
>>> + * once it's detected but need to be cleared by decay 3 context
>>> + * switches
>>> + */
>>
>> I'd probably say:
>>
>> AVX512-using processes frequently set AVX512 back to the init
>> state themselves. Thus, this detection mechanism can miss.
>> The decay ensures that false-negatives do not immediately make
>> a task be considered as not using AVX512.
>
> Thanks, will refine it in the next version.
> And yes, AVX512-using processoes set AVX512 back to the init state themselves
> by VZEROUPPER instructions.

... or XRSTOR, or an in-kernel XRSTOR from a signal entry.

The point is, I'd probably not actually mention VZEROUPPER. It's
certainly the common way to init the state, but let's not the *only* way
we have to be concerned with.

>>> + if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) {
>>
>> This is *just* an AVX512 state, right? That isn't reflected in any
>> comments or naming. Also, why just this state, and not any of the other
>> AVX512-related states?
>
> Only this state causes significant frequency drop, other states not.

OK. This is an extremely crucial bit of information to capture,
including the background.

>>> +/*
>>> * Highest level per task FPU state data structure that
>>> * contains the FPU register state plus various FPU
>>> * state fields:
>>> @@ -303,6 +312,14 @@ struct fpu {
>>> unsigned char initialized;
>>>
>>> /*
>>> + * @avx_state:
>>> + *
>>> + * This data structure indicates whether this context
>>> + * contains AVX states
>>> + */
>>
>> Yeah, that's precisely what fpu->state.xsave.xfeatures does. :)
>> I see, will refine in the next version

One other thought about the new 'avx_state':

fxregs_state (which is a part of the XSAVE state) has some padding and
'sw_reserved' areas. You *might* be able to steal some space there.
Not that this is a huge space eater, but why waste the space if we don't
have to?