Re: [PATCH v3 07/21] x86/fpu/xstate: Introduce helpers to manage dynamic xstate buffers

From: Bae, Chang Seok
Date: Tue Jan 26 2021 - 22:53:16 EST


On Jan 26, 2021, at 12:17, Borislav Petkov <bp@xxxxxxx> wrote:
> On Wed, Dec 23, 2020 at 07:57:03AM -0800, Chang S. Bae wrote:
>>
>> + /*
>> + * @state_mask:
>> + *
>> + * The state component bitmap. It indicates the saved xstate in
>> + * either @state or @state_ptr. The map value starts to be aligned
>> + * with @state and then with @state_ptr once it is in use.
>
> Are you trying to say here that the mask describes the state saved in
> @state initially and then, when the task is switched to dynamic state,
> it denotes the state in ->state_ptr?

Yes, it is. I will take your sentence in the comment. Thank you.

>> + */
>> + u64 state_mask;
>> +
>> + /*
>> + * @state_ptr:
>> + *
>> + * Copy of all extended register states, in a dynamically allocated
>> + * buffer. When a task is using extended features, the register state
>> + * is always the most current. This state copy is more recent than
>> + * @state. If the task context-switches away, they get saved here,
>> + * representing the xstate.
>
> Calling it a copy here is confusing - you wanna say that when dynamic
> states get used, the state in state_ptr supercedes and invalidates the
> state in @state. AFAIU, at least.

True, it looks better here too.

>> +DEFINE_EVENT(x86_fpu, x86_fpu_xstate_alloc_failed,
>> + TP_PROTO(struct fpu *fpu),
>> + TP_ARGS(fpu)
>> +);
>> +
>
> Huh, what's that for?

This tracepoint can point to the allocation failure even with the NMI handling
failure message only. (You can also check the comment below at the call site.)

>> /*
>> * Although we spell it out in here, the Processor Trace
>> @@ -71,6 +73,7 @@ static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] =
>> static unsigned int xstate_sizes[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
>> static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
>> static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
>> +static bool xstate_aligns[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = false};
>
> What's that for?

The xstate buffer may expand on the fly. The size has to be correctly
calculated if needed. CPUID provides essential information for the
calculation. Instead of reading CPUID repeatedly, store them -- the offset and
size are already stored here. The 64B alignment looks to be missing, so added
here.

>> + /*
>> + * Calculate the size by summing up each state together, since no known
>> + * size found with the xstate buffer format out of the given mask.
>> + */
>
> I barely can imagine what that comment is trying to tell me...

How about:
“With the given mask, no relevant size is found so far. So, calculate it by
summing up each state size."

>> +/* The watched threshold size of dynamically allocated xstate buffer */
>
> Watched?

Maybe:
"When the buffer is more than this size, the current mechanism is
potentially marginal to support the allocations."

>> +#define XSTATE_BUFFER_MAX_BYTES (64 * 1024)
>
> What's that thing for when we have fpu_kernel_xstate_max_size too?

The threshold size is what the current mechanism can comfortably allocate
(maybe at most). The warning is left when the buffer size goes beyond the
threshold. Then, we may need to consider a better allocation mechanism.

>> static int __init init_xstate_size(void)
>> {
>> /* Recompute the context size for enabled features: */
>> @@ -779,6 +830,14 @@ static int __init init_xstate_size(void)
>> if (!is_supported_xstate_size(fpu_kernel_xstate_min_size))
>> return -EINVAL;
>>
>> + /*
>> + * When allocating buffers larger than the threshold, a more sophisticated
>> + * mechanism might be considerable.
>> + */
>> + if (fpu_kernel_xstate_max_size > XSTATE_BUFFER_MAX_BYTES)
>> + pr_warn("x86/fpu: xstate buffer too large (%u > %u)\n",
>> + fpu_kernel_xstate_max_size, XSTATE_BUFFER_MAX_BYTES);
>
> So why doesn't this return an error?

Although a warning is given, vmalloc() may manage to allocate this size. So,
it was not considered a hard hit yet. vmalloc() failure will return an error
later.

>> /*
>> * User space is always in standard format.
>> */
>> @@ -869,6 +928,9 @@ void __init fpu__init_system_xstate(void)
>> if (err)
>> goto out_disable;
>>
>> + /* Make sure init_task does not include the dynamic user states */
>> + current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
>
> xfeatures_mask_user_dynamic just got set to 0 a couple of lines above...

Well, it will have some values when the piece in place to support the dynamic
user state. PATCH13 has this change there:

+ for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+ u64 feature_mask = BIT_ULL(i);
+
+ if (!(xfeatures_mask_user() & feature_mask))
+ continue;
+
+ if (xfeature_disable_supported(i))
+ xfeatures_mask_user_dynamic |= feature_mask;
+ }

>> +/*
>> + * Allocate an xstate buffer with the size calculated based on 'mask'.
>> + *
>> + * The allocation mechanism does not shrink or reclaim the buffer.
>> + */
>> +int alloc_xstate_buffer(struct fpu *fpu, u64 mask)
>> +{
>> + union fpregs_state *state_ptr;
>> + unsigned int oldsz, newsz;
>> + u64 state_mask;
>> +
>> + state_mask = fpu->state_mask | mask;
>> +
>> + oldsz = get_xstate_size(fpu->state_mask);
>> + newsz = get_xstate_size(state_mask);
>> +
>> + if (oldsz >= newsz)
>> + return 0;
>> +
>> + if (newsz > fpu_kernel_xstate_max_size) {
>> + pr_warn_once("x86/fpu: xstate buffer too large (%u > %u bytes)\n",
>> + newsz, fpu_kernel_xstate_max_size);
>> + XSTATE_WARN_ON(1);
>> + return 0;
>
> return 0?!? On an error?!?

Okay, the first question is whether this is an error. Well, with such too-much
size though, the buffer can still store the states. So, give a warning at
least. Perhaps, a similar case is when the calculated size is unmatched with
the CPUID-provided [3]. We give a warning, not an error there, maybe assuming
the calculated is larger.

But if it should be considered an error, maybe return -EINVAL.

>> + }
>> +
>> + /* We need 64B aligned pointer, but vmalloc() returns a page-aligned address. */
>
> So this comment is useless, basically...

Okay, removed.

>> + state_ptr = vmalloc(newsz);
>> + if (!state_ptr) {
>> + trace_x86_fpu_xstate_alloc_failed(fpu);
>
> WTH is that tracepoint here for?

While it returns an error, this function can be on the path of NMI handling.
Then, likely only with the “unexpected #NM exception” message. So, logging a
tracepoint can provide evidence of the allocation failure in that case.

The comments on v1 [1][2] were received as such change.

>> + return -ENOMEM;
>> + }
>> +
>> + memset(state_ptr, 0, newsz);
>
> So vzalloc() above?

Yes, I think it is better to use vzalloc() here.

> I must be missing something here but where's the logic that decides
> between the static and dynamic buffer? Later patches?
>
> I have to admit I've yet to see how the "switching" between static and
> dynamic state happens…

PATCH9 introduces a wrapper that determines which to take. It simply returns
state_ptr when not a null pointer. So, the logic is to use the dynamic buffer
when available.

Thanks,
Chang

[1] https://lore.kernel.org/lkml/69721125-4e1c-ca9c-ff59-8e1331933e6c@xxxxxxxxx/#t
[2] https://lore.kernel.org/lkml/20201014104148.GD2628@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n657