Re: [PATCH 4/8] x86/traps: Demand-populate PASID MSR via #GP

From: Dave Hansen
Date: Tue Sep 28 2021 - 16:55:54 EST


On 9/28/21 1:28 PM, Luck, Tony wrote:
> On Tue, Sep 28, 2021 at 12:19:22PM -0700, Dave Hansen wrote:
>> On 9/28/21 11:50 AM, Luck, Tony wrote:
>>> On Mon, Sep 27, 2021 at 04:51:25PM -0700, Dave Hansen wrote:
>> ...
>>>> 1. Hide whether we need to write to real registers
>>>> 2. Hide whether we need to update the in-memory image
>>>> 3. Hide other FPU infrastructure like the TIF flag.
>>>> 4. Make the users deal with a *whole* state in the replace API
>>>
>>> Is that difference just whether you need to save the
>>> state from registers to memory (for the "update" case)
>>> or not (for the "replace" case ... where you can ignore
>>> the current register, overwrite the whole per-feature
>>> xsave area and mark it to be restored to registers)?
>>>
>>> If so, just a "bool full" argument might do the trick?
>>
>> I want to be able to hide the complexity of where the old state comes
>> from. It might be in registers or it might be in memory or it might be
>> *neither*. It's possible we're running with stale register state and a
>> current->...->xsave buffer that has (XFEATURES & XFEATURE_FOO) == 0.
>>
>> In that case, the "old" copy might be memcpy'd out of the init_task.
>> Or, for pkeys, we might build it ourselves with init_pkru_val.
>
> So should there be an error case if there isn't an "old" state, and
> the user calls:
>
> p = begin_update_one_xsave_feature(XFEATURE_something, false);
>
> Maybe instead of an error, just fill it in with the init state for the feature?

Yes, please. Let's not generate an error. Let's populate the init
state and let them move on with life.
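
As a rough illustration of the "populate the init state" behavior, here
is a userspace sketch. Everything in it is hypothetical: the model_*
names, the two-feature layout, and the init values are made up for the
example and are not kernel APIs. It only shows the shape of the idea:
if the feature's bit is clear in the bitmap, the buffer is overwritten
with the init value before the pointer is handed back, so the caller
never sees stale memory and never gets an error.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum model_feature { MODEL_FEAT_PKRU, MODEL_FEAT_PASID, MODEL_FEAT_MAX };

struct model_xsave {
	uint64_t xstate_bv;		/* bit set: area[] holds live data */
	uint64_t area[MODEL_FEAT_MAX];	/* one slot per feature */
};

/* Illustrative init values only (assumed for this sketch). */
static const uint64_t model_init[MODEL_FEAT_MAX] = {
	[MODEL_FEAT_PKRU]  = 0x55555554,
	[MODEL_FEAT_PASID] = 0,
};

/*
 * Return a pointer the caller may update in place. If the feature is
 * marked as being in its init state, write the init value first so the
 * slot never exposes garbage.
 */
static uint64_t *model_begin_update(struct model_xsave *xs,
				    enum model_feature f)
{
	assert(f < MODEL_FEAT_MAX);

	if (!(xs->xstate_bv & (1ULL << f)))
		xs->area[f] = model_init[f];

	return &xs->area[f];
}
```

A caller that starts from a buffer full of junk and a zero bitmap gets
the init value back, while a feature whose bit is already set is
returned untouched.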

>>> pseudo-code:
>>>
>>> void *begin_update_one_xsave_feature(enum xfeature xfeature, bool full)
>>> {
>>> 	void *addr;
>>>
>>> 	BUG_ON(!(xsave->header.xcomp_bv & xfeature));
>>>
>>> 	addr = __raw_xsave_addr(xsave, xfeature);
>>>
>>> 	fpregs_lock();
>>>
>>> 	if (full)
>>> 		return addr;
>>
>> If the feature is marked as in the init state in the buffer
>> (XSTATE_BV[feature]==0), this addr *could* contain total garbage. So,
>> we'd want to make sure that the memory contents have the init state
>> written before handing them back to the caller. That's not strictly
>> required if the user is writing the whole thing, but it's the nice thing
>> to do.
>
> Nice guys waste CPU cycles writing to memory that is just going to get
> written again.

Let's skip the "replace" operation for now and focus on "update". A
full replace *can* be faster because it doesn't require the state to be
written out. But, we don't need that for now.

Let's just do the "update" thing, and let's ensure that we reflect the
init state into the buffer that gets returned if the feature was in its
init state.

Sound good?

>>> 	if (xfeature registers are "live")
>>> 		xsaves(xstate, 1 << xfeature);
>>
>> One little note: I don't think we would necessarily need to do an XSAVES
>> here. For PKRU, for instance, we could just do a rdpkru.
>
> Like this?
>
> 	if (tsk == current) {
> 		switch (xfeature) {
> 		case XFEATURE_PKRU:
> 			*(u32 *)addr = rdpkru();
> 			break;
> 		case XFEATURE_PASID:
> 			rdmsrl(MSR_IA32_PASID, msr);
> 			*(u64 *)addr = msr;
> 			break;
> 		... any other "easy" states ...
> 		default:
> 			xsaves(xstate, 1 << xfeature);
> 			break;
> 		}
> 	}

Yep.
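
To make the dispatch concrete, here is a self-contained userspace model
of it. All of the fast_* names are hypothetical and the "registers" are
just struct fields; nothing here touches real PKRU, MSRs, or XSAVES. The
point is only that components with a cheap dedicated read can bypass the
generic save path, and that the generic path stays as the fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum fast_feature {
	FAST_FEAT_PKRU, FAST_FEAT_PASID, FAST_FEAT_OTHER, FAST_FEAT_MAX
};

struct fast_regs {			/* stands in for live CPU state */
	uint32_t pkru;
	uint64_t pasid_msr;
	uint64_t other;
};

struct fast_buf {			/* stands in for the xsave image */
	uint64_t area[FAST_FEAT_MAX];
	unsigned int generic_saves;	/* how often the slow path ran */
};

/* Slow path: models a generic single-component save. */
static void fast_generic_save(const struct fast_regs *r,
			      struct fast_buf *b, enum fast_feature f)
{
	if (f == FAST_FEAT_OTHER)
		b->area[f] = r->other;
	b->generic_saves++;
}

/* Copy one component from "registers" to memory, cheaply if possible. */
static void fast_sync_one(const struct fast_regs *r, struct fast_buf *b,
			  enum fast_feature f)
{
	assert(f < FAST_FEAT_MAX);

	switch (f) {
	case FAST_FEAT_PKRU:
		b->area[f] = r->pkru;		/* models a direct PKRU read */
		break;
	case FAST_FEAT_PASID:
		b->area[f] = r->pasid_msr;	/* models an MSR read */
		break;
	default:
		fast_generic_save(r, b, f);	/* everything else */
		break;
	}
}
```

In this model, syncing PKRU and PASID leaves generic_saves at zero; only
the third component takes the fallback.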

>>> 	return addr;
>>> }
>>>
>>> void finish_update_one_xsave_feature(enum xfeature xfeature)
>>> {
>>> 	mark feature modified
>>
>> I think we'd want to do this at the "begin" time. Also, do you mean we
>> should set XSTATE_BV[feature]?
>
> Begin? End? It's all inside fpregs_lock(). But whatever seems best.

I'm actually waffling about it.

Does XSTATE_BV[feature] mean that state *is* there or that state *may*
be there? Either way works.
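
One way to see why either reading works is the small userspace model
below. It is a sketch under stated assumptions, not kernel code: the
bv_* names and the init values are invented for the example. It relies
only on the architectural restore rule that a set XSTATE_BV bit means
"load the component from memory" and a clear bit means "put the
component in its init state". If the init value is written into the
buffer before the bit is set, a restore produces the same result either
way, so setting the bit at "begin" time is harmless.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define BV_NFEAT 2

/* Illustrative init values only (assumed for this sketch). */
static const uint64_t bv_init[BV_NFEAT] = { 0x55555554, 0 };

struct bv_image {
	uint64_t xstate_bv;
	uint64_t area[BV_NFEAT];
};

/* Models what a restore would load for one component. */
static uint64_t bv_restore_one(const struct bv_image *img, unsigned int f)
{
	assert(f < BV_NFEAT);

	if (img->xstate_bv & (1ULL << f))
		return img->area[f];	/* bit set: take memory contents */
	return bv_init[f];		/* bit clear: init, memory ignored */
}

/* "May be there" policy: fill in init state, then set the bit early. */
static uint64_t *bv_begin_update(struct bv_image *img, unsigned int f)
{
	assert(f < BV_NFEAT);

	if (!(img->xstate_bv & (1ULL << f))) {
		img->area[f] = bv_init[f];
		img->xstate_bv |= 1ULL << f;
	}
	return &img->area[f];
}
```

Calling bv_begin_update() without writing anything leaves the restored
value unchanged even though the bit flipped from 0 to 1; a later write
through the returned pointer is what actually changes the state.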

>> It's also worth noting that we *could*:
>>
>> xrstors(xstate, 1<<xfeature);
>>
>> as well. That would bring the registers back up to date and we could
>> keep TIF_NEED_FPU_LOAD==0.
>
> Only makes sense if "tsk == current". But does this help? The work seems
> to be the same whether we do it now, or later. We don't know for sure
> that we will directly return to the task. We might context switch to
> another task, so loading the state into registers now would just be
> wasted time.

True, but the other side of the coin is that setting TIF_NEED_FPU_LOAD
subjects us to an XRSTOR on the way out to userspace. That XRSTOR might
just be updating one state component in practice.

Either way, sorry for the distraction. We (I) should really be
focusing on getting something that is functional but slow.