Re: [PATCH v6 12/12] x86/traps: Fix up invalid PASID

From: Andy Lutomirski
Date: Mon Aug 03 2020 - 13:17:07 EST

On Mon, Aug 3, 2020 at 9:37 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> On 8/3/20 8:12 AM, Andy Lutomirski wrote:
> > I could easily be convinced that the PASID fixup is so trivial and so
> > obviously free of misfiring in a way that causes an infinite loop that
> > this code is fine. But I think we first need to answer the bigger
> > question of why we're doing a lazy fixup in the first place.
> There was an (internal to Intel) implementation of this about a year ago
> that used smp_call_function_many() to force the MSR state into all
> threads of a process. I took one look at it, decided there was a 0%
> chance of it actually functioning and recommended we find another way.
> While I'm sure it could be done more efficiently, the implementation I
> looked at took ~200 lines of code and comments. It started to look too
> much like another instance of mm->cpumask for my comfort.

If I were implementing this, I would try making switch_mm_irqs_off()
do, roughly:

void load_mm_pasid(...) {
if (cpu_feature_enabled(X86_FEATURE_ENQCMD))
tsk->xstate[offset] = READ_ONCE(next->context.pasid);

This costs one cache miss, although the cache line in question is
about to be read anyway. It might be faster to, instead, do:

void load_mm_pasid(...) {
u32 pasid = READ_ONCE(next->context.pasid);

if (tsk->xstate[offset] != pasid)
tsk->state[offset] = pasid;

so we don't dirty the cache line in the common case. The actual
generated code ought to be pretty good -- surely the offset of PASID
in XSTATE is an entry in an array somewhere that can be found with a
single read, right?

The READ_ONCE is because this could race against a write to
context.pasid, so this code needs to be at the end of the function
where it's protected by mm_cpumask. With all this done, the pasid
update is just on_each_cpu_mask(mm_cpumask(mm), load_mm_pasid, mm,

This looks like maybe 20 lines of code. As an added bonus, it lets us
free PASIDs early if we ever decide we want to.

May I take this opportunity to ask Intel to please put some real
thought into future pieces of CPU state? Here's a summary of some
things we have:

- Normal extended state (FPU, XMM, etc): genuinely per thread and only
ever used explicitly. Actually makes sense with XSAVE(C/S).

- PKRU: affects CPL0-originated memory accesses, so it needs to be
eagerly loaded in the kernel. Does not make sense with XRSTOR(C/S),
but it's in there anyway.

- CR3: per-mm state. Makes some sense in CR3, but it's tangled up
with CR4 in nasty ways.

- LDTR: per-mm on Linux and mostly obsolete everyone. In it's own
register, so it's not a big deal.

- PASID: per-mm state (surely Intel always intended it to be per-mm,
since it's for shared _virtual memory_!). But for some reason it's in
an MSR (which is slow), and it's cleverly, but not that cleverly,
accessible with XSAVES/XRSTORS. Doesn't actually make sense. Also,
PASID is lazy-loadable, but the mechanism for telling the kernel that
a lazy load is needed got flubbed.

- TILE: genuinely per-thread, but it's expensive so it's
lazy-loadable. But the lazy-load mechanism reuses #NM, and it's not
fully disambiguated from the other use of #NM. So it sort of works,
but it's gross.

- "KERNEL_GS_BASE", i.e. the shadow GS base. This is logically
per-user-thread state, but it's only accessible in MSRs. For some
reason this is *not* in XSAVES/XRSTORS state, nor is there any
efficient way to access it at all.

- Segment registers: can't be properly saved except by hypervisors,
and can almost, but not quite, be properly loaded (assuming the state
was sane to begin with) by normal kernels. Just don't try to load 1,
2, or 3 into any of them.

Sometimes I think that this is all intended to be as confusing as
possible and that it's all a ploy to keep context switches slow and
complicated. Maybe Intel doesn't actually want to compete with other
architectures that can context switch quickly?

It would be really nice if we had a clean way to load per-mm state
(see my private emails about this), a clean way to load CPL3 register
state, and a clean way to load per-user-thread *kernel* register state
(e.g. PKRU and probably PKRS). And there should be an exception that
says "user code accessed a lazy-loaded resource that isn't loaded, and
this is the resource it tried to access".