Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

From: Andy Lutomirski
Date: Sat Sep 09 2017 - 15:28:59 EST

On Sat, Sep 9, 2017 at 12:09 PM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> On Sat, Sep 09, 2017 at 11:47:33AM -0700, Linus Torvalds wrote:
>> The thing is, even with the delayed TLB flushing, I don't think it
>> should be *so* delayed that we should be seeing a TLB fill from
>> garbage page tables.
> Yeah, but we can't know what kind of speculative accesses happen between
> the removal from the mask and the actual flushing.
>> But the part in Andy's patch that worries me the most is that
>> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
>> in enter_lazy_tlb(). It means that we won't be notified by people
>> invalidating the page tables, and while we then do re-validate the TLB
>> when we switch back from lazy mode, I still worry. I'm not entirely
>> convinced by that tlb_gen logic.
>> I can't actually see anything *wrong* in the tlb_gen logic, but it worries me.
> Yeah, sounds like we're uncovering a situation of possibly stale
> mappings which we haven't had before. Or at least widening that window.
> And I still need to analyze what that MCE on Markus' machine is saying
> exactly. The TlbCacheDis thing is an optimization which does away with
> memory type checks. But we probably will have to disable it on those
> boxes as we can't guarantee pagetable elements are all in WB mem...
> Or we can guarantee them in WB but the lazy flushing delays the actual
> clearing of the TLB entries so much so that they end up pointing to
> garbage, as you say, which is not in WB mem and thus causes the protocol
> error.
> Hmm. All still wet.

I think it's my theory #3. The CPU has a "paging-structure cache"
(Intel lingo) that points to a freed page. The CPU speculatively
follows it and gets complete garbage, triggering this MCE and who
knows what else.

I propose the following fix. If PCID is on, then, in
enter_lazy_tlb(), we switch to init_mm with the no-flush flag set.
(And we give init_mm its own dedicated ASID to keep it simple and fast
-- no need to use the LRU ASID mapping to assign one dynamically.) We
clear the bit in mm_cpumask. That is, we more or less just skip the
whole lazy TLB optimization and rely on PCID CPUs having reasonably
fast CR3 writes. No extra IPIs. I suppose I need to benchmark this.
It will certainly slow down workloads that rapidly toggle between a
user thread and a kernel thread because it forces serialization on
each mm switch, but maybe that's not so bad.

If PCID is off, then we leave the old CR3 value when we go lazy, and
we also leave the flag in mm_cpumask set. When a flush is requested,
we send out the IPI and switch to init_mm (and flush because we have
no choice). IOW, the no-PCID behavior goes back to what it used to be.

For the PCID case, I'm relying on this language in the SDM (vol 3, 4.10):

When a logical processor creates entries in the TLBs (Section 4.10.2)
and paging-structure caches (Section 4.10.3), it associates those
entries with the current PCID. When using entries in the TLBs and
paging-structure caches to translate a linear address, a logical
processor uses only those entries associated with the current PCID
(see Section for an exception).

This is also just common sense -- a CPU that makes any assumptions
about a paging-structure cache for an inactive ASID is just nuts,
especially if it assumes that the result of following it is at all
sane. IOW, we really should be able to switch to ASID 1 and back to 0
without any flushes without worrying that the old page tables for ASID
1 might get freed afterwards. Obviously we need to flush if we switch
back to PCID 1, but the code already does this.

Also, sorry Rik, this means your old increased laziness optimization
is dead in the water. It will have exactly the same speculative load
problem.