Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI

From: Nadav Amit
Date: Tue Aug 27 2019 - 19:52:55 EST


> On Aug 27, 2019, at 4:18 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit <namit@xxxxxxxxxx> wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE, but it is
>> currently used to flush PTEs in the user page-table when PTI is used.
>>
>> Instead, it is possible to defer TLB flushes until after the user
>> page-tables are loaded. Preventing speculation over the TLB flushes
>> should keep the whole thing safe. In some cases, deferring TLB flushes
>> in such a way can result in more full TLB flushes, but arguably this
>> behavior is oftentimes beneficial.
>
> I have a somewhat horrible suggestion.
>
> Would it make sense to refactor this so that it works for user *and*
> kernel tables? In particular, if we flush a *kernel* mapping (vfree,
> vunmap, set_memory_ro, etc), we shouldn't need to send an IPI to a
> task that is running user code to flush most kernel mappings or even
> to free kernel pagetables. The same trick could be done if we treat
> idle like user mode for this purpose.
>
> In code, this could mostly consist of changing all the "user" data
> structures involved to something like struct deferred_flush_info and
> having one for user and one for kernel.
>
> I think this is horrible because it will enable certain workloads to
> work considerably faster with PTI on than with PTI off, and that would
> be a barely excusable moral failing. :-p
>
> For what it's worth, other than register clobber issues, the whole
> "switch CR3 for PTI" logic ought to be doable in C. I don't know a
> priori whether that would end up being an improvement.

I implemented (and have not yet sent) another TLB deferring mechanism. It is
intended for user mappings and not kernel one, but I think your suggestion
shares some similar underlying rationale, and therefore challenges and
solutions. Let me rephrase what you say to ensure we are on the same page.

The basic idea is context-tracking to check whether each CPU is in kernel or
user mode. Accordingly, TLB flushes can be deferred, but I donât see that
this solution is limited to PTI. There are 2 possible reasons, according to
my understanding, that you limit the discussion to PTI:

1. PTI provides clear boundaries when user and kernel mappings are used. I
am not sure that privilege-levels (and SMAP) do not do the same.

2. CR3 switching already imposes a memory barrier, which eliminates most of
the cost of implementing such scheme which requires something which is
similar to:

write new context (kernel/user)
mb();
if (need_flush) flush;

I do agree that PTI addresses (2), but there is another problem. A
reasonable implementation would store in a per-cpu state whether each CPU is
in user/kernel, and the TLB shootdown initiator CPU would check the state to
decide whether an IPI is needed. This means that pretty much every TLB
shutdown would incur a cache-miss per-target CPU. This might cause
performance regressions, at least in some cases.

Admittedly, I did implement something similar (not sent) for user mappings:
defer all TLB flushes and shootdowns if the CPUs are known to be in kernel
mode. But I limited myself for certain cases, specifically âlongâ syscalls
that are already likely to cause a TLB flush (e.g., msync()). I am not sure
that tracking each CPU entry/exit would be a good idea.

I will give some more thought about kernel mapping invalidations, which I
did not think about enough. I tried to send what I considered âsanerâ and
cleaner patches first. I still have patches I mentioned here, the
async-flushes, and another patch to avoid local TLB flush on CoW and instead
accessing the data.