Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes

From: Nadav Amit
Date: Tue Aug 27 2019 - 15:46:59 EST


> On Aug 27, 2019, at 11:28 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 8/23/19 3:52 PM, Nadav Amit wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
>> flush the user page-tables when PTI is enabled therefore introduces
>> significant overhead.
>
> I'm not sure this is worth all the churn, especially in the entry code.
> For large flushes (> tlb_single_page_flush_ceiling), we don't do
> INVPCIDs in the first place.

It is possible to jump from flush_tlb_func() into the trampoline page
instead of flushing the TLB in the entry code. However, that induces higher
overhead (switching CR3s), so it would only be useful if multiple TLB entries
are flushed at once. It also gives up the opportunity to promote
individual-entry flushes into a full TLB flush when multiple flushes are
issued or when a context switch takes place before returning to user space.
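
To illustrate the promotion idea only, here is a rough userspace sketch. It
is not the patch code: the struct, the function names and the MAX_DEFERRED
threshold are all hypothetical, and the actual flush instructions are left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; this is not the code from the patch. */
#define MAX_DEFERRED 4	/* assumed number of entries worth flushing one by one */

struct deferred_flush {
	unsigned long addr[MAX_DEFERRED];	/* queued user addresses */
	size_t nr;				/* number of queued entries */
	bool full;				/* promoted to a full flush */
};

/* Queue a single-entry flush; promote to a full flush on overflow. */
static void defer_flush(struct deferred_flush *d, unsigned long addr)
{
	if (d->full)
		return;			/* a full flush already covers it */
	if (d->nr == MAX_DEFERRED) {
		d->full = true;		/* too many entries: promote */
		d->nr = 0;
		return;
	}
	assert(d->nr < MAX_DEFERRED);
	d->addr[d->nr++] = addr;
}

/* A context switch before returning to user space also promotes. */
static void note_context_switch(struct deferred_flush *d)
{
	d->full = true;
	d->nr = 0;
}

/*
 * Called on return to user space. Returns the number of single-entry
 * flushes that would be issued, or -1 if a full flush is needed instead.
 * The pending state is cleared in both cases.
 */
static int flush_on_return(struct deferred_flush *d)
{
	int n;

	if (d->full) {
		d->full = false;
		return -1;
	}
	n = (int)d->nr;
	d->nr = 0;
	return n;
}
```

Flushing immediately from flush_tlb_func() through the trampoline would
never reach the promotion branches, since each entry would already have been
flushed by the time the next one arrives.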

There are cases/workloads that flush multiple (but not too many) TLB entries
on every syscall, for instance issuing msync() or running the Apache web
server. So I am not sure that tlb_single_page_flush_ceiling saves the day.
Besides, you may want to recalibrate (lower) tlb_single_page_flush_ceiling
when PTI is used.
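
For reference, the ceiling check amounts to roughly the following. This is a
simplified sketch, not the kernel code: the helper name is made up, 4KB pages
are assumed, and the ceiling is passed as a parameter so that a lower value
for PTI can be tried.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4KB pages */

/*
 * Hypothetical helper: true if [start, end) is small enough to be flushed
 * entry by entry, false if a full TLB flush would be used instead.
 */
static bool wants_ranged_flush(unsigned long start, unsigned long end,
			       unsigned long ceiling)
{
	return ((end - start) >> SKETCH_PAGE_SHIFT) <= ceiling;
}
```

With a ceiling of 33 pages (which I believe is the current default), a flush
of a handful of pages still goes entry by entry, which is exactly the case
where the per-entry cost under PTI matters.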

> I'd really want to understand what the heck is going on that makes
> INVPCID so slow, first.

INVPCID-single is slow (even more than the 133 cycles slower than INVLPG that
you mentioned; I don't have the numbers in front of me). I thought that this
was a known fact, although, obviously, it does not make much sense.