Re: [RFC PATCH] x86/mm: Flush more aggressively in lazy TLB mode

From: Borislav Petkov
Date: Mon Oct 09 2017 - 13:37:05 EST


On Mon, Oct 09, 2017 at 07:02:31PM +0200, Borislav Petkov wrote:
> From 29a6426c20f25b8df767356a04727cb113e8e784 Mon Sep 17 00:00:00 2001
> From: Andy Lutomirski <luto@xxxxxxxxxx>
> Date: Mon, 9 Oct 2017 09:50:49 -0700
> Subject: [PATCH] x86/mm: Flush more aggressively in lazy TLB mode
>
> Since commit 94b1b03b519b, x86's lazy TLB mode has been all the way

Write it out:

Since commit

94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy...

> lazy: when running a kernel thread (including the idle thread), the
> kernel keeps using the last user mm's page tables without attempting
> to maintain user TLB coherence at all. From a pure semantic
> perspective, this is fine -- kernel threads won't attempt to access
> user pages, so having stale TLB entries doesn't matter.
>
> Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
> we also allow any paging-structure caches that may exist on the CPU
> to become incoherent. This means that we can have a
> paging-structure cache entry that references a freed page table, and
> the CPU is within its rights to do a speculative page walk starting
> at the freed page table.
>
> I can imagine this causing two different problems:
>
> - A speculative page walk starting from a bogus page table could read
> IO addresses. I haven't seen any reports of this causing problems.
>
> - A speculative page walk that involves a bogus page table can install
> garbage in the TLB. Such garbage would always be at a user VA, but
> some AMD CPUs have logic that triggers a machine check when it notices
> these bogus entries. I've seen a couple reports of this.

That AMD machine check logic is actually more of a performance optimization
which assumes that the paging-structure entries are in cacheable WB-DRAM:

"TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
performance optimization that assumes PML4, PDP, PDE, and PTE entries
are in cacheable WB-DRAM; memory type checks may be bypassed, and
addresses outside of WB-DRAM may result in undefined behavior or NB
protocol errors. 1=Disables performance optimization and allows PML4,
PDP, PDE and PTE entries to be in any memory type. Operating systems
that maintain page tables in memory types other than WB-DRAM must set
TlbCacheDis to insure proper operation."

The MCE generated is an NB protocol error which signals that:

"Link: A specific coherent-only packet from a CPU was issued to an
IO link. This may be caused by software which addresses page table
structures in a memory type other than cacheable WB-DRAM without
properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
example, when page table structure addresses are above top of memory. In
such cases, the NB will generate an MCE if it sees a mismatch between
the memory operation generated by the core and the link type."

I'm assuming coherent-only packets don't go out on IO links, thus the
error.
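
If somebody wants to double-check that on an affected box, here's a minimal
sketch for dumping the bit -- assuming MSRC001_0015 is what the kernel calls
MSR_K7_HWCR and that TlbCacheDis is bit 3 (do verify against the respective
family's BKDG before relying on it):

/*
 * Sketch only: report the TlbCacheDis bit on the current CPU.
 * Assumes MSRC001_0015 == MSR_K7_HWCR and TlbCacheDis == bit 3.
 */
#include <linux/kernel.h>
#include <asm/msr.h>

static void report_tlbcachedis(void)
{
	u64 hwcr;

	if (rdmsrl_safe(MSR_K7_HWCR, &hwcr))
		return;

	pr_info("HWCR: 0x%016llx, TlbCacheDis: %llu\n",
		hwcr, (hwcr >> 3) & 0x1);
}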

> Reinstate TLB coherence in lazy mode. With this patch applied, we
> do it in one of two ways. If we have PCID, we simply switch back to
> init_mm's page tables when we enter a kernel thread -- this seems to
> be quite cheap except for the cost of serializing the CPU. If we
> don't have PCID, then we set a flag and switch to init_mm the first
> time we would otherwise need to flush the TLB.
>
> /sys/kernel/debug/x86/tlb_use_lazy_mode can be changed to override
> the default mode for benchmarking.

So this should be tlb_full_lazy_mode, no?

Because we were lazy before too, just not "all the way", as you say above...

And frankly, I'm not really crazy about this benchmarking aspect. Why
would you have this in every kernel? It could be a patch on top for
people to measure on their boxes, but carrying it in mainline is not
really worth it: the PCID fun only showed some small improvement in some
microbenchmark and nothing earth-shattering, so having this knob
everywhere just to get some meh numbers is not worth the trouble.

Besides, this TLB code is already complex as hell.

IOW, you could simplify this by removing the static key and adding it
back with a patch on top for people who want to play with it. For the
majority of systems and use cases it doesn't really matter, IMO.
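
I.e., with the static key gone, enter_lazy_tlb() presumably boils down to
the PCID check you describe in the commit message (untested sketch, not
your exact hunk):

void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
	/* Already on init_mm's page tables - nothing to do. */
	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
		return;

	if (static_cpu_has(X86_FEATURE_PCID)) {
		/* CR3 switches are cheap with PCID: go to init_mm right away. */
		switch_mm(NULL, &init_mm, NULL);
	} else {
		/* Defer; switch_mm_irqs_off() clears is_lazy on the way back. */
		this_cpu_write(cpu_tlbstate.is_lazy, true);
	}
}

and the eager-vs-lazy override for benchmarking becomes a separate
debugging patch on top.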

> In theory, we could optimize this better by only flushing the TLB in
> lazy CPUs when a page table is freed. Doing that would require
> auditing the mm code to make sure that all page table freeing goes
> through tlb_remove_page() as well as reworking some data structures
> to implement the improved flush logic.
>
> Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
> Reported-by: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
> Reported-by: Adam Borowski <kilobyte@xxxxxxxxxx>
> Cc: Brian Gerst <brgerst@xxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxxxx>
> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> Cc: Adam Borowski <kilobyte@xxxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxx>
> Cc: Brian Gerst <brgerst@xxxxxxxxx>
> Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> Cc: Eric Biggers <ebiggers@xxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> Cc: Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx>
> Cc: Nadav Amit <nadav.amit@xxxxxxxxx>
> Cc: Rik van Riel <riel@xxxxxxxxxx>
> Cc: Roman Kagan <rkagan@xxxxxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: lkml <linux-kernel@xxxxxxxxxxxxxxx>
> Cc: x86-ml <x86@xxxxxxxxxx>
> Link: http://lkml.kernel.org/r/8eccc9240041ea7cb94624cab8d07e2a6e911ba7.1507567665.git.luto@xxxxxxxxxx
> Signed-off-by: Borislav Petkov <bp@xxxxxxx>
> ---
> arch/x86/include/asm/mmu_context.h | 8 +-
> arch/x86/include/asm/tlbflush.h | 24 ++++++
> arch/x86/mm/tlb.c | 153 +++++++++++++++++++++++++++----------
> 3 files changed, 136 insertions(+), 49 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index c120b5db178a..3c856a15b98e 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -126,13 +126,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
> DEBUG_LOCKS_WARN_ON(preemptible());
> }
>
> -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
> -{
> - int cpu = smp_processor_id();
> -
> - if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> - cpumask_clear_cpu(cpu, mm_cpumask(mm));
> -}
> +void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>
> static inline int init_new_context(struct task_struct *tsk,
> struct mm_struct *mm)
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 4893abf7f74f..d362161d3291 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -83,6 +83,13 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
> #endif
>
> /*
> + * If tlb_use_lazy_mode is true, then we try to avoid switching CR3 to point
> + * to init_mm when we switch to a kernel thread (e.g. the idle thread). If
> + * it's false, then we immediately switch CR3 when entering a kernel thread.
> + */
> +DECLARE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
> +
> +/*
> * 6 because 6 should be plenty and struct tlb_state will fit in
> * two cache lines.
> */
> @@ -105,6 +112,23 @@ struct tlb_state {
> u16 next_asid;
>
> /*
> + * We can be in one of several states:
> + *
> + * - Actively using an mm. Our CPU's bit will be set in
> + * mm_cpumask(loaded_mm) and is_lazy == false;
> + *
> + * - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit
> + * will not be set in mm_cpumask(&init_mm) and is_lazy == false.

And this is the old lazy mode, right?

I think we should state which "lazy" we mean ...

> + * - Lazily using a real mm. loaded_mm != &init_mm, our bit
> + * is set in mm_cpumask(loaded_mm), but is_lazy == true.
> + * We're heuristically guessing that the CR3 load we
> + * skipped more than makes up for the overhead added by
> + * lazy mode.
> + */
> + bool is_lazy;


> +
> + /*
> * Access to this CR4 shadow and to H/W CR4 is protected by
> * disabling interrupts when modifying either one.
> */
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 49d9778376d7..658bf0090565 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -30,6 +30,8 @@
>
> atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
>
> +DEFINE_STATIC_KEY_TRUE(tlb_use_lazy_mode);
> +
> static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> u16 *new_asid, bool *need_flush)
> {
> @@ -80,7 +82,7 @@ void leave_mm(int cpu)
> return;
>
> /* Warn if we're not lazy. */
> - WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
> + WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
>
> switch_mm(NULL, &init_mm, NULL);
> }
> @@ -142,45 +144,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> __flush_tlb_all();
> }
> #endif
> + this_cpu_write(cpu_tlbstate.is_lazy, false);
>
> if (real_prev == next) {
> VM_BUG_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> next->context.ctx_id);
>
> - if (cpumask_test_cpu(cpu, mm_cpumask(next))) {
> - /*
> - * There's nothing to do: we weren't lazy, and we
> - * aren't changing our mm. We don't need to flush
> - * anything, nor do we need to update CR3, CR4, or
> - * LDTR.
> - */
> - return;
> - }
> -
> - /* Resume remote flushes and then read tlb_gen. */
> - cpumask_set_cpu(cpu, mm_cpumask(next));
> - next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> -
> - if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
> - next_tlb_gen) {
> - /*
> - * Ideally, we'd have a flush_tlb() variant that
> - * takes the known CR3 value as input. This would
> - * be faster on Xen PV and on hypothetical CPUs
> - * on which INVPCID is fast.
> - */
> - this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
> - next_tlb_gen);
> - write_cr3(build_cr3(next, prev_asid));
> - trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
> - TLB_FLUSH_ALL);
> - }
> -
> /*
> - * We just exited lazy mode, which means that CR4 and/or LDTR
> - * may be stale. (Changes to the required CR4 and LDTR states
> - * are not reflected in tlb_gen.)
> + * We don't currently support having a real mm loaded without
> + * our cpu set in mm_cpumask(). We have all the bookkeeping

s/cpu/CPU/

> + * in place to figure out whether we would need to flush
> + * if our cpu were cleared in mm_cpumask(), but we don't

ditto.

> + * currently use it.
> */
> + if (WARN_ON_ONCE(real_prev != &init_mm &&
> + !cpumask_test_cpu(cpu, mm_cpumask(next))))
> + cpumask_set_cpu(cpu, mm_cpumask(next));
> +
> + return;
> } else {
> u16 new_asid;
> bool need_flush;
--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.