Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm

From: Rik van Riel
Date: Fri Jun 01 2018 - 18:13:48 EST

On Fri, 1 Jun 2018 14:21:58 -0700
Andy Lutomirski <luto@xxxxxxxxxx> wrote:

> Hmm. I wonder if there's a more clever data structure than a bitmap
> that we could be using here. Each CPU only ever needs to be in one
> mm's cpumask, and each cpu only ever changes its own state in the
> bitmask. And writes are much less common than reads for most
> workloads.

It would be easy enough to add an mm_struct pointer to the
per-cpu tlbstate struct, and iterate over those.

However, that would be an orthogonal change to optimizing
lazy TLB mode.
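
To make the per-cpu pointer idea concrete, here is a rough userspace
sketch, not kernel code. Everything in it is hypothetical and only for
illustration: the fixed NR_CPUS, the stand-in struct mm, the
cpu_loaded_mm[] array standing in for the per-cpu tlbstate, and both
helper names. Each CPU writes only its own slot, and a flusher finds
the target CPUs by scanning the slots.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 8			/* hypothetical CPU count for this sketch */

struct mm { int id; };			/* stand-in for struct mm_struct */

/* Stand-in for the per-cpu tlbstate: one mm pointer per CPU. */
static struct mm *cpu_loaded_mm[NR_CPUS];

/* A CPU records which mm it is running; it only writes its own slot. */
static void set_loaded_mm(int cpu, struct mm *mm)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cpu_loaded_mm[cpu] = mm;
}

/* A flusher builds the target mask by iterating over the per-cpu slots. */
static unsigned long cpus_using_mm(const struct mm *mm)
{
	unsigned long mask = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_loaded_mm[cpu] == mm)
			mask |= 1UL << cpu;

	return mask;
}
```

The scan costs O(NR_CPUS) reads per flush, which is the price for
context switches no longer writing to a bitmap shared between CPUs.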

Does the (untested) patch below make sense as a potential
improvement to the lazy TLB heuristic?

Subject: x86,tlb: workload dependent per CPU lazy TLB switch

Lazy TLB mode is a tradeoff between flushing the TLB and touching
the mm_cpumask(&init_mm) at context switch time, versus potentially
incurring a remote TLB flush IPI while in lazy TLB mode.

Whether this pays off is likely to be workload dependent more than
anything else. However, the current heuristic keys off hardware type.

This patch changes the lazy TLB mode heuristic to a dynamic, per-CPU
decision, dependent on whether we recently received a remote TLB
shootdown while in lazy TLB mode.

This is a very simple heuristic. When a CPU receives a remote TLB
shootdown IPI while in lazy TLB mode, a counter in the same cache
line is set to 16. Every time we skip lazy TLB mode, the counter
is decremented.

While the counter is zero (no recent TLB flush IPIs), allow lazy TLB mode.

Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
---
 arch/x86/include/asm/tlbflush.h | 32 ++++++++++++++++----------------
 arch/x86/mm/tlb.c               |  4 ++++
 2 files changed, 20 insertions(+), 16 deletions(-)
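
For anyone who wants to play with the heuristic outside the kernel,
the counter logic described in the changelog can be modeled in
userspace roughly as follows. This is only a sketch: struct cpu_state,
LAZY_SKIP_COUNT and all three function names are made up for
illustration, and a plain struct field stands in for the per-cpu
accessors used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LAZY_SKIP_COUNT 16	/* the value used in the changelog */

/* Stand-in for the per-cpu tlb_state; only the new field is modeled. */
struct cpu_state {
	uint16_t flushed_while_lazy;
};

/* Context switch to idle: returns true if lazy TLB mode may be used. */
static bool defer_switch_to_init_mm(struct cpu_state *cs)
{
	if (cs->flushed_while_lazy) {
		cs->flushed_while_lazy--;
		return false;
	}

	return true;
}

/* Flush IPI path: a shootdown arrived while this CPU was lazy. */
static void flushed_in_lazy_mode(struct cpu_state *cs)
{
	cs->flushed_while_lazy = LAZY_SKIP_COUNT;
}

/* Count how many of the next n switches to idle skip lazy TLB mode. */
static int count_skipped(struct cpu_state *cs, int n)
{
	int i, skipped = 0;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (!defer_switch_to_init_mm(cs))
			skipped++;

	return skipped;
}
```

In this model, after a single flush the next 16 switches to idle skip
lazy mode and the 17th goes lazy again. A second flush before the
counter drains resets it to 16 rather than adding to it.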

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6690cd3fc8b1..f06a934e317d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
#define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)

-static inline bool tlb_defer_switch_to_init_mm(void)
-{
-	/*
-	 * If we have PCID, then switching to init_mm is reasonably
-	 * fast. If we don't have PCID, then switching to init_mm is
-	 * quite slow, so we try to defer it in the hopes that we can
-	 * avoid it entirely. The latter approach runs the risk of
-	 * receiving otherwise unnecessary IPIs.
-	 *
-	 * This choice is just a heuristic. The tlb code can handle this
-	 * function returning true or false regardless of whether we have
-	 * PCID.
-	 */
-	return !static_cpu_has(X86_FEATURE_PCID);
-}
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -179,6 +163,7 @@ struct tlb_state {
 	struct mm_struct *loaded_mm;
 	u16 loaded_mm_asid;
 	u16 next_asid;
+	u16 flushed_while_lazy;
 	/* last user mm's ctx id */
 	u64 last_ctx_id;

@@ -246,6 +231,21 @@ struct tlb_state {
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

+static inline bool tlb_defer_switch_to_init_mm(void)
+{
+	/*
+	 * If the CPU recently received a TLB flush IPI while in lazy
+	 * TLB mode, do a straight switch to the idle task, and skip
+	 * lazy TLB mode for now.
+	 */
+	if (this_cpu_read(cpu_tlbstate.flushed_while_lazy)) {
+		this_cpu_dec(cpu_tlbstate.flushed_while_lazy);
+		return false;
+	}
+
+	return true;
+}
+
 /* Initialize cr4 shadow for this CPU. */
 static inline void cr4_init_shadow(void)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e055d1a06699..d8b0b7b236f3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -454,7 +454,11 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * paging-structure cache to avoid speculatively reading
 		 * garbage into our TLB. Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
+		 *
+		 * Skip lazy TLB mode for the next 16 context switches,
+		 * in case more TLB flush IPIs are coming.
 		 */
+		this_cpu_write(cpu_tlbstate.flushed_while_lazy, 16);
 		switch_mm_irqs_off(NULL, &init_mm, NULL);