Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible

From: Andy Lutomirski
Date: Mon Jun 06 2022 - 16:59:27 EST

Next message: Yury Norov: "Re: [PATCH 4/6] bitops: unify non-atomic bitops prototypes across architectures"
Previous message: Minchan Kim: "Re: [PATCH] zram_drv: add __GFP_NOMEMALLOC not to use ALLOC_NO_WATERMARKS"
In reply to: Nadav Amit: "[PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible"
Next in thread: Nadav Amit: "Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Jun 6, 2022, at 11:01 AM, Nadav Amit wrote:
> From: Nadav Amit <namit@xxxxxxxxxx>
>
> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
> contended and reading it should (arguably) be avoided as much as
> possible.
>
> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
> even when it is not necessary (e.g., the mm was already switched).
> This is wasteful.
>
> Moreover, one of the existing optimizations is to read mm's tlb_gen to
> see if there are additional in-flight TLB invalidations and flush the
> entire TLB in such a case. However, if the request's tlb_gen was already
> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
> by the overhead of the check itself.
>
> Running will-it-scale with tlb_flush1_threads show a considerable
> benefit on 56-core Skylake (up to +24%):

Acked-by: Andy Lutomirski <luto@xxxxxxxxxx>

But...

I'm suspicious that the analysis is missing something. Under this kind of workload, there are a whole bunch of flushes being initiated, presumably in parallel. Each flush does an RMW on mm_tlb_gen (which will make the cacheline exclusive on the initiating CPU). And each flush sends out an IPI, and the IPI handler reads mm_tlb_gen (which makes the cacheline shared) when it updates the local tlb_gen. So you're doing (at least!) an E->S and S->E transition per flush. Your patch doesn't change this.

But your patch does add a whole new case in which the IPI handler simply doesn't flush! I think it takes either quite a bit of racing or a well-timed context switch to hit that case, but, if you hit it, then you skip a flush and you skip the read of mm_tlb_gen.

Have you tested what happens if you do something like your patch but you also make the mm_tlb_gen read unconditional? I'm curious if there's more to the story than you're seeing.

You could also contemplate a somewhat evil hack in which you don't read mm_tlb_gen even if you *do* flush and instead use f->new_tlb_gen. That would potentially do a bit of extra flushing but would avoid the flush path causing the E->S transition. (Which may be of dubious value for real workloads, since I don't think there's a credible way to avoid having context switches read mm_tlb_gen.)

--Andy

Next message: Yury Norov: "Re: [PATCH 4/6] bitops: unify non-atomic bitops prototypes across architectures"
Previous message: Minchan Kim: "Re: [PATCH] zram_drv: add __GFP_NOMEMALLOC not to use ALLOC_NO_WATERMARKS"
In reply to: Nadav Amit: "[PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible"
Next in thread: Nadav Amit: "Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]