Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible

From: Nadav Amit
Date: Mon Jun 06 2022 - 17:09:07 EST


On Jun 6, 2022, at 1:48 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:

> On Mon, Jun 6, 2022, at 11:01 AM, Nadav Amit wrote:
>> From: Nadav Amit <namit@xxxxxxxxxx>
>>
>> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
>> contended and reading it should (arguably) be avoided as much as
>> possible.
>>
>> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
>> even when it is not necessary (e.g., the mm was already switched).
>> This is wasteful.
>>
>> Moreover, one of the existing optimizations is to read mm's tlb_gen to
>> see if there are additional in-flight TLB invalidations and flush the
>> entire TLB in such a case. However, if the request's tlb_gen was already
>> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
>> by the overhead of the check itself.
>>
>> Running will-it-scale with tlb_flush1_threads shows a considerable
>> benefit on 56-core Skylake (up to +24%):
>
> Acked-by: Andy Lutomirski <luto@xxxxxxxxxx>
>
> But...
>
> I'm suspicious that the analysis is missing something. Under this kind of workload, there are a whole bunch of flushes being initiated, presumably in parallel. Each flush does an RMW on mm_tlb_gen (which will make the cacheline exclusive on the initiating CPU). And each flush sends out an IPI, and the IPI handler reads mm_tlb_gen (which makes the cacheline shared) when it updates the local tlb_gen. So you're doing (at least!) an E->S and S->E transition per flush. Your patch doesn't change this.
>
> But your patch does add a whole new case in which the IPI handler simply doesn't flush! I think it takes either quite a bit of racing or a well-timed context switch to hit that case, but, if you hit it, then you skip a flush and you skip the read of mm_tlb_gen.
>
> Have you tested what happens if you do something like your patch but you also make the mm_tlb_gen read unconditional? I'm curious if there's more to the story than you're seeing.
>
> You could also contemplate a somewhat evil hack in which you don't read mm_tlb_gen even if you *do* flush and instead use f->new_tlb_gen. That would potentially do a bit of extra flushing but would avoid the flush path causing the E->S transition. (Which may be of dubious value for real workloads, since I don't think there's a credible way to avoid having context switches read mm_tlb_gen.)

Thanks Andy. I still think that the performance gain comes from saving cache
accesses, which are skipped in certain cases in this workload. I would note
that this patch came out of profiling will-it-scale, after Dave complained
that I ruined the performance in some other patch. So this is not a random
“I tried something and it’s better” patch.

I vaguely remember profiling the number of cache-[something] and seeing an
effect, and I cannot explain such a performance improvement by just skipping
a flush. But...

Having said all of that, I will run at least the first experiment that you
asked for. I was considering skipping the read of mm_tlb_gen completely, but
for the reasons that you mentioned, I considered it something that might
introduce a performance regression for workloads that are more important than
will-it-scale.

I would also admit that I am not sure how to completely prevent a speculative
read of mm->tlb_gen. I guess a serializing instruction is out of the
question, so this optimization is best-effort.
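
To make the two variants discussed above concrete, here is a minimal
user-space sketch. It is a toy model, not the kernel code: struct mm_model,
struct cpu_model, struct flush_req, handle_flush() and replay_two_flushes()
are hypothetical names made up for illustration, and the model ignores the
partial-vs-full flush distinction, context switches and memory ordering. It
only counts how often the handler reads the shared generation and how often
it flushes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the mm: tlb_gen models the contended cacheline. */
struct mm_model {
	_Atomic uint64_t tlb_gen;
	unsigned long handler_reads;	/* reads of tlb_gen done by IPI handlers */
};

/* Hypothetical per-CPU state: the generation this CPU has caught up to. */
struct cpu_model {
	uint64_t local_tlb_gen;
	unsigned long flushes;
};

/* Hypothetical flush request carrying the generation the initiator created. */
struct flush_req {
	uint64_t new_tlb_gen;
};

/* Initiator side: the RMW on the shared generation. */
static struct flush_req initiate_flush(struct mm_model *mm)
{
	struct flush_req f = {
		.new_tlb_gen = atomic_fetch_add(&mm->tlb_gen, 1) + 1,
	};
	return f;
}

/*
 * Handler side. The early exit avoids touching mm->tlb_gen when the request
 * was already covered by an earlier flush. With use_request_gen set, the
 * handler never reads mm->tlb_gen and trusts the request's generation
 * instead (the "somewhat evil hack" variant).
 */
static bool handle_flush(struct mm_model *mm, struct cpu_model *cpu,
			 const struct flush_req *f, bool use_request_gen)
{
	if (f->new_tlb_gen <= cpu->local_tlb_gen)
		return false;		/* nothing to do, no shared read */

	if (use_request_gen) {
		cpu->local_tlb_gen = f->new_tlb_gen;
	} else {
		cpu->local_tlb_gen = atomic_load(&mm->tlb_gen);
		mm->handler_reads++;
	}
	cpu->flushes++;
	return true;
}

/* Two initiators bump the generation, then both requests reach one CPU. */
static void replay_two_flushes(bool use_request_gen,
			       unsigned long *flushes,
			       unsigned long *handler_reads)
{
	struct mm_model mm;
	struct cpu_model cpu = { .local_tlb_gen = 1, .flushes = 0 };

	assert(flushes && handler_reads);
	atomic_init(&mm.tlb_gen, 1);
	mm.handler_reads = 0;

	struct flush_req a = initiate_flush(&mm);	/* generation 2 */
	struct flush_req b = initiate_flush(&mm);	/* generation 3 */

	handle_flush(&mm, &cpu, &a, use_request_gen);
	handle_flush(&mm, &cpu, &b, use_request_gen);

	*flushes = cpu.flushes;
	*handler_reads = mm.handler_reads;
}
```

In this toy replay, reading the shared generation lets the first handler
catch up to both requests at once, so the second request hits the early exit
without a flush or a shared read. Trusting the request's generation instead
removes the handler's read entirely at the cost of one extra flush, which is
the trade-off described above. It says nothing about real cacheline
behavior; that still needs the experiment.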