Re: [PATCH v2] x86/mm/tlb: avoid reading mm_tlb_gen when possible

From: Nadav Amit
Date: Fri Jul 08 2022 - 03:00:04 EST


On Jul 7, 2022, at 10:56 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:

> On Jul 7, 2022, at 9:23 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>
>> On Jul 7, 2022, at 8:27 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>>
>>> On Mon, 6 Jun 2022, Nadav Amit wrote:
>>>
>>>> From: Nadav Amit <namit@xxxxxxxxxx>
>>>>
>>>> On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
>>>> contended and reading it should (arguably) be avoided as much as
>>>> possible.
>>>>
>>>> Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
>>>> even when it is not necessary (e.g., the mm was already switched).
>>>> This is wasteful.
>>>>
>>>> Moreover, one of the existing optimizations is to read mm's tlb_gen to
>>>> see if there are additional in-flight TLB invalidations and flush the
>>>> entire TLB in such a case. However, if the request's tlb_gen was already
>>>> flushed, the benefit of checking the mm's tlb_gen is likely to be offset
>>>> by the overhead of the check itself.
>>>>
>>>> Running will-it-scale with tlb_flush1_threads shows a considerable
>>>> benefit on 56-core Skylake (up to +24%):
>>>>
>>>> threads   Baseline (v5.17+)   +Patch
>>>>  1        159960              160202
>>>>  5        310808              308378  (-0.7%)
>>>> 10        479110              490728
>>>> 15        526771              562528
>>>> 20        534495              587316
>>>> 25        547462              628296
>>>> 30        579616              666313
>>>> 35        594134              701814
>>>> 40        612288              732967
>>>> 45        617517              749727
>>>> 50        637476              735497
>>>> 55        614363              778913  (+24%)
>>>>
>>>> Acked-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>>>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>>>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>>>> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
>>>> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>>> Cc: x86@xxxxxxxxxx
>>>> Signed-off-by: Nadav Amit <namit@xxxxxxxxxx>
>>>>
>>>> --
>>>>
>>>> Note: The benchmarked kernels include Dave's revert of commit
>>>> 6035152d8eeb ("x86/mm/tlb: Open-code on_each_cpu_cond_mask() for
>>>> tlb_is_not_lazy()")
>>>> ---
>>>> arch/x86/mm/tlb.c | 18 +++++++++++++++++-
>>>> 1 file changed, 17 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>>>> index d400b6d9d246..d9314cc8b81f 100644
>>>> --- a/arch/x86/mm/tlb.c
>>>> +++ b/arch/x86/mm/tlb.c
>>>> @@ -734,10 +734,10 @@ static void flush_tlb_func(void *info)
>>>>  	const struct flush_tlb_info *f = info;
>>>>  	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
>>>>  	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
>>>> -	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
>>>>  	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
>>>>  	bool local = smp_processor_id() == f->initiating_cpu;
>>>>  	unsigned long nr_invalidate = 0;
>>>> +	u64 mm_tlb_gen;
>>>>
>>>>  	/* This code cannot presently handle being reentered. */
>>>>  	VM_WARN_ON(!irqs_disabled());
>>>> @@ -771,6 +771,22 @@ static void flush_tlb_func(void *info)
>>>>  		return;
>>>>  	}
>>>>
>>>> +	if (f->new_tlb_gen <= local_tlb_gen) {
>>>> +		/*
>>>> +		 * The TLB is already up to date with respect to f->new_tlb_gen.
>>>> +		 * While the core might still be behind mm_tlb_gen, checking
>>>> +		 * mm_tlb_gen unnecessarily would have negative caching effects
>>>> +		 * so avoid it.
>>>> +		 */
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Defer mm_tlb_gen reading as long as possible to avoid cache
>>>> +	 * contention.
>>>> +	 */
>>>> +	mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
>>>> +
>>>>  	if (unlikely(local_tlb_gen == mm_tlb_gen)) {
>>>>  		/*
>>>>  		 * There's nothing to do: we're already up to date. This can
>>>> --
>>>> 2.25.1
>>>
>>> I'm sorry, but bisection and reversion show that this commit,
>>> aa44284960d550eb4d8614afdffebc68a432a9b4 in current linux-next,
>>> is responsible for the "internal compiler error: Segmentation fault"s
>>> I get when running kernel builds on tmpfs in 1G memory, lots of swapping.
>>>
>>> That tmpfs is using huge pages as much as it can, so splitting and
>>> collapsing, compaction and page migration are all entailed, in case
>>> that's relevant (maybe this commit is perfect, but there's a TLB
>>> flushing bug over there in mm which this commit just exposes).
>>>
>>> Whether those segfaults happen without the huge page element,
>>> I have not done enough testing to tell - there are other bugs with
>>> swapping in current linux-next. Indeed, I wouldn't even have found
>>> this one if I hadn't already been bisecting another bug, and got
>>> thrown off course by these segfaults.
>>>
>>> I hope that you can work out what might be wrong with this,
>>> but meantime I think it needs to be reverted.
>>
>> I always find it surprising how trivial one-liners fail.
>>
>> As you probably know, debugging these kinds of things is hard. I see two
>> possible cases:
>>
>> 1. The failure is directly related to this optimization. The immediate
>> suspect in my mind is something to do with PCID/ASID.
>>
>> 2. The failure is due to another bug that was papered over by “enough” TLB
>> flushes.
>>
>> I will look into the code. But if it is possible, it would be helpful to
>> know whether you get the failure with the “nopcid” kernel parameter. If it
>> passes, it wouldn’t say much, but if it fails, I think (2) is more likely.
>>
>> Not arguing against a revert, but, in some way, if the test fails, it would
>> indicate that the optimization “works”…
>>
>> I’ll put some time to look deeper into the code, but it would be very
>> helpful if you can let me know what happens with nopcid.
>
> Actually, using only “nopcid” would most likely make the failure go away if
> PTI is enabled. So to get a good indication, we need to check whether it
> reproduces with both “nopti” and “nopcid”.
>
> I don’t have a better answer yet. Still trying to see what might have gone
> wrong.

Ok. My bad. Sorry. arch_tlbbatch_flush() does not set any generation in
flush_tlb_info. Bad.
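
To illustrate the failure mode (a rough sketch, not the actual kernel code,
assuming new_tlb_gen is simply left zero-initialized on the batched-flush
path):

	/*
	 * Sketch: arch_tlbbatch_flush() queues a flush_tlb_info with
	 * mm == NULL and never assigns new_tlb_gen, so it stays 0. The new
	 * early-return check in flush_tlb_func() then always fires:
	 */
	if (f->new_tlb_gen <= local_tlb_gen)	/* 0 <= local_tlb_gen: always true */
		return;				/* batched flush silently skipped,
						 * stale TLB entries survive */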

It should be fixed by something like the following - I’ll send a patch tomorrow:

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d9314cc8b81f..9f19894c322f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -771,7 +771,7 @@ static void flush_tlb_func(void *info)
 		return;
 	}
 
-	if (f->new_tlb_gen <= local_tlb_gen) {
+	if (unlikely(f->mm && f->new_tlb_gen <= local_tlb_gen)) {
 		/*
 		 * The TLB is already up to date with respect to f->new_tlb_gen.
 		 * While the core might still be behind mm_tlb_gen, checking
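
For reference, a sketch of why keying on f->mm helps (paraphrased from memory
of the batched-flush path, names not verified verbatim):

	/*
	 * Paraphrased: the flush_tlb_info used by arch_tlbbatch_flush() has
	 * no mm, so its generation can never be meaningful.
	 */
	static const struct flush_tlb_info full_flush_tlb_info = {
		.mm  = NULL,		/* no specific mm: flush everything */
		.end = TLB_FLUSH_ALL,
		/* .new_tlb_gen is implicitly zero */
	};

With f->mm == NULL, the new check is skipped and the batched flush proceeds
unconditionally, as before.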