Re: [x86/mm/tlb] 6035152d8e: will-it-scale.per_thread_ops -13.2% regression

From: Nadav Amit
Date: Thu Mar 17 2022 - 16:32:42 EST




> On Mar 17, 2022, at 12:11 PM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 3/17/22 12:02, Nadav Amit wrote:
>>> This new "early lazy check" behavior could theoretically work both ways.
>>> If threads tended to be waking up from idle when TLB flushes were being
>>> sent, this would tend to reduce the number of IPIs. But, since they
>>> tend to be going to sleep it increases the number of IPIs.
>>>
>>> Anybody have a better theory? I think we should probably revert the commit.
>>
>> Let’s get back to the motivation behind this patch.
>>
>> Originally we had an indirect branch that, on systems which are
>> vulnerable to Spectre v2, translates into a retpoline.
>>
>> So I would not paraphrase this patch purpose as “early lazy check”
>> but instead “more efficient lazy check”. There is very little code
>> that was executed between the call to on_each_cpu_cond_mask() and
>> the actual check of tlb_is_not_lazy(). So what seems to happen
>> in this test-case - according to what you say - is that *slower*
>> checks of is-lazy allow sending fewer IPIs, since some cores go
>> into idle state.
>>
>> Was this test run with retpolines? If there is a difference in
>> performance without retpoline - I am probably wrong.
>
> Nope, no retpolines:

Err..

>
>> /sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Enhanced IBRS, IBPB: conditional, RSB filling
>
> which is the same situation as the "Xeon Platinum 8358" which found this
> in 0day.
>
> Maybe the increased IPIs with this approach end up being a wash with the
> reduced retpoline overhead.
>
> Did you have any specific performance numbers that show the benefit on
> retpoline systems?

I profiled this thing to death at the time, but I don’t have the numbers
with me now. I did not run will-it-scale, but a similar benchmark
that I wrote.

Another possibility is that this patch alone, without the subsequent
patches in the series, has some negative impact. I do not have a good
explanation for why that would be, but can we rule it out?

Can you please clarify how the bot works - did it notice a performance
regression and then start bisecting, or did it just check one patch
at a time?

I ask because I got a different report from the bot, saying that a
subsequent patch (“x86/mm/tlb: Privatize cpu_tlbstate”) made a
23.3% improvement [1] for a very similar (yet different) test.

Without a good explanation, my knee-jerk reaction is that this seems
like a pathological case. I do not expect a performance improvement
without retpolines, and perhaps the few cycles by which the is-lazy
test is performed earlier do matter.
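
To make the two shapes of the check concrete, here is a minimal userspace
sketch. It is not the kernel code: NR_CPUS, cpu_is_lazy[], not_lazy() and
both count_ipis_*() helpers are hypothetical names made up for
illustration, and an IPI is modelled as a simple counter increment. It
only shows that the two variants pick the same targets when the lazy
flags do not change; the differences are the indirect call and the
moment at which each flag is read.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 8 /* hypothetical, for illustration only */

/* Stand-in for the per-CPU "is lazy" state; not the kernel's data structure. */
static bool cpu_is_lazy[NR_CPUS];

/* Predicate in the spirit of tlb_is_not_lazy(); the signature is simplified. */
static bool not_lazy(int cpu)
{
	return !cpu_is_lazy[cpu];
}

/*
 * Callback shape: the predicate is reached through a function pointer, so
 * every CPU costs one indirect call (a retpoline where that mitigation is
 * in use).
 */
static size_t count_ipis_indirect(bool (*cond)(int cpu))
{
	size_t ipis = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cond(cpu))
			ipis++;
	return ipis;
}

/* Direct shape: the lazy flag is read inline while picking the targets. */
static size_t count_ipis_direct(void)
{
	size_t ipis = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (!cpu_is_lazy[cpu])
			ipis++;
	return ipis;
}
```

With a static set of flags both helpers return the same count. Any
difference in the number of IPIs therefore has to come from flags that
change between the two points in time at which they are sampled, which
is the effect discussed above.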

I’m not married to this patch, but before a revert it would be good
to know why it even matters. I wonder whether you can confirm that
reverting the patch (without the rest of the series) even helps. If
it does, I’ll try to run some tests to understand what the heck is
going on.

[1] https://lists.ofono.org/hyperkitty/list/lkp@xxxxxxxxxxxx/thread/UTC7DVZX4O5DKT2WUTWBTCVQ6W5QLGFA/