Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm

From: Song Liu
Date: Sat Jun 02 2018 - 20:52:36 EST




> On Jun 2, 2018, at 1:14 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Fri, Jun 1, 2018 at 10:04 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
>>
>> On Fri, 2018-06-01 at 20:35 -0700, Andy Lutomirski wrote:
>>> On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
>>>>
>>>> On Fri, 1 Jun 2018 14:21:58 -0700
>>>> Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>>>
>>>>> Hmm. I wonder if there's a more clever data structure than a
>>>>> bitmap
>>>>> that we could be using here. Each CPU only ever needs to be in
>>>>> one
>>>>> mm's cpumask, and each cpu only ever changes its own state in the
>>>>> bitmask. And writes are much less common than reads for most
>>>>> workloads.
>>>>
>>>> It would be easy enough to add an mm_struct pointer to the
>>>> per-cpu tlbstate struct, and iterate over those.
>>>>
>>>> However, that would be an orthogonal change to optimizing
>>>> lazy TLB mode.
>>>>
>>>> Does the (untested) patch below make sense as a potential
>>>> improvement to the lazy TLB heuristic?
>>>>
>>>> ---8<---
>>>> Subject: x86,tlb: workload dependent per CPU lazy TLB switch
>>>>
>>>> Lazy TLB mode is a tradeoff between flushing the TLB and touching
>>>> the mm_cpumask(&init_mm) at context switch time, versus potentially
>>>> incurring a remote TLB flush IPI while in lazy TLB mode.
>>>>
>>>> Whether this pays off is likely to be workload dependent more than
>>>> anything else. However, the current heuristic keys off hardware
>>>> type.
>>>>
>>>> This patch changes the lazy TLB mode heuristic to a dynamic, per-CPU
>>>> decision, dependent on whether we recently received a remote TLB
>>>> shootdown while in lazy TLB mode.
>>>>
>>>> This is a very simple heuristic. When a CPU receives a remote TLB
>>>> shootdown IPI while in lazy TLB mode, a counter in the same cache
>>>> line is set to 16. Every time we skip lazy TLB mode, the counter
>>>> is decremented.
>>>>
>>>> While the counter is zero (no recent TLB flush IPIs), allow lazy
>>>> TLB mode.
>>>
>>> Hmm, cute. That's not a bad idea at all. It would be nice to get
>>> some kind of real benchmark on both PCID and !PCID. If nothing else,
>>> I would expect the threshold (16 in your patch) to want to be lower
>>> on
>>> PCID systems.
>>
>> That depends on how well we manage to get rid of
>> the cpumask manipulation overhead. On the PCID
>> system we first found this issue, the atomic
>> accesses to the mm_cpumask took about 4x as much
>> CPU time as the TLB invalidation itself.
>>
>> That kinda limits how much the cost of cheaper
>> TLB flushes actually help :)
>>
>> I agree this code should get some testing.
>>
>
> Just to check: in the workload where you're seeing this problem, are
> you using an mm with many threads? I would imagine that, if you only
> have one or two threads, the bit operations aren't so bad.

Yes, we are running netperf/netserver with 300 threads. We don't see
this much overhead with real workloads.
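
For context, the heuristic from Rik's patch that I'm tuning below boils down
to roughly the following. This is only a sketch with made-up names; in the
actual patch the counter is kept in the same cache line as the rest of the
per-CPU TLB state:

/* Sketch only -- illustrative names, not the actual patch. */
static DEFINE_PER_CPU(int, flushed_while_lazy);

/* Called from the remote TLB shootdown IPI while in lazy TLB mode. */
static void note_shootdown_while_lazy(void)
{
        /* 16 is the threshold value being tuned in the table below. */
        this_cpu_write(flushed_while_lazy, 16);
}

/* Called at context switch time: may this CPU enter lazy TLB mode? */
static bool lazy_tlb_allowed(void)
{
        int count = this_cpu_read(flushed_while_lazy);

        if (count) {
                /* Recent flush IPIs: skip lazy mode, decay the counter. */
                this_cpu_write(flushed_while_lazy, count - 1);
                return false;
        }

        /* No recent flush IPIs: lazy TLB mode is allowed. */
        return true;
}

The flushed_while_lazy column in the results below is the value this counter
is reset to on each shootdown IPI.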

Here are some test results. The tests were run on a 2-socket system with
E5-2678 v3, i.e. 48 logical cores, with PCID. I tested 3 kernels (and tuned
flushed_while_lazy):

Baseline:                0512e01345 in upstream
Baseline + lazy-tlb:     0512e01345 + "x86,tlb: workload dependent per CPU lazy TLB switch"
Baseline + both patches: 0512e01345 + "x86,tlb: workload dependent per CPU lazy TLB switch" + "x86,switch_mm: skip atomic operations for init_mm"

The tests were run with the following options:

./super_netperf 300 -P 0 -t TCP_RR -p 8888 -H <target_host> -l 60 -- -r 100,100 -o -s 1M,1M -S 1M,1M

Here are the results:
Kernel                   flushed_while_lazy  Throughput   %-cpu in switch_mm_irqs_off()

Baseline                 N/A                 2.02756e+06  2.82% (1.54% in netserver ctx + 1.28% in swapper ctx)
Baseline + lazy-tlb      16                  1.92706e+06  1.19% (swapper ctx only; same for all rows below)
Baseline + lazy-tlb      4                   2.03863e+06  1.12%  <-- good option 1
Baseline + lazy-tlb      2                   1.93847e+06  1.22%
Baseline + both patches  64                  1.92219e+06  1.09%
Baseline + both patches  16                  1.93597e+06  1.15%
Baseline + both patches  8                   1.92175e+06  1.11%
Baseline + both patches  4                   2.03465e+06  1.09%  <-- good option 2
Baseline + both patches  2                   2.03603e+06  1.10%  <-- good option 3
Baseline + both patches  1                   1.9241e+06   1.06%

I marked the 3 good options above; their throughput is about the same for
this test case.
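
As an aside, my reading of the per-CPU mm pointer idea from earlier in the
thread, again only as a sketch with made-up names: each CPU publishes the mm
it is currently running, and the flush side scans CPUs instead of reading an
atomic bitmap.

static DEFINE_PER_CPU(struct mm_struct *, loaded_mm_sketch);

/* Writer side: each CPU only ever updates its own slot, no atomics. */
static void record_loaded_mm(struct mm_struct *next)
{
        this_cpu_write(loaded_mm_sketch, next);
}

/* Reader side: build the set of CPUs that need a flush IPI for @mm. */
static void fill_flush_mask(struct mm_struct *mm, struct cpumask *mask)
{
        int cpu;

        cpumask_clear(mask);
        for_each_online_cpu(cpu)
                if (per_cpu(loaded_mm_sketch, cpu) == mm)
                        cpumask_set_cpu(cpu, mask);
}

That would trade the atomic cpumask updates at context switch time for an
O(nr_cpu_ids) scan on the flush side.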

Thanks,
Song