Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm

From: Rik van Riel
Date: Fri Jun 01 2018 - 15:43:38 EST


On Fri, 2018-06-01 at 20:48 +0200, Mike Galbraith wrote:
> On Fri, 2018-06-01 at 14:22 -0400, Rik van Riel wrote:
> > On Fri, 2018-06-01 at 08:11 -0700, Andy Lutomirski wrote:
> > > On Fri, Jun 1, 2018 at 5:28 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> > > >
> > > > Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
> > > > kernels, using 2.4% of a 48 CPU system during a netperf to localhost
> > > > run.
> > > > Digging into the profile, we noticed that cpumask_clear_cpu and
> > > > cpumask_set_cpu together take about half of the CPU time taken by
> > > > switch_mm_irqs_off.
> > > >
> > > > However, the CPUs running netperf end up switching back and forth
> > > > between netperf and the idle task, which does not require changes
> > > > to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
> > > > the most heavily contended one in the system.
> > > >
> > > > Skipping cpumask_clear_cpu and cpumask_set_cpu for init_mm
> > > > (mostly the idle task) reduced CPU use of switch_mm_irqs_off
> > > > from 2.4% of the CPU to 1.9% of the CPU, with the following
> > > > netperf commandline:
> > >
> > > I'm conceptually fine with this change. Does mm_cpumask(&init_mm)
> > > end up in a deterministic state?
> >
> > Given that we do not touch mm_cpumask(&init_mm)
> > any more, and that bitmask never appears to be
> > used for things like tlb shootdowns (kernel TLB
> > shootdowns simply go to everybody), I suspect
> > it ends up in whatever state it is initialized
> > to on startup.
> >
> > I had not looked into this much, because it does
> > not appear to be used for anything.
> >
> > > Mike, depending on exactly what's going on with your benchmark, this
> > > might help recover a bit of your performance, too.
> >
> > It will be interesting to know how this change
> > impacts others.
>
> previous pipe-test numbers
> 4.13.16 2.024978 usecs/loop -- avg 2.045250 977.9 KHz
> 4.14.47 2.234518 usecs/loop -- avg 2.227716 897.8 KHz
> 4.15.18 2.287815 usecs/loop -- avg 2.295858 871.1 KHz
> 4.16.13 2.286036 usecs/loop -- avg 2.279057 877.6 KHz
> 4.17.0.g88a8676 2.288231 usecs/loop -- avg 2.288917 873.8 KHz
>
> new numbers
> 4.17.0.g0512e01 2.268629 usecs/loop -- avg 2.269493 881.3 KHz
> 4.17.0.g0512e01 2.035401 usecs/loop -- avg 2.038341 981.2 KHz +andy
> 4.17.0.g0512e01 2.238701 usecs/loop -- avg 2.231828 896.1 KHz -andy+rik
>
> There might be something there with your change Rik, but it's small
> enough to be wary of variance. Andy's "invert the return of
> tlb_defer_switch_to_init_mm()" is OTOH pretty clear.

If inverting the return value of that function helps
some systems, chances are the other value might help
other systems.

That makes you wonder whether it might make sense
to always switch to lazy TLB mode, and only call
switch_mm at TLB flush time, regardless of whether
the CPU supports PCID...
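
To make that idea concrete, here is a rough user-space sketch, not
kernel code. Every name in it (toy_mm, toy_cpu, toy_switch_mm,
toy_enter_idle, toy_flush_tlb_mm) is hypothetical and only models the
behaviour discussed above: going idle leaves the user mm loaded and
marks the CPU lazy, the cpumask is never written for the init_mm
stand-in, and a lazy CPU is only switched away when a flush for that
mm arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Toy stand-in for struct mm_struct: just a bitmask of CPUs using it. */
struct toy_mm {
	unsigned long cpumask;
};

/* Toy per-CPU state: which mm is loaded, and whether it is held lazily. */
struct toy_cpu {
	struct toy_mm *loaded_mm;
	bool is_lazy;
};

static struct toy_mm toy_init_mm;	/* stands in for init_mm */
static struct toy_cpu toy_cpus[NR_CPUS];
static unsigned long toy_mask_writes;	/* counts simulated atomic mask ops */

/* Real address-space switch: the only place a cpumask is written. */
static void toy_switch_mm(int cpu, struct toy_mm *next)
{
	struct toy_mm *prev = toy_cpus[cpu].loaded_mm;

	if (prev == next) {
		/* Coming back to the same mm: nothing to update. */
		toy_cpus[cpu].is_lazy = false;
		return;
	}

	/* Skip the mask update whenever the mm is the init_mm stand-in. */
	if (prev && prev != &toy_init_mm) {
		prev->cpumask &= ~(1UL << cpu);
		toy_mask_writes++;
	}
	if (next != &toy_init_mm) {
		next->cpumask |= 1UL << cpu;
		toy_mask_writes++;
	}

	toy_cpus[cpu].loaded_mm = next;
	toy_cpus[cpu].is_lazy = false;
}

/* Going idle: keep the user mm loaded and only mark the CPU lazy. */
static void toy_enter_idle(int cpu)
{
	toy_cpus[cpu].is_lazy = true;
}

/*
 * Flush request for mm: a lazy CPU drops the mm (the deferred switch
 * happens here), a non-lazy CPU has to flush. Returns the number of
 * CPUs that actually had to flush.
 */
static int toy_flush_tlb_mm(struct toy_mm *mm)
{
	int flushed = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(mm->cpumask & (1UL << cpu)))
			continue;
		assert(toy_cpus[cpu].loaded_mm == mm);
		if (toy_cpus[cpu].is_lazy)
			toy_switch_mm(cpu, &toy_init_mm);
		else
			flushed++;
	}
	return flushed;
}
```

In this model a task that bounces between running and idle on the same
CPU never touches a cpumask at all; the only writes happen on a real
switch between two different user mms, or when a flush catches a lazy
CPU. Whether that trade-off wins on real hardware is exactly the open
question.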

--
All Rights Reversed.
