Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm

From: Rik van Riel
Date: Sat Jun 02 2018 - 01:04:13 EST


On Fri, 2018-06-01 at 20:35 -0700, Andy Lutomirski wrote:
> On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> >
> > On Fri, 1 Jun 2018 14:21:58 -0700
> > Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >
> > > Hmm. I wonder if there's a more clever data structure than a bitmap
> > > that we could be using here. Each CPU only ever needs to be in one
> > > mm's cpumask, and each CPU only ever changes its own state in the
> > > bitmask. And writes are much less common than reads for most
> > > workloads.
> >
> > It would be easy enough to add an mm_struct pointer to the
> > per-cpu tlbstate struct, and iterate over those.
> >
> > However, that would be an orthogonal change to optimizing
> > lazy TLB mode.
> >
> > Does the (untested) patch below make sense as a potential
> > improvement to the lazy TLB heuristic?
> >
> > ---8<---
> > Subject: x86,tlb: workload dependent per CPU lazy TLB switch
> >
> > Lazy TLB mode is a tradeoff between flushing the TLB and touching
> > the mm_cpumask(&init_mm) at context switch time, versus potentially
> > incurring a remote TLB flush IPI while in lazy TLB mode.
> >
> > Whether this pays off is likely to be workload dependent more than
> > anything else. However, the current heuristic keys off hardware
> > type.
> >
> > This patch changes the lazy TLB mode heuristic to a dynamic, per-CPU
> > decision, dependent on whether we recently received a remote TLB
> > shootdown while in lazy TLB mode.
> >
> > This is a very simple heuristic. When a CPU receives a remote TLB
> > shootdown IPI while in lazy TLB mode, a counter in the same cache
> > line is set to 16. Every time we skip lazy TLB mode, the counter
> > is decremented.
> >
> > While the counter is zero (no recent TLB flush IPIs), allow lazy
> > TLB mode.
>
> Hmm, cute. That's not a bad idea at all. It would be nice to get
> some kind of real benchmark on both PCID and !PCID. If nothing else,
> I would expect the threshold (16 in your patch) to want to be lower
> on PCID systems.

That depends on how well we manage to get rid of
the cpumask manipulation overhead. On the PCID
system where we first found this issue, the atomic
accesses to the mm_cpumask took about 4x as much
CPU time as the TLB invalidation itself.

That kinda limits how much cheaper TLB flushes
actually help :)

I agree this code should get some testing.
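
For reference, the counter heuristic quoted above can be modelled
in a few lines of plain C. This is only a user-space sketch of the
description, not the patch itself: the struct, the function names
and the LAZY_TLB_BACKOFF constant are all made up for illustration,
and real code would keep this state in the per-CPU tlbstate.

```c
#include <stdbool.h>

/* Hypothetical value, taken from the "set to 16" in the description. */
#define LAZY_TLB_BACKOFF 16

/* Hypothetical per-CPU state; stands in for fields in the tlbstate struct. */
struct cpu_lazy_state {
    bool in_lazy_tlb;      /* this CPU is currently in lazy TLB mode */
    unsigned int backoff;  /* switches left before lazy mode is allowed again */
};

/* A remote TLB shootdown IPI arrived on this CPU. */
static void on_shootdown_ipi(struct cpu_lazy_state *s)
{
    /* Only an IPI received while lazy counts against lazy mode. */
    if (s->in_lazy_tlb)
        s->backoff = LAZY_TLB_BACKOFF;
}

/*
 * Context switch to a kernel thread: return true if this CPU should
 * enter lazy TLB mode, false if it should switch to init_mm instead.
 */
static bool enter_kernel_thread(struct cpu_lazy_state *s)
{
    if (s->backoff == 0) {
        /* No recent shootdown IPIs while lazy: allow lazy TLB mode. */
        s->in_lazy_tlb = true;
        return true;
    }

    /* Skipping lazy TLB mode this time; count it down. */
    s->backoff--;
    s->in_lazy_tlb = false;
    return false;
}
```

In this model, one IPI taken in lazy mode makes the next 16 switches
to a kernel thread skip lazy mode, after which the CPU tries lazy
mode again.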

--
All Rights Reversed.
