Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm
From: Andy Lutomirski
Date: Sat Jun 02 2018 - 16:14:58 EST
On Fri, Jun 1, 2018 at 10:04 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Fri, 2018-06-01 at 20:35 -0700, Andy Lutomirski wrote:
> > On Fri, Jun 1, 2018 at 3:13 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> > >
> > > On Fri, 1 Jun 2018 14:21:58 -0700
> > > Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> > >
> > > > Hmm. I wonder if there's a more clever data structure than a bitmap
> > > > that we could be using here. Each CPU only ever needs to be in one
> > > > mm's cpumask, and each cpu only ever changes its own state in the
> > > > bitmask. And writes are much less common than reads for most
> > > > workloads.
> > >
> > > It would be easy enough to add an mm_struct pointer to the
> > > per-cpu tlbstate struct, and iterate over those.
> > >
> > > However, that would be an orthogonal change to optimizing
> > > lazy TLB mode.
> > >
> > > Does the (untested) patch below make sense as a potential
> > > improvement to the lazy TLB heuristic?
> > >
> > > ---8<---
> > > Subject: x86,tlb: workload dependent per CPU lazy TLB switch
> > >
> > > Lazy TLB mode is a tradeoff between flushing the TLB and touching
> > > the mm_cpumask(&init_mm) at context switch time, versus potentially
> > > incurring a remote TLB flush IPI while in lazy TLB mode.
> > >
> > > Whether this pays off is likely to be workload dependent more than
> > > anything else. However, the current heuristic keys off hardware
> > > type.
> > >
> > > This patch changes the lazy TLB mode heuristic to a dynamic, per-CPU
> > > decision, dependent on whether we recently received a remote TLB
> > > shootdown while in lazy TLB mode.
> > >
> > > This is a very simple heuristic. When a CPU receives a remote TLB
> > > shootdown IPI while in lazy TLB mode, a counter in the same cache
> > > line is set to 16. Every time we skip lazy TLB mode, the counter
> > > is decremented.
> > >
> > > While the counter is zero (no recent TLB flush IPIs), allow lazy
> > > TLB mode.
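
The counter logic described above can be sketched in plain C as follows.
This is only a userspace illustration, not the code from the patch: the
struct, its field names, and the function names are hypothetical; only the
value 16 and the set/decrement/allow-at-zero rules come from the
description.

```c
#include <stdbool.h>

/* Penalty value taken from the patch description. */
#define LAZY_TLB_PENALTY 16

/* Hypothetical per-CPU state; names are illustrative only. */
struct lazy_tlb_state {
	bool is_lazy;       /* CPU is currently in lazy TLB mode */
	int  flush_penalty; /* > 0: a shootdown hit this CPU in lazy mode recently */
};

/* A remote TLB shootdown IPI arrived on this CPU. */
static void on_remote_flush_ipi(struct lazy_tlb_state *s)
{
	/* Only a shootdown received while lazy arms the counter. */
	if (s->is_lazy)
		s->flush_penalty = LAZY_TLB_PENALTY;
}

/*
 * Decide at context switch time whether to go lazy.
 * Returns true if lazy TLB mode is used, false if it is skipped.
 */
static bool enter_lazy_tlb(struct lazy_tlb_state *s)
{
	if (s->flush_penalty > 0) {
		/* Recent shootdown: skip lazy mode and count down. */
		s->flush_penalty--;
		s->is_lazy = false;
		return false;
	}

	/* No recent shootdown IPIs: allow lazy TLB mode. */
	s->is_lazy = true;
	return true;
}
```

With this model, one shootdown received in lazy mode makes the next 16
switches skip lazy mode, after which lazy mode is allowed again.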
> >
> > Hmm, cute. That's not a bad idea at all. It would be nice to get
> > some kind of real benchmark on both PCID and !PCID. If nothing else,
> > I would expect the threshold (16 in your patch) to want to be lower on
> > PCID systems.
>
> That depends on how well we manage to get rid of
> the cpumask manipulation overhead. On the PCID
> system we first found this issue, the atomic
> accesses to the mm_cpumask took about 4x as much
> CPU time as the TLB invalidation itself.
>
> That kinda limits how much cheaper TLB flushes
> actually help :)
>
> I agree this code should get some testing.
>
Just to check: in the workload where you're seeing this problem, are
you using an mm with many threads? I would imagine that, if you only
have one or two threads, the bit operations aren't so bad.

I wonder if just having a whole cacheline per node for the cpumask
would solve the problem. I don't love the idea of having every flush
operation scan cpu_tlbstate for every single CPU -- we'll end up with
nasty contention on the cpu_tlbstate cache lines on some workloads.
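
To make the per-node idea concrete, here is a rough userspace sketch using
C11 atomics. It is not kernel code and not a proposed patch. The type and
function names, the 64-byte line size, the node count, and the assumption
that CPUs are numbered contiguously within a node are all made up for
illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes, for illustration only. */
#define CACHELINE     64
#define MAX_NODES     4
#define CPUS_PER_NODE 64

/* One mask word per node, padded out to its own cache line. */
struct node_mask {
	_Alignas(CACHELINE) _Atomic uint64_t bits;
};

/* Hypothetical per-mm cpumask split by node. */
struct mm_node_cpumask {
	struct node_mask node[MAX_NODES];
};

/* Bit for this CPU within its node's word. */
static uint64_t cpu_bit(unsigned int cpu)
{
	return UINT64_C(1) << (cpu % CPUS_PER_NODE);
}

/* Set this CPU's bit; only the owning node's line is written. */
static void mm_set_cpu(struct mm_node_cpumask *m, unsigned int cpu)
{
	atomic_fetch_or(&m->node[cpu / CPUS_PER_NODE].bits, cpu_bit(cpu));
}

/* Clear this CPU's bit; again touches only the owning node's line. */
static void mm_clear_cpu(struct mm_node_cpumask *m, unsigned int cpu)
{
	atomic_fetch_and(&m->node[cpu / CPUS_PER_NODE].bits, ~cpu_bit(cpu));
}

/* Test whether a CPU is in the mask. */
static bool mm_test_cpu(struct mm_node_cpumask *m, unsigned int cpu)
{
	return atomic_load(&m->node[cpu / CPUS_PER_NODE].bits) & cpu_bit(cpu);
}
```

The point of the layout is that a context switch on one node never writes
a line that CPUs on another node are updating. A flusher still has to read
every node's line, so this only helps the write side.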