Re: [RFC PATCH v2] membarrier: expedited private command

From: Peter Zijlstra
Date: Tue Aug 01 2017 - 07:02:56 EST


On Tue, Aug 01, 2017 at 08:39:28PM +1000, Nicholas Piggin wrote:
> On Tue, 1 Aug 2017 12:22:03 +0200
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> > But you're bouncing the rq->lock around the system at fairly high rates.
> > For big enough systems this is enough to severely hurt things.
>
> If you already have scheduling access on those CPUs to set mm cpumask,
> then you can do worse I guess.

Ah, so currently, because of ARM64, we have to use for_each_online_cpu(),
and then we're not limited to any affinity mask and will wreak havoc
across partition boundaries.

If ARM64 starts setting bits in mm_cpumask().. then yes.
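
For concreteness, the two walks would look roughly like the below. This
is a sketch only; ipi_mb() stands in for whatever handler issues the
smp_mb() on the target CPU, and the peek at the remote ->curr is racy as
written:

	/* Global walk, which ARM64 forces on us today: */
	int cpu;

	rcu_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p = READ_ONCE(cpu_rq(cpu)->curr);

		if (p && p->mm == current->mm)
			smp_call_function_single(cpu, ipi_mb, NULL, 1);
	}
	rcu_read_unlock();

	/* vs. a mask-bounded walk, if mm_cpumask() were populated: */
	on_each_cpu_mask(mm_cpumask(current->mm), ipi_mb, NULL, 1);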

> > > > Yes, it might work for 'normal' workloads, but the interference
> > > > potential is just too big.
> > >
> > > Well it's good to be concerned about it. I do see your point. Although
> > > I don't know if it's all that complicated to use unprivileged ops to
> > > badly hurt QoS on most systems already :)
> >
> > If so, we should look at fixing that, not make it worse.
>
> Point is I don't think this *is* worse. It's not too difficult to acquire
> per-cpu locks and cachelines, and to send IPIs to other CPUs, if you are
> mounting a malicious DoS. Particularly not when you can schedule tasks
> around those CPUs.

So we try not to take remote rq->locks for wakeups when those locks are
outside of our cache domain.
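
That gate lives in ttwu_queue() in kernel/sched/core.c; condensed (some
locking detail elided), it is roughly:

	static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
	{
		struct rq *rq = cpu_rq(cpu);

		/*
		 * No shared cache with the target CPU: queue the task on
		 * the remote wake_list and kick it with an IPI rather than
		 * taking the remote rq->lock from here.
		 */
		if (sched_feat(TTWU_QUEUE) &&
		    !cpus_share_cache(smp_processor_id(), cpu)) {
			ttwu_queue_remote(p, cpu, wake_flags);
			return;
		}

		raw_spin_lock(&rq->lock);
		ttwu_do_activate(rq, p, wake_flags);
		raw_spin_unlock(&rq->lock);
	}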

> > In either case, the mm_cpumask() will be bounded by the set of CPUs the
> > threads are allowed to run on and will not interfere with the rest of
> > the system.
>
> Yep. Powerpc at least will use its mm_cpumask for iterating here
> (regardless of how exactly we get the ordering right). Even though
> at the moment it is a lazy mask, it's better than a global walk.

Agreed.
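
(The "lazy" part, for anyone following along: powerpc's switch_mm() only
ever sets bits, something like the sketch below, and nothing clears them
short of mm teardown. So the mask over-approximates where the mm has run,
but it is still bounded by the affinity the threads actually had.)

	static inline void lazy_mask_switch_in(struct mm_struct *next)
	{
		int cpu = smp_processor_id();

		/* Set on first switch-in of this mm on this CPU ... */
		if (!cpumask_test_cpu(cpu, mm_cpumask(next)))
			cpumask_set_cpu(cpu, mm_cpumask(next));

		/* ... and never cleared until the mm itself goes away. */
	}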

> > As to scheduler IPIs, those are limited to the CPUs the user is limited
> > to and are rate limited by the wakeup-latency of the tasks. After all,
> > all the time a task is runnable but not running, wakeups are no-ops.
>
> That's on the order of millions per second per core though, isn't it?

Thousands, assuming another process is actually running. We'll not
always win the wakeup-preempt race. And once we're queued, no more
wakeups until we get to run.

> > Trouble is of course, that not everybody even sets a single bit in
> > mm_cpumask() and those that never clear bits will end up with a fairly
> > wide mask, still interfering with work that isn't hard partitioned.
>
> Right, I just don't see what real problem this opens up that you don't
> already have when you are not hard partitioned, therefore it doesn't
> make sense to add a slowdown to the context switch fastpath to close
> one hole in the sieve.
>
> Completely recognizing that other architectures can do it without
> taking rq lock at all and will not be forced to do so.

If we can limit this to hard partitioned, that would be good indeed.

I'm just trying to avoid having two implementations of this thing. At the
same time, I very much understand your reluctance to add this barrier.

In any case, supposing we can do that intent thing. How horrible would
something like:


context_switch()
	if (unlikely(mm->needs_barrier))
		smp_mb__after_unlock_lock();


be? We only need the extra barrier when we switch _into_ mm's that care
about sys_membarrier() in the first place. At which point they pay the
price.
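
To make the intent thing concrete, the registration side could be as
simple as the sketch below. The needs_barrier field and the registration
command itself are made up here, assuming a process registers once before
it ever calls the expedited command:

	static int membarrier_register_private_expedited(void)
	{
		struct mm_struct *mm = current->mm;

		if (READ_ONCE(mm->needs_barrier))
			return 0;	/* already paying the price */

		WRITE_ONCE(mm->needs_barrier, true);

		/*
		 * Wait for all CPUs currently running this mm to pass
		 * through a context switch, so that from here on every
		 * switch into the mm observes the flag and issues the
		 * barrier.
		 */
		synchronize_sched();
		return 0;
	}

That way only mm's that registered ever see the extra
smp_mb__after_unlock_lock() in context_switch(); everyone else keeps the
current fastpath.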