Re: [RFC PATCH v2] membarrier: expedited private command

From: Nicholas Piggin
Date: Tue Aug 01 2017 - 06:39:54 EST


On Tue, 1 Aug 2017 12:22:03 +0200
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Tue, Aug 01, 2017 at 07:57:17PM +1000, Nicholas Piggin wrote:
> > On Tue, 1 Aug 2017 10:12:30 +0200
> > Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > > On Tue, Aug 01, 2017 at 12:00:47PM +1000, Nicholas Piggin wrote:
> > > > Thanks for this, I'll take a look. This should be a good start as a stress
> > > > test, but I'd also be interested in some application. The reason being that
> > > > for example using runqueue locks may give reasonable maximum throughput
> > > > numbers, but could cause some latency or slowdown when it's used in more
> > > > realistic scenario.
> > >
> > > Given this is an unprivileged interface we have to consider DoS and
> > > other such lovely things. And since we cannot use mm_cpumask() we're
> > > stuck with for_each_online_cpu().
> >
> > I think we *can* make that part of it per-arch, as well as whether
> > or not to use runqueue locks. It's kind of crazy not to use it when
> > it's available. Avoiding CPUs you aren't allowed to run on is also
> > nice for compartmentalization.
>
> Right, but a wee bit hard to do since most of the affinity stuff is per
> task, not per process. And that would of course only help with work that
> is hard partitioned.
>
> > > Combined that means that using rq->lock is completely out of the
> > > question, some numbnut doing 'for (;;) sys_membarrier();' can
> > > completely wreck the system.
> >
> > In what way would it wreck the system? It's not holding the lock over
> > the IPI, only to inspect the rq->curr->mm briefly.
>
> But you're bouncing the rq->lock around the system at fairly high rates.
> For big enough systems this is enough to severely hurt things.

If you already have scheduling access on those CPUs to set mm cpumask,
then you can do worse I guess.

>
> > > Yes, it might work for 'normal' workloads, but the interference
> > > potential is just too big.
> >
> > Well it's good to be concerned about it. I do see your point. Although
> > I don't know if it's all that complicated to use unprivileged ops to
> > badly hurt QoS on most systems already :)
>
> If so, we should look at fixing that, not make it worse.

Point is I don't think this *is* worse. It's not too difficult to acquire
per-cpu locks and cachelines and IPIs to other CPUs if you are a malicious
DoS. Particularly not when you can scheudle tasks around those CPUs.

> > If mm cpumask is used, I think it's okay. You can cause quite similar
> > kind of iteration over CPUs and lots of IPIs, tlb flushes, etc using
> > munmap/mprotect/etc, or context switch IPIs, etc. Are we reaching the
> > stage where we're controlling those kinds of ops in terms of impact
> > to the rest of the system?
>
> So x86 has a tight mm_cpumask(), we only broadcast TLB invalidate IPIs
> to those CPUs actually running threads of our process (or very
> recently). So while there can be the sporadic stray IPI for a CPU that
> recently ran a thread of the target process, it will not get another one
> until it switches back into the process.
>
> On machines that need manual TLB broadcasts and don't keep a tight mask,
> yes you can interfere at will, but if they care they can fix by
> tightening the mask.
>
> In either case, the mm_cpumask() will be bounded by the set of CPUs the
> threads are allowed to run on and will not interfere with the rest of
> the system.

Yep. Powerpc at least will use its mm_cpumask for iterating here
(regardless of how exactly we get the ordering right). Even though
at the moment it is a lazy mask, it's better than a global walk.

> As to scheduler IPIs, those are limited to the CPUs the user is limited
> to and are rate limited by the wakeup-latency of the tasks. After all,
> all the time a task is runnable but not running, wakeups are no-ops.

That's on the order of millions per second per core though, isn't it?

>
> Trouble is of course, that not everybody even sets a single bit in
> mm_cpumask() and those that never clear bits will end up with a fairly
> wide mask, still interfering with work that isn't hard partitioned.

Right, I just don't see what real problem this opens up that you don't
already have when you are not hard partitioned, therefore it doesn't
make sense to add a slowdown to the context switch fastpath to close
one hole in the sieve.

Completely recognizing that other architectures can do it without
taking rq lock at all and will not be forced to do so.