Re: [RFC PATCH v2] membarrier: expedited private command

From: Peter Zijlstra
Date: Tue Aug 01 2017 - 06:22:19 EST


On Tue, Aug 01, 2017 at 07:57:17PM +1000, Nicholas Piggin wrote:
> On Tue, 1 Aug 2017 10:12:30 +0200
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > On Tue, Aug 01, 2017 at 12:00:47PM +1000, Nicholas Piggin wrote:
> > > Thanks for this, I'll take a look. This should be a good start as a stress
> > > test, but I'd also be interested in some application. The reason being that
> > > for example using runqueue locks may give reasonable maximum throughput
> > > numbers, but could cause some latency or slowdown when it's used in more
> > > realistic scenario.
> >
> > Given this is an unprivileged interface we have to consider DoS and
> > other such lovely things. And since we cannot use mm_cpumask() we're
> > stuck with for_each_online_cpu().
>
> I think we *can* make that part of it per-arch, as well as whether
> or not to use runqueue locks. It's kind of crazy not to use it when
> it's available. Avoiding CPUs you aren't allowed to run on is also
> nice for compartmentalization.

Right, but a wee bit hard to do since most of the affinity stuff is per
task, not per process. And that would of course only help with work that
is hard partitioned.

> > Combined that means that using rq->lock is completely out of the
> > question, some numbnut doing 'for (;;) sys_membarrier();' can
> > completely wreck the system.
>
> In what way would it wreck the system? It's not holding the lock over
> the IPI, only to inspect the rq->curr->mm briefly.

But you're bouncing the rq->lock around the system at fairly high rates.
For big enough systems this is enough to severely hurt things.

> > Yes, it might work for 'normal' workloads, but the interference
> > potential is just too big.
>
> Well it's good to be concerned about it. I do see your point. Although
> I don't know if it's all that complicated to use unprivileged ops to
> badly hurt QoS on most systems already :)

If so, we should look at fixing that, not make it worse.

> If mm cpumask is used, I think it's okay. You can cause quite similar
> kind of iteration over CPUs and lots of IPIs, tlb flushes, etc using
> munmap/mprotect/etc, or context switch IPIs, etc. Are we reaching the
> stage where we're controlling those kinds of ops in terms of impact
> to the rest of the system?

So x86 has a tight mm_cpumask(), we only broadcast TLB invalidate IPIs
to those CPUs actually running threads of our process (or very
recently). So while there can be the sporadic stray IPI for a CPU that
recently ran a thread of the target process, it will not get another one
until it switches back into the process.

On machines that need manual TLB broadcasts and don't keep a tight mask,
yes you can interfere at will, but if they care they can fix by
tightening the mask.

In either case, the mm_cpumask() will be bounded by the set of CPUs the
threads are allowed to run on and will not interfere with the rest of
the system.

As to scheduler IPIs, those are limited to the CPUs the user is limited
to and are rate limited by the wakeup-latency of the tasks. After all,
all the time a task is runnable but not running, wakeups are no-ops.

Trouble is of course, that not everybody even sets a single bit in
mm_cpumask() and those that never clear bits will end up with a fairly
wide mask, still interfering with work that isn't hard partitioned.