Re: [PATCH tip/core/rcu 4/5] sys_membarrier: Add expedited option

From: Peter Zijlstra
Date: Thu Jul 27 2017 - 04:30:39 EST


On Wed, Jul 26, 2017 at 08:41:10AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 26, 2017 at 09:41:28AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 25, 2017 at 04:59:36PM -0700, Paul E. McKenney wrote:

> > Sure, but SCHED_OTHER auto throttles in that if there's anything else to
> > run, you get to wait. So you can't generate an IPI storm with it. Also,
> > again, we can be limited to a subset of CPUs.
>
> OK, what is its auto-throttle policy? One round of IPIs per jiffy or
> some such?

No. It's called wakeup latency :-) Your SCHED_OTHER task will not get to
insta-run all the time. If there are other tasks already running, we'll
not IPI unless it should preempt.

If it's idle, nobody cares..

> Does this auto-throttling also apply if the user is running a CPU-bound
> SCHED_BATCH or SCHED_IDLE task on each CPU, and periodically waking up
> one of a large group of SCHED_OTHER tasks, where the SCHED_OTHER tasks
> immediately sleep upon being awakened?

SCHED_BATCH is even more likely to suffer wakeup latency since it will
never preempt anything.

> OK, and the rq->curr assignment is in common code, correct? Does this
> allow the IPI-only-requesting-process approach to live entirely within
> common code?

That is the idea.

> The 2010 email thread ended up with sys_membarrier() acquiring the
> runqueue lock for each CPU,

Yes, that's something I'm not happy with. Machine wide banging of that
lock will be a performance no-no.

> because doing otherwise meant adding code to the scheduler fastpath.

And that's obviously another thing I'm not happy with either.

> Don't we still need to do this?
>
> https://marc.info/?l=linux-kernel&m=126341138408407&w=2
> https://marc.info/?l=linux-kernel&m=126349766324224&w=2

I don't know.. those seem focussed on mm_cpumask() and we can't use that
per Will's email.


So I think we need to think anew on this, start from the ground up.

What is missing for this:

static void ipi_mb(void *info)
{
	smp_mb(); // IPIs should be serializing but paranoid
}


sys_membarrier()
{
	smp_mb(); // because sysenter isn't an unconditional mb

	for_each_online_cpu(cpu) {
		struct task_struct *p;

		rcu_read_lock();
		p = task_rcu_dereference(&cpu_curr(cpu));
		if (p && p->mm == current->mm)
			__cpumask_set_cpu(cpu, cpus);
		rcu_read_unlock();
	}

	on_each_cpu_mask(cpus, ipi_mb, NULL, 1); // does local smp_mb() too
}

VS

__schedule()
{
	spin_lock(&rq->lock);
	smp_mb__after_spinlock(); // really full mb implied

	/* lots */

	if (likely(prev != next)) {

		rq->curr = next;

		context_switch() {
			switch_mm();
			switch_to();
			// neither need imply a barrier

			spin_unlock(&rq->lock);
		}
	}
}

So I think we need either switch_mm() or switch_to() to imply a full
barrier for this to work, otherwise we get:

	CPU0				CPU1


	lock rq->lock
	mb

	rq->curr = A

	unlock rq->lock

	lock rq->lock
	mb
					sys_membarrier()

					mb

					for_each_online_cpu()
					  p = A
					  // no match no IPI

					mb
	rq->curr = B

	unlock rq->lock


And that's bad, because now CPU0 doesn't have an MB happening _after_
sys_membarrier() if B matches.


So without doing an audit, PPC and Alpha are the only ones I know of that
don't have a barrier in either switch_*().

x86 obviously has barriers all over the place, arm has a super duper
heavy barrier in switch_to().

One option might be to resurrect spin_unlock_wait(); although using it
here is really ugly too, it would avoid thrashing the rq->lock.

I think it'd end up having to look like:

	rq = cpu_rq(cpu);
again:
	rcu_read_lock();
	p = task_rcu_dereference(&rq->curr);
	if (p) {
		raw_spin_unlock_wait(&rq->lock);
		q = task_rcu_dereference(&rq->curr);
		if (q != p) {
			rcu_read_unlock();
			goto again;
		}
	}
	...

which is just about as horrible as it looks.