Re: [PATCH tip/core/rcu 4/5] sys_membarrier: Add expedited option
From: Peter Zijlstra
Date: Thu Jul 27 2017 - 04:30:39 EST
On Wed, Jul 26, 2017 at 08:41:10AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 26, 2017 at 09:41:28AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 25, 2017 at 04:59:36PM -0700, Paul E. McKenney wrote:
> > Sure, but SCHED_OTHER auto throttles in that if there's anything else to
> > run, you get to wait. So you can't generate an IPI storm with it. Also,
> > again, we can be limited to a subset of CPUs.
>
> OK, what is its auto-throttle policy? One round of IPIs per jiffy or
> some such?
No. It's called wakeup latency :-) Your SCHED_OTHER task will not get to
insta-run all the time. If there are other tasks already running, we'll
not IPI unless it should preempt.
If it's idle, nobody cares..
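
Roughly (hand-wavy sketch, names as in kernel/sched/core.c, details
elided), the wakeup path only kicks a remote CPU when the woken task
should preempt whatever is running there:

	wakeup(p)
	{
		rq = task_rq(p);
		check_preempt_curr(rq, p, wake_flags); // calls resched_curr() only if p should preempt
	}

	resched_curr(rq)
	{
		if (test_tsk_need_resched(rq->curr))
			return; // resched already pending, no new IPI

		set_tsk_need_resched(rq->curr);

		if (cpu_of(rq) == smp_processor_id())
			return; // local CPU, no IPI at all

		smp_send_reschedule(cpu_of(rq)); // the only IPI, and only on preemption
	}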
> Does this auto-throttling also apply if the user is running a CPU-bound
> SCHED_BATCH or SCHED_IDLE task on each CPU, and periodically waking up
> one of a large group of SCHED_OTHER tasks, where the SCHED_OTHER tasks
> immediately sleep upon being awakened?
SCHED_BATCH is even more likely to suffer wakeup latency since it will
never preempt anything.
> OK, and the rq->curr assignment is in common code, correct? Does this
> allow the IPI-only-requesting-process approach to live entirely within
> common code?
That is the idea.
> The 2010 email thread ended up with sys_membarrier() acquiring the
> runqueue lock for each CPU,
Yes, that's something I'm not happy with. Machine-wide banging of that
lock will be a performance no-no.
> because doing otherwise meant adding code to the scheduler fastpath.
And that's obviously another thing I'm not happy with either.
> Don't we still need to do this?
>
> https://marc.info/?l=linux-kernel&m=126341138408407&w=2
> https://marc.info/?l=linux-kernel&m=126349766324224&w=2
I don't know.. those seem focussed on mm_cpumask() and we can't use that
per Will's email.
So I think we need to think about this anew and start from the ground up.
What is missing for this:
static void ipi_mb(void *info)
{
	smp_mb(); // IPIs should be serializing but paranoid
}

sys_membarrier()
{
	cpumask_var_t cpus; // allocation/zeroing/cleanup elided
	int cpu;

	smp_mb(); // because sysenter isn't an unconditional mb

	for_each_online_cpu(cpu) {
		struct task_struct *p;

		rcu_read_lock();
		p = task_rcu_dereference(&cpu_curr(cpu));
		if (p && p->mm == current->mm)
			__cpumask_set_cpu(cpu, cpus);
		rcu_read_unlock();
	}

	on_each_cpu_mask(cpus, ipi_mb, NULL, 1); // does local smp_mb() too
}
VS
__schedule()
{
	spin_lock(&rq->lock);
	smp_mb__after_spinlock(); // really full mb implied

	/* lots */

	if (likely(prev != next)) {
		rq->curr = next;
		context_switch() {
			switch_mm();
			switch_to();
			// neither need imply a barrier
			spin_unlock(&rq->lock);
		}
	}
}
So I think we need either switch_mm() or switch_to() to imply a full
barrier for this to work, otherwise we get:
  CPU0                          CPU1

  lock rq->lock
  mb
  rq->curr = A
  unlock rq->lock

  lock rq->lock
  mb
                                sys_membarrier()
                                  mb
                                  for_each_online_cpu()
                                    p = A
                                    // no match no IPI
                                  mb
  rq->curr = B
  unlock rq->lock
And that's bad, because now, if B matches, CPU0 doesn't have an MB
happening _after_ sys_membarrier() sampled rq->curr, and no IPI was sent
either.
So, without doing an audit, I only know of PPC and Alpha as not having a
barrier in either switch_*().
x86 obviously has barriers all over the place, arm has a super duper
heavy barrier in switch_to().
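
If we were willing to eat the cost unconditionally, the simplest
illustration (sketch only, not a proposal) would be a full barrier in the
switch path itself:

	__schedule()
	{
		spin_lock(&rq->lock);
		smp_mb__after_spinlock();
		/* lots */
		if (likely(prev != next)) {
			rq->curr = next;
			context_switch() {
				switch_mm();
				smp_mb(); // NEW: pairs with the mb in sys_membarrier()
				switch_to();
				spin_unlock(&rq->lock);
			}
		}
	}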
One option might be to resurrect spin_unlock_wait(); using it here is
really ugly too, but it would avoid thrashing the rq->lock.
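
(For reference, spin_unlock_wait() was, roughly, "spin until the lock is
not held" with some ordering bolted on; something like the below, modulo
the memory-ordering subtleties that got it removed:)

	raw_spin_unlock_wait(lock)
	{
		while (raw_spin_is_locked(lock))
			cpu_relax();
		smp_acquire__after_ctrl_dep(); // order subsequent loads after the observed unlock
	}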
I think it'd end up having to look like:
	rq = cpu_rq(cpu);

again:
	rcu_read_lock();
	p = task_rcu_dereference(&rq->curr);
	if (p) {
		// wait for any in-flight switch to drop rq->lock
		raw_spin_unlock_wait(&rq->lock);
		q = task_rcu_dereference(&rq->curr);
		if (q != p) {
			// rq->curr changed under us; start over
			rcu_read_unlock();
			goto again;
		}
	}
	...
which is just about as horrible as it looks.