Re: [PATCH tip/core/rcu 4/5] sys_membarrier: Add expedited option

From: Paul E. McKenney
Date: Thu Jul 27 2017 - 09:08:27 EST


On Thu, Jul 27, 2017 at 10:30:03AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 26, 2017 at 08:41:10AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 26, 2017 at 09:41:28AM +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 25, 2017 at 04:59:36PM -0700, Paul E. McKenney wrote:
>
> > > Sure, but SCHED_OTHER auto throttles in that if there's anything else to
> > > run, you get to wait. So you can't generate an IPI storm with it. Also,
> > > again, we can be limited to a subset of CPUs.
> >
> > OK, what is its auto-throttle policy? One round of IPIs per jiffy or
> > some such?
>
> No. It's called wakeup latency :-) Your SCHED_OTHER task will not get to
> insta-run all the time. If there are other tasks already running, we'll
> not IPI unless it should preempt.
>
> If it's idle, nobody cares..

So it does IPI immediately sometimes.

> > Does this auto-throttling also apply if the user is running a CPU-bound
> > SCHED_BATCH or SCHED_IDLE task on each CPU, and periodically waking up
> > one of a large group of SCHED_OTHER tasks, where the SCHED_OTHER tasks
> > immediately sleep upon being awakened?
>
> SCHED_BATCH is even more likely to suffer wakeup latency since it will
> never preempt anything.

Ahem. In this scenario, SCHED_BATCH is already running on the CPU in
question, and a SCHED_OTHER task is awakened from some other CPU.

Do we IPI in that case?

> > OK, and the rq->curr assignment is in common code, correct? Does this
> > allow the IPI-only-requesting-process approach to live entirely within
> > common code?
>
> That is the idea.
>
> > The 2010 email thread ended up with sys_membarrier() acquiring the
> > runqueue lock for each CPU,
>
> Yes, that's something I'm not happy with. Machine wide banging of that
> lock will be a performance no-no.

Regardless of whether we end up acquiring runqueue locks, sending IPIs,
or carrying out some other distasteful operation, we should also be able
to throttle and batch those operations to at least some extent.
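
Just to make the batching half of that concrete, here is a completely
untested sketch (every name in it is invented, and it would of course
need the usual memory-ordering audit) that lets concurrent expedited
callers share a single IPI round, much like the old
synchronize_sched_expedited() funnel did:

static DEFINE_MUTEX(membarrier_mutex);
static unsigned long membarrier_started;        /* IPI rounds begun. */
static unsigned long membarrier_done;           /* IPI rounds completed. */

static void membarrier_expedited_batched(void)
{
        unsigned long snap;

        /* Caller is assumed to have already done smp_mb() on entry. */
        snap = READ_ONCE(membarrier_started);

        mutex_lock(&membarrier_mutex);
        if (snap < READ_ONCE(membarrier_done)) {
                /*
                 * A full IPI round started and completed after we
                 * sampled membarrier_started, so it did our work for
                 * us.  (Counter wrap is ignored in this sketch.)
                 */
                mutex_unlock(&membarrier_mutex);
                return;
        }
        WRITE_ONCE(membarrier_started, membarrier_started + 1);
        smp_mb();                       /* Order count update before IPIs. */
        membarrier_send_ipis();         /* Whatever mechanism we settle on. */
        smp_mb();                       /* Order IPIs before count update. */
        WRITE_ONCE(membarrier_done, membarrier_done + 1);
        mutex_unlock(&membarrier_mutex);
}

Throttling could then be a simple ratelimit check ahead of
membarrier_send_ipis(), but the main point is that concurrent callers
need not each blast every CPU.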

> > because doing otherwise meant adding code to the scheduler fastpath.
>
> And that's obviously another thing I'm not happy with either.

Nor should you or anyone be.

> > Don't we still need to do this?
> >
> > https://marc.info/?l=linux-kernel&m=126341138408407&w=2
> > https://marc.info/?l=linux-kernel&m=126349766324224&w=2
>
> I don't know.. those seem focussed on mm_cpumask() and we can't use that
> per Will's email.
>
> So I think we need to think anew on this, start from the ground up.

Probably from several points in the ground, but OK...

> What is missing for this:
>
> static void ipi_mb(void *info)
> {
>         smp_mb();       // IPIs should be serializing but paranoid
> }
>
>
> sys_membarrier()
> {
>         smp_mb();       // because sysenter isn't an unconditional mb
>
>         for_each_online_cpu(cpu) {
>                 struct task_struct *p;
>
>                 rcu_read_lock();
>                 p = task_rcu_dereference(&cpu_curr(cpu));
>                 if (p && p->mm == current->mm)
>                         cpumask_set_cpu(cpu, cpus);
>                 rcu_read_unlock();
>         }
>
>         on_each_cpu_mask(cpus, ipi_mb, NULL, 1); // does local smp_mb() too
> }
>
> VS
>
> __schedule()
> {
>         spin_lock(&rq->lock);
>         smp_mb__after_spinlock();       // really full mb implied
>
>         /* lots */
>
>         if (likely(prev != next)) {
>
>                 rq->curr = next;
>
>                 context_switch() {
>                         switch_mm();
>                         switch_to();
>                         // neither need imply a barrier
>
>                         spin_unlock(&rq->lock);
>                 }
>         }
> }
>
>
>
>
> So I think we need either switch_mm() or switch_to() to imply a full
> barrier for this to work, otherwise we get:
>
>         CPU0                                    CPU1
>
>
>                                                 lock rq->lock
>                                                 mb
>
>                                                 rq->curr = A
>
>                                                 unlock rq->lock
>
>                                                 lock rq->lock
>                                                 mb
>
>         sys_membarrier()
>
>         mb
>
>         for_each_online_cpu()
>           p = A
>           // no match no IPI
>
>         mb
>                                                 rq->curr = B
>
>                                                 unlock rq->lock
>
>
> And that's bad, because now CPU0 doesn't have an MB happening _after_
> sys_membarrier() if B matches.

Yes, this looks somewhat similar to the scenario that Mathieu pointed out
back in 2010: https://marc.info/?l=linux-kernel&m=126349766324224&w=2

> So without audit, I only know of PPC and Alpha not having a barrier in
> either switch_*().
>
> x86 obviously has barriers all over the place, arm has a super duper
> heavy barrier in switch_to().

Agreed. If we are going to rely on ->mm, we need ordering on the
assignment to it.
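
To make that concrete, the sort of ordering I have in mind looks like
the following (purely illustrative, not a patch, and the barrier could
just as easily be supplied by an architecture's switch_mm() or
switch_to()):

        /* In __schedule(), after the runqueue update: */
        rq->curr = next;
        /*
         * Pairs with the smp_mb() at the start of sys_membarrier():
         * either sys_membarrier() observes the updated ->curr (and
         * thus ->mm) and IPIs this CPU, or the incoming task's
         * user-space accesses are ordered after the membarrier
         * caller's pre-syscall accesses.
         */
        smp_mb();       /* x86 gets this for free; PPC and Alpha do not. */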

> One option might be to resurrect spin_unlock_wait(), although to use
> that here is really ugly too, but it would avoid thrashing the
> rq->lock.
>
> I think it'd end up having to look like:
>
>         rq = cpu_rq(cpu);
> again:
>         rcu_read_lock();
>         p = task_rcu_dereference(&rq->curr);
>         if (p) {
>                 raw_spin_unlock_wait(&rq->lock);
>                 q = task_rcu_dereference(&rq->curr);
>                 if (q != p) {
>                         rcu_read_unlock();
>                         goto again;
>                 }
>         }
>         ...
>
> which is just about as horrible as it looks.

It does indeed look a bit suboptimal.

Thanx, Paul