Re: [PATCH] sched/rt: Avoid sending an IPI to a CPU already doing a push

From: Steven Rostedt
Date: Fri Jul 08 2016 - 11:12:28 EST


On Fri, 8 Jul 2016 16:51:53 +0200
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Fri, Jun 24, 2016 at 11:26:13AM -0400, Steven Rostedt wrote:
> > The IPI code now needs to call push_tasks() instead of just push_task() as
> > it will not be receiving an IPI for each CPU that is requesting a PULL.
>
> My brain just skidded on that, can you try again with a few more words?

Sure. One of those cases where I lose track of what I know versus what
I've actually explained, and don't describe enough.

The original logic expected each pull request to have its own IPI sent
out. Thus, a CPU would only do a single push, knowing that only one CPU
opened up (lowered its priority). It does, however, still send another
IPI out to the next CPU, regardless of whether a push happened, if a
higher-priority task is still waiting. That may be solved with a cpupri
of the next-highest waiters, but that's a bit too complex for now.

Anyway, the old code would have multiple CPUs sending IPIs out to CPUs,
one for each CPU that lowers its priority. The new code checks whether a
CPU is already processing an IPI (via the ipi counter) and will skip
sending to that CPU. That means that if two CPUs lower their priority,
and only one CPU has multiple RT tasks waiting, that CPU needs to do
multiple push_task() calls (which push_tasks() does).
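
If it helps, here's a toy user-space model of the new "skip CPUs already
pushing" behavior. The struct and field names are just stand-ins for
illustration, not the real scheduler code:

/*
 * Toy model of the ipi_count skip: each "rq" has an atomic ipi_count,
 * and a CPU looking for the next overloaded CPU to IPI skips any rq
 * that is already processing a push.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 4

struct toy_rq {
	atomic_int ipi_count;	/* nonzero: an IPI push is in flight */
	int has_pushable;	/* toy stand-in for "RT overloaded" */
};

static struct toy_rq rqs[NR_CPUS];

/* Pick the next overloaded CPU that is not already pushing */
static int toy_find_next_push_cpu(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!rqs[cpu].has_pushable)
			continue;
		/* If pushing was already started there, ignore it */
		if (atomic_load(&rqs[cpu].ipi_count))
			continue;
		return cpu;
	}
	return NR_CPUS;		/* nothing left to IPI */
}

int main(void)
{
	rqs[1].has_pushable = 1;
	rqs[3].has_pushable = 1;

	int cpu = toy_find_next_push_cpu();
	printf("first target: %d\n", cpu);		/* picks CPU 1 */
	atomic_fetch_add(&rqs[cpu].ipi_count, 1);	/* "IPI sent" */

	/* a second CPU lowering its prio now skips CPU 1 entirely */
	printf("second target: %d\n", toy_find_next_push_cpu());	/* picks CPU 3 */
	return 0;
}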

Better? I can update the change log.

>
>
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d5690b722691..165bcfdbd94b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -100,6 +100,7 @@ void init_rt_rq(struct rt_rq *rt_rq)
> > rt_rq->push_flags = 0;
> > rt_rq->push_cpu = nr_cpu_ids;
> > raw_spin_lock_init(&rt_rq->push_lock);
> > + atomic_set(&rt_rq->ipi_count, 0);
> > init_irq_work(&rt_rq->push_work, push_irq_work_func);
> > #endif
> > #endif /* CONFIG_SMP */
> > @@ -1917,6 +1918,10 @@ static int find_next_push_cpu(struct rq *rq)
> > break;
> > next_rq = cpu_rq(cpu);
> >
> > + /* If pushing was already started, ignore */
> > + if (atomic_read(&next_rq->rt.ipi_count))
> > + continue;
> > +
> > /* Make sure the next rq can push to this rq */
> > if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
> > break;
> > @@ -1955,6 +1960,7 @@ static void tell_cpu_to_push(struct rq *rq)
> > return;
> >
> > rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
> > + atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
> >
> > irq_work_queue_on(&rq->rt.push_work, cpu);
> > }
> > @@ -1974,11 +1980,12 @@ static void try_to_push_tasks(void *arg)
> >
> > rq = cpu_rq(this_cpu);
> > src_rq = rq_of_rt_rq(rt_rq);
> > + WARN_ON_ONCE(!atomic_read(&rq->rt.ipi_count));
> >
> > again:
> > if (has_pushable_tasks(rq)) {
> > raw_spin_lock(&rq->lock);
> > - push_rt_task(rq);
> > + push_rt_tasks(rq);
>
> Maybe as a comment around here?

Sure.

>
> > raw_spin_unlock(&rq->lock);
> > }
> >
> > @@ -2000,7 +2007,7 @@ again:
> > raw_spin_unlock(&rt_rq->push_lock);
> >
> > if (cpu >= nr_cpu_ids)
> > - return;
> > + goto out;
> >
> > /*
> > * It is possible that a restart caused this CPU to be
> > @@ -2011,7 +2018,10 @@ again:
> > goto again;
> >
> > /* Try the next RT overloaded CPU */
> > + atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
> > irq_work_queue_on(&rt_rq->push_work, cpu);
> > +out:
> > + atomic_dec(&rq->rt.ipi_count);
> > }
>
> I have a vague feeling we're duplicating state, but I can't seem to spot
> it, maybe I'm wrong.

The above can be deceiving because there are two atomics close
together, but they are subtly different.

The inc updates cpu_rq(cpu) (the next CPU to receive the IPI), while
the dec updates just "rq" (this CPU's own run queue).
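
To make the pairing explicit, here's the tail of try_to_push_tasks()
from the diff above with comments added (paraphrased, not proposing
actual comment text):

	/* Try the next RT overloaded CPU: mark *that* CPU's rq as
	 * having a push in flight before queueing the irq work there */
	atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
	irq_work_queue_on(&rt_rq->push_work, cpu);
out:
	/* ... and drop the marker on *this* CPU's own rq, since its
	 * pass through the chain is done */
	atomic_dec(&rq->rt.ipi_count);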

>
> Looks about right, but could use a comment, this stuff is getting rather
> subtle.

Agreed. I'll add that and update it. Again, with this patch added to
4.6-rt, rteval passes with a 200us max on that 80 CPU box.

Thanks for looking into this.

-- Steve