[PATCH] sched/rt: Avoid sending an IPI to a CPU already doing a push

From: Steven Rostedt
Date: Fri Jun 24 2016 - 11:26:25 EST



When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one runnable RT task). It checks the rto_mask to determine
this, and if so it will request to pull one of those tasks to itself, provided
the non-running RT task is of higher priority than the new priority of the
next task to run on the current CPU.
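
Roughly, the decision looks like the following stand-alone sketch (a
user-space model for illustration only; names like next_task_prio[] and
highest_waiting_prio[] are made up and are not the kernel code):

/* User-space model of the pull decision, for illustration only. */
#include <stdbool.h>

#define NR_CPUS		16

/* Lower value means higher priority, as in the kernel. */
static int next_task_prio[NR_CPUS];		/* prio of the task about to run */
static int highest_waiting_prio[NR_CPUS];	/* best queued (not running) RT task */
static unsigned long rto_mask;			/* CPUs with more than one RT task */

static bool should_try_pull(int this_cpu)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == this_cpu || !(rto_mask & (1UL << cpu)))
			continue;
		/* Pull only if the waiting task beats what we are about to run */
		if (highest_waiting_prio[cpu] < next_task_prio[this_cpu])
			return true;
	}
	return false;
}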

When we deal with a large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit b6366f048e0c ("sched/rt: Use IPI to trigger RT task
push migration instead of pulling") changed the way to request a pull.
Instead of grabbing the lock of the overloaded CPU's runqueue, it simply
sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build-up,
it could still suffer a large (although not as large as without the IPI)
latency due to the work within the IPI. To understand this issue, an
explanation of the IPI logic is required.

When a CPU lowers its priority, it finds the next set bit in the rto_mask
from its own CPU. That is, if bit 2 and 10 are set in the rto_mask, and CPU
8 lowers its priority, it will select CPU 10 to send its IPI to. Now, let's
say that CPU 0 and CPU 1 lower their priority. They will both send their IPIs
to CPU 2.
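
The selection of the target CPU can be modeled with the small stand-alone
program below (the kernel scans the real rto_mask with cpumask helpers and
the per-rq push_cpu state; rto_mask, NR_CPUS and next_rto_cpu() here are
invented names for illustration):

/* User-space model of "next set bit in the rto_mask after my CPU". */
#include <stdio.h>

#define NR_CPUS		16

static unsigned long rto_mask = (1UL << 2) | (1UL << 10);

static int next_rto_cpu(int from_cpu)
{
	for (int i = 1; i <= NR_CPUS; i++) {
		int cpu = (from_cpu + i) % NR_CPUS;	/* wrap around */
		if (rto_mask & (1UL << cpu))
			return cpu;
	}
	return -1;	/* no overloaded CPU found */
}

int main(void)
{
	printf("CPU 8 sends its IPI to CPU %d\n", next_rto_cpu(8));	/* 10 */
	printf("CPU 0 sends its IPI to CPU %d\n", next_rto_cpu(0));	/* 2 */
	printf("CPU 1 sends its IPI to CPU %d\n", next_rto_cpu(1));	/* 2 */
	return 0;
}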

If the IPI from CPU 0 gets to CPU 2 first, then it triggers the push logic,
and if CPU 1 is now running at a lower priority than CPU 0, CPU 2 will push
its overloaded task to CPU 1 (due to cpupri), even though the IPI came from
CPU 0. Even though a task was pushed, we need to make sure that there are no
higher priority tasks still waiting. Thus an IPI is then sent to CPU 10 to
process CPU 0's request (remember, the pushed task went to CPU 1).

When the IPI of CPU 1 reaches CPU 2, it will skip the push logic (because it
no longer has any tasks to push), but it too still needs to notify other
CPUs about this CPU lowering its priority. Thus it sends another IPI to CPU
10, because that bit is still set in the rto_mask.

Now CPU 10 has just finished dealing with the IPI from CPU 8, and even though
it no longer has any RT tasks to push, it just received two more IPIs (from
CPU 2, to deal with the requests of CPU 0 and CPU 1). It too must do work to
see if it should continue sending an IPI to more rto_mask CPUs. If there are
no more CPUs to send to, it still needs to "stop" the execution of the push
request.

Although these IPIs are fast to process, I've traced a single CPU dealing
with 89 IPIs in a row, on an 80 CPU machine! This was caused by an overloaded
RT task that had a limited CPU affinity, so most of the CPUs sending IPIs to
it could not do anything with it. And because those CPUs were very active and
changed their priorities again, they sent out duplicates. The latency of
handling 89 IPIs was 200us (~2.3us to handle each IPI), as each IPI requires
taking a spinlock that deals with the IPI itself (not a rq lock, and with
very little contention).

To solve this, an ipi_count is added to rt_rq, that gets incremented when an
IPI is sent to that runqueue. When looking for the next CPU to send the IPI
to, the ipi_count is checked to see if that CPU is already processing push
requests; if so, that CPU is skipped and the next CPU in the rto_mask is
checked.
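
The counter protocol itself is small: bump the target CPU's count just before
queueing the irq_work, skip any CPU whose count is non-zero when scanning the
rto_mask, and drop the count when the IPI handler is done. A user-space model
with C11 atomics (invented names, not the kernel code):

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS		16

/* Non-zero means a push IPI is already queued or running on that CPU
 * (models rt_rq->ipi_count).
 */
static atomic_int ipi_count[NR_CPUS];

/* Caller found 'cpu' set in the rto_mask; send only if it is not busy.
 * The check is only a best-effort filter; losing a race just means one
 * extra IPI, same as in the patch.
 */
static bool try_send_push_ipi(int cpu)
{
	if (atomic_load(&ipi_count[cpu]))
		return false;			/* already processing, skip it */
	atomic_fetch_add(&ipi_count[cpu], 1);
	/* ... queue the irq_work / IPI on 'cpu' here ... */
	return true;
}

/* Called at the end of the push IPI handler running on 'cpu'. */
static void push_ipi_done(int cpu)
{
	atomic_fetch_sub(&ipi_count[cpu], 1);
}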

The IPI code now needs to call push_rt_tasks() instead of just push_rt_task(),
as it will not be receiving an IPI for each CPU that is requesting a pull.

This change removes the duplication of work in the IPI logic and greatly
lowers the latency caused by the IPIs.

Signed-off-by: Steven Rostedt <rostedt@xxxxxxxxxxx>
---
kernel/sched/rt.c | 14 ++++++++++++--
kernel/sched/sched.h | 2 ++
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d5690b722691..165bcfdbd94b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -100,6 +100,7 @@ void init_rt_rq(struct rt_rq *rt_rq)
rt_rq->push_flags = 0;
rt_rq->push_cpu = nr_cpu_ids;
raw_spin_lock_init(&rt_rq->push_lock);
+ atomic_set(&rt_rq->ipi_count, 0);
init_irq_work(&rt_rq->push_work, push_irq_work_func);
#endif
#endif /* CONFIG_SMP */
@@ -1917,6 +1918,10 @@ static int find_next_push_cpu(struct rq *rq)
break;
next_rq = cpu_rq(cpu);

+ /* If pushing was already started, ignore */
+ if (atomic_read(&next_rq->rt.ipi_count))
+ continue;
+
/* Make sure the next rq can push to this rq */
if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
break;
@@ -1955,6 +1960,7 @@ static void tell_cpu_to_push(struct rq *rq)
return;

rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
+ atomic_inc(&cpu_rq(cpu)->rt.ipi_count);

irq_work_queue_on(&rq->rt.push_work, cpu);
}
@@ -1974,11 +1980,12 @@ static void try_to_push_tasks(void *arg)

rq = cpu_rq(this_cpu);
src_rq = rq_of_rt_rq(rt_rq);
+ WARN_ON_ONCE(!atomic_read(&rq->rt.ipi_count));

again:
if (has_pushable_tasks(rq)) {
raw_spin_lock(&rq->lock);
- push_rt_task(rq);
+ push_rt_tasks(rq);
raw_spin_unlock(&rq->lock);
}

@@ -2000,7 +2007,7 @@ again:
raw_spin_unlock(&rt_rq->push_lock);

if (cpu >= nr_cpu_ids)
- return;
+ goto out;

/*
* It is possible that a restart caused this CPU to be
@@ -2011,7 +2018,10 @@ again:
goto again;

/* Try the next RT overloaded CPU */
+ atomic_inc(&cpu_rq(cpu)->rt.ipi_count);
irq_work_queue_on(&rt_rq->push_work, cpu);
+out:
+ atomic_dec(&rq->rt.ipi_count);
}

static void push_irq_work_func(struct irq_work *work)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de607e4febd9..b47d580dfa84 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -476,6 +476,8 @@ struct rt_rq {
int push_cpu;
struct irq_work push_work;
raw_spinlock_t push_lock;
+ /* Used to skip CPUs being processed in the rto_mask */
+ atomic_t ipi_count;
#endif
#endif /* CONFIG_SMP */
int rt_queued;
--
1.9.3