Re: [PATCH] sched/rt: Add comments describing the RT IPI pull method

From: Peter Zijlstra
Date: Mon Feb 20 2017 - 06:12:51 EST


On Fri, Feb 17, 2017 at 04:20:49PM -0500, Steven Rostedt (VMware) wrote:
>
> From: "Steven Rostedt (VMware)" <rostedt@xxxxxxxxxxx>
>
> While looking into optimizations for the RT scheduler IPI logic, I realized
> that the comments are lacking to describe it efficiently. It deserves a
> lengthy description describing its design.
>
> Signed-off-by: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>
> ---
> kernel/sched/rt.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 63 insertions(+)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 4101f9d..cef9579 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1928,6 +1928,69 @@ static int find_next_push_cpu(struct rq *rq)
> #define RT_PUSH_IPI_EXECUTING 1
> #define RT_PUSH_IPI_RESTART 2
>
> +/*
> + * When a high priority task schedules out from a CPU and a lower priority
> + * task is scheduled in, a check is made to see if there's any RT tasks
> + * on other CPUs that are waiting to run because a higher priority RT task
> + * is currently running on its CPU. In this case, the CPU with multiple RT
> + * tasks queued on it (overloaded) needs to be notified that a CPU has opened
> +* up that may be able to run its sleeping RT task.

Whitespace hiccup

Also, 'sleeping' is the wrong word here; in the scheduler that generally
means blocked, i.e. not on a runqueue and in need of a wakeup.

> + *
> + * Instead of taking the rq lock of the overloaded CPU, which could cause
> + * a huge contention on boxes with several cores,

It is not immediately obvious to the casual reader _why_ this is so.

if lots of CPUs have just
> + * scheduled a lower priority task and there's only one overloaded CPU,
> + * these CPUs that are now running a lower priority task will simply send an
> + * IPI to the overloaded CPU with multiple RT tasks queued on it.
> + *
> + * Instead of sending an IPI to all CPUs that have overloaded RT tasks,
> + * only a single IPI is sent to one CPU. That one will try to push off its
> + * overloaded task and then send an IPI to the next CPU that has overloaded RT
> + * tasks. This stops when all CPUs with overloaded RT tasks have completed.
> + * Just because a CPU may have pushed off its own overloaded RT task does
> + * not mean it should stop sending the IPI around to other overloaded CPUs.
> + * There may be another RT task waiting to run on one of those CPUs that are
> + * of higher priority than the one that was just pushed.

My brain had a bit of bother following the verbosity there. Maybe cut
this into parts:

- broadcast IPIs are bad, $reason.

- Therefore send a single IPI that visits the overloaded CPUs one after
  another (roughly as in the sketch below).

- termination condition for the visitor..
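
FWIW, the single-visitor shape is easier to see as code than as prose;
roughly something like this (a simplified sketch pieced together from the
description, not the literal rt.c code: locking, the RESTART case and the
irq_work plumbing are elided, and find_next_push_cpu() is treated as a
pure lookup here):

/*
 * Sketch of the IPI handler: it runs on the overloaded CPU named in
 * src_rq->rt.push_cpu, while src_rq is the rq that asked for the pull.
 */
static void try_to_push_tasks(struct rt_rq *src_rt)
{
        struct rq *src_rq = rq_of_rt_rq(src_rt);  /* the requesting rq */
        struct rq *rq = this_rq();                /* the overloaded rq, i.e. us */
        int next_cpu;

        /* Push one of our queued RT tasks to a lower prio CPU (typically the requester). */
        push_rt_task(rq);

        /*
         * Keep the IPI going even when the push succeeded: another
         * overloaded CPU may hold a higher prio waiter for the same
         * destination.
         */
        next_cpu = find_next_push_cpu(src_rq);
        if (next_cpu >= nr_cpu_ids) {
                /* Wrapped past the requester; every overloaded CPU was visited. */
                src_rq->rt.push_flags = 0;
                return;
        }

        src_rq->rt.push_cpu = next_cpu;
        irq_work_queue_on(&src_rq->rt.push_work, next_cpu);
}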

> + * An optimization that could possibly be made is to make a CPU array similar
> + * to the cpupri array mask of all running RT tasks, but for the overloaded
> + * case, then the IPI could be sent to only the CPU with the highest priority
> + * RT task waiting, and that CPU could send off further IPIs to the CPU with
> + * the next highest waiting task. Since the overloaded case is much less likely
> + * to happen, the complexity of this implementation may not be worth it.
> + * Instead, just send an IPI around to all overloaded CPUs.

That's a lot of maybe..

> + *
> + * The rq->rt.push_flags holds the status of the IPI that is going around.
> + * A run queue can only send out a single IPI at a time. The possible flags
> + * for rq->rt.push_flags are:
> + *
> + * (None or zero): No IPI is going around for the current rq
> + * RT_PUSH_IPI_EXECUTING: An IPI for the rq is being passed around
> + * RT_PUSH_IPI_RESTART: The priority of the running task for the rq
> + * has changed, and the IPI should restart
> + * circulating the overloaded CPUs again.

whitespace messup
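
For reference, the requesting side drives those flags roughly as below;
again only a sketch built from the description (not a verbatim copy of
tell_cpu_to_push() in rt.c), with the push_lock and memory ordering
details left out and find_next_push_cpu() treated as a pure lookup:

/* Sketch: rq just started running a lower prio task and wants a pull. */
static void tell_cpu_to_push(struct rq *rq)
{
        int cpu;

        if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
                /*
                 * An IPI from this rq is already circulating; just ask it
                 * to start over once it gets around to checking the flag.
                 */
                rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
                return;
        }

        /* No IPI in flight: start at the first overloaded CPU after us. */
        rq->rt.push_cpu = rq->cpu;
        cpu = find_next_push_cpu(rq);
        if (cpu >= nr_cpu_ids)
                return;         /* nothing is RT overloaded */

        rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
        rq->rt.push_cpu = cpu;
        irq_work_queue_on(&rq->rt.push_work, cpu);
}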

> + *
> + * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated
> + * before sending to the next CPU.
> + *
> + * When an rq schedules a lower priority task than what was currently
> + * running, the next CPU with overloaded RT tasks is examined first.
> + * That is, if CPU 1 and 5 are overloaded, and CPU 3 schedules a lower
> + * priority task, it will send an IPI first to CPU 5, then CPU 5 will
> + * send to CPU 1 if it is still overloaded.

+ $reason for this


CPU 1 will clear the
> + * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set.
> + *
> + * The first CPU to notice IPI_RESTART is set, will clear that flag and then
> + * send an IPI to the next overloaded CPU after the rq->cpu and not the next
> + * CPU after push_cpu. That is, if CPU 1, 4 and 5 are overloaded when CPU 3
> + * schedules a lower priority task, and the IPI_RESTART gets set while the
> + * handling is being done on CPU 5, it will clear the flag and send it back to
> + * CPU 4 instead of CPU 1.
> + *
> + * Note, the above logic can be disabled by turning off the sched_feature
> + * RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be
> + * taken by the CPU requesting a pull and the waiting RT task will be pulled
> + * by that CPU. This may be fine for machines with few CPUs.

Also whitespace fail.

In general, I feel this thing could do with a dose of _why_; we can get
the details of the how from the code.
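
FWIW, the restart rule above (start over from the CPU after rq->cpu, not
after push_cpu) comes down to something like this on the visiting CPU;
next_ipi_target() is a made-up name for illustration and locking is again
left out:

/* Decide where the circulating IPI should go next; runs on push_cpu. */
static int next_ipi_target(struct rq *src_rq)
{
        if (src_rq->rt.push_flags & RT_PUSH_IPI_RESTART) {
                /*
                 * The requester's priority changed again while the IPI was
                 * making its rounds: clear the flag and walk the overloaded
                 * CPUs again, starting after the requester itself.
                 */
                src_rq->rt.push_flags &= ~RT_PUSH_IPI_RESTART;
                src_rq->rt.push_cpu = src_rq->cpu;
        }

        return find_next_push_cpu(src_rq);
}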