Re: [PATCH] sched/core: An optimization of pick_next_task() not sure

From: Peter Zijlstra
Date: Mon Aug 16 2021 - 14:54:15 EST


On Mon, Aug 16, 2021 at 11:44:01PM +0800, Tao Zhou wrote:
> When a new candidate max is found, wipe the stale picks and start
> over: goto again and use the new max to loop and pick the task.
>
> Here, first get the max of the core, then use this new max to
> loop once and pick the task on each thread.
>
> Not sure this is an optimization and just stop here a little
> and move on..
>

Did you find this retry was an issue on your workload? Or was this from
reading the source?

> ---
> kernel/sched/core.c | 52 +++++++++++++++++----------------------------
> 1 file changed, 20 insertions(+), 32 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 20ffcc044134..bddcd328df96 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5403,7 +5403,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> const struct sched_class *class;
> const struct cpumask *smt_mask;
> bool fi_before = false;
> - int i, j, cpu, occ = 0;
> + int i, cpu, occ = 0;
> bool need_sync;
>
> if (!sched_core_enabled(rq))
> @@ -5508,11 +5508,27 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> * order.
> */
> for_each_class(class) {
> -again:
> + struct rq *rq_i;
> + struct task_struct *p;
> +
> for_each_cpu_wrap(i, smt_mask, cpu) {
> - struct rq *rq_i = cpu_rq(i);
> - struct task_struct *p;
> + rq_i = cpu_rq(i);
> + p = pick_task(rq_i, class, max, fi_before);
> + /*
> + * If this new candidate is of higher priority than the
> + * previous; and they're incompatible; pick_task makes
> + * sure that p's priority is more than max if it doesn't
> + * match max's cookie. Update max.
> + *
> + * NOTE: this is a linear max-filter and is thus bounded
> + * in execution time.
> + */
> + if (!max || !cookie_match(max, p))
> + max = p;
> + }
>
> + for_each_cpu_wrap(i, smt_mask, cpu) {
> + rq_i = cpu_rq(i);
> if (rq_i->core_pick)
> continue;
>

This now calls pick_task() twice for each CPU, which seems unfortunate;
perhaps add rq->core_temp storage to cache that result. Also, since the
first iteration is now explicitly about the max filter, perhaps we
should move that part of pick_task() into the loop and simplify things
further?
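
A minimal user-space sketch of the caching idea, not kernel code. The
struct layouts, the single-argument pick_task(), the pick_calls counter,
and the pick_core() helper are all hypothetical stand-ins; it also
collapses the per-class loop and forces a sibling idle on a cookie
mismatch instead of searching for a matching task. It only shows the
first pass doing the max filter and stashing each result in core_temp so
the second pass need not call pick_task() again:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real kernel definitions. */
struct task {
	int prio;             /* assumption: larger value means higher priority */
	unsigned long cookie; /* core-scheduling cookie */
};

struct rq {
	struct task *queued;    /* best runnable task on this sibling, or NULL */
	struct task *core_temp; /* result cached by the first pass */
	struct task *core_pick; /* final selection for this sibling */
	int pick_calls;         /* instrumentation only: pick_task() call count */
};

/* Stand-in for the (potentially expensive) per-CPU pick. */
static struct task *pick_task(struct rq *rq)
{
	rq->pick_calls++;
	return rq->queued;
}

/*
 * Pass 1: linear max filter over all siblings, caching each pick.
 * Pass 2: reuse the cached pick; a sibling whose task does not share
 * max's cookie is forced idle (NULL) in this simplified model.
 * Returns the core-wide max, or NULL if nothing is runnable.
 */
static struct task *pick_core(struct rq *rqs, size_t n)
{
	struct task *max = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		struct task *p = pick_task(&rqs[i]);

		rqs[i].core_temp = p;
		if (p && (!max || p->prio > max->prio))
			max = p;
	}

	for (i = 0; i < n; i++) {
		struct task *p = rqs[i].core_temp;

		rqs[i].core_pick =
			(p && max && p->cookie == max->cookie) ? p : NULL;
	}

	return max;
}
```

With three siblings holding tasks of priority 1, 5 and 3, where only the
last two share a cookie, the first sibling ends up idle and pick_task()
runs exactly once per sibling.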