Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

From: Joel Fernandes
Date: Tue Sep 15 2020 - 16:09:24 EST


On Fri, Aug 28, 2020 at 03:51:09PM -0400, Julien Desfossez wrote:
> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective of which
> logical CPU does the reschedule).
>
> During a CPU hotplug event, schedule() can be called with the
> hotplugged CPU not in the cpumask. So use for_each_cpu(_wrap)_or to
> include the current CPU in the task-pick loop.
>
> There are multiple loops in pick_next_task() that iterate over CPUs in
> smt_mask. During a hotplug event, a sibling could be removed from the
> smt_mask while pick_next_task() is running, so the mask cannot be
> trusted across the different loops; this can confuse the logic. Add
> retry logic that restarts the selection if smt_mask changes between
> the loops.
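
For reference, the shape being described is roughly this (a sketch, not
the patch itself: the _or iterators come from earlier in this series,
and the placement of the retry_select label is my reading of the hunk
below, which elides it):

	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
	int i, smt_weight;

retry_select:
	/* Sample the sibling count once so every loop below can detect
	 * a concurrent hotplug change and start over. */
	smt_weight = cpumask_weight(smt_mask);

	/* Union iteration: visits each CPU in smt_mask | {cpu}, so the
	 * pick loop still covers 'cpu' even when hotplug has already
	 * dropped it from smt_mask. */
	for_each_cpu_or(i, smt_mask, cpumask_of(cpu)) {
		/* reset per-sibling pick state */
	}

	for_each_class(class) {
		/* The _wrap variant starts at 'cpu' and wraps around,
		 * so the current CPU is always considered first. */
		for_each_cpu_wrap_or(i, smt_mask, cpumask_of(cpu), cpu) {
			if (unlikely(smt_weight != cpumask_weight(smt_mask)))
				goto retry_select;
			/* ... pick a task for sibling i ... */
		}
	}
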
[...]
> +static struct task_struct *
> +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> +{
[...]
> +	/* reset state */
> +	rq->core->core_cookie = 0UL;
> +	for_each_cpu_or(i, smt_mask, cpumask_of(cpu)) {
> +		struct rq *rq_i = cpu_rq(i);
> +
> +		rq_i->core_pick = NULL;
> +
> +		if (rq_i->core_forceidle) {
> +			need_sync = true;
> +			rq_i->core_forceidle = false;
> +		}
> +
> +		if (i != cpu)
> +			update_rq_clock(rq_i);
> +	}
> +
> +	/*
> +	 * Try to select tasks for each sibling in descending sched_class
> +	 * order.
> +	 */
> +	for_each_class(class) {
> +again:
> +		for_each_cpu_wrap_or(i, smt_mask, cpumask_of(cpu), cpu) {
> +			struct rq *rq_i = cpu_rq(i);
> +			struct task_struct *p;
> +
> +			/*
> +			 * During hotplug online, a sibling can be added to
> +			 * the smt_mask while we are here. If so, we would
> +			 * need to restart selection by resetting state all
> +			 * over.
> +			 */
> +			if (unlikely(smt_weight != cpumask_weight(smt_mask)))
> +				goto retry_select;
> +
> +			if (rq_i->core_pick)
> +				continue;
> +
> +			/*
> +			 * If this sibling doesn't yet have a suitable task to
> +			 * run, ask for the most eligible task, given the
> +			 * highest-priority task already selected for this
> +			 * core.
> +			 */
> +			p = pick_task(rq_i, class, max);
> +			if (!p) {
> +				/*
> +				 * If there weren't any cookies, we don't need
> +				 * to bother with the other siblings.
> +				 */
> +				if (i == cpu && !need_sync)
> +					goto next_class;

We found a problem with this code: it force-idles RT tasks whenever one
sibling is running a tagged CFS task while the other is running an
untagged RT task. In that case, pick_task() for the RT class returns
NULL on the CPU that called schedule() and need_sync is false, so the
next_class shortcut skips the sibling before its RT task is ever
considered; by the fair pass, the tagged task's cookie has become max
and the sibling is force-idled.

The diff below should fix it; we are still testing it (a small userspace
model contrasting the old and the fixed check follows the diff):

---8<-----------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cc90eb50d481..26d05043b640 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4880,10 +4880,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			p = pick_task(rq_i, class, max);
 			if (!p) {
 				/*
-				 * If there weren't any cookies, we don't need
-				 * to bother with the other siblings.
+				 * If the rest of the core is not running a
+				 * tagged task, i.e. need_sync == 0, and the
+				 * current CPU that called into schedule()
+				 * does not have any tasks of this class,
+				 * skip selecting for the other siblings since
+				 * there's no point. We don't skip for RT/DL
+				 * because that could make CFS force-idle RT.
 				 */
-				if (i == cpu && !need_sync)
+				if (i == cpu && !need_sync && class == &fair_sched_class)
 					goto next_class;
 
 				continue;
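
To make the failure mode concrete, here is a toy userspace model of just
this shortcut (illustration only, not kernel code; cookie matching
between siblings is left out, so it only shows whether the sibling's
untagged RT task is ever examined by the pick loop):

#include <stdio.h>
#include <stdbool.h>

enum cls { RT, FAIR, IDLE, NR_CLS };
static const char * const cls_name[NR_CLS] = { "rt", "fair", "idle" };

/* Cookie of the runnable task per (cpu, class); NONE = nothing there. */
#define NONE (-1UL)
static const unsigned long task[2][NR_CLS] = {
	{ NONE, 0xbeef, 0 },	/* cpu0: only a tagged CFS task */
	{ 0,    NONE,   0 },	/* cpu1: only an untagged RT task */
};

static void pick(bool fixed)
{
	const int cpu = 0;		/* cpu0 called into schedule() */
	const bool need_sync = false;	/* no core-wide cookie yet */
	int picked[2] = { -1, -1 };
	int c, k;

	for (c = 0; c < NR_CLS; c++) {		/* descending class order */
		for (k = 0; k < 2; k++) {
			int i = (cpu + k) % 2;	/* wrap: current CPU first */

			if (picked[i] >= 0)
				continue;
			if (task[i][c] == NONE) {
				/* The shortcut: the old check takes it for
				 * every class, the fixed one only for fair. */
				if (i == cpu && !need_sync &&
				    (!fixed || c == FAIR))
					break;		/* next_class */
				continue;
			}
			picked[i] = c;
		}
	}
	printf("%s cpu0=%s cpu1=%s\n", fixed ? "fixed:" : "old: ",
	       cls_name[picked[0]], cls_name[picked[1]]);
}

int main(void)
{
	pick(false);	/* old:   cpu0=fair cpu1=idle -- RT force-idled */
	pick(true);	/* fixed: cpu0=fair cpu1=rt   -- RT pass completes */
	return 0;
}

(With cookie matching included, the tagged CFS task may then be the one
force-idled in favour of the RT task, which is the priority order the
fix is after.)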