Re: [PATCH updated v2] sched/fair: core wide cfs task priority comparison

From: Aaron Lu
Date: Fri May 15 2020 - 23:42:50 EST


On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote:
> On Fri, May 08, 2020 at 08:34:57PM +0800, Aaron Lu wrote:
> > With this said, I realized a workaround for the issue described above:
> > when the core went from 'compatible mode'(step 1-3) to 'incompatible
> > mode'(step 4), reset all root level sched entities' vruntime to be the
> > same as the core wide min_vruntime. After all, the core is transforming
> > from two runqueue mode to single runqueue mode... I think this can solve
> > the issue to some extent but I may miss other scenarios.
>
> A little something like so, this syncs min_vruntime when we switch to
> single queue mode. This is very much SMT2 only, I got my head in twist
> when thikning about more siblings, I'll have to try again later.

Thanks a lot for the patch, I now see that "there is no need to adjust
every se's vruntime". :-)

> This very much retains the horrible approximation of S we always do.
>
> Also, it is _completely_ untested...

I've been testing it.

One problem below.

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4293,10 +4281,11 @@ static struct task_struct *
> pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> {
> struct task_struct *next, *max = NULL;
> + int old_active = 0, new_active = 0;
> const struct sched_class *class;
> const struct cpumask *smt_mask;
> - int i, j, cpu;
> bool need_sync = false;
> + int i, j, cpu;
>
> cpu = cpu_of(rq);
> if (cpu_is_offline(cpu))
> @@ -4349,10 +4338,14 @@ pick_next_task(struct rq *rq, struct tas
> rq_i->core_pick = NULL;
>
> if (rq_i->core_forceidle) {
> + // XXX is_idle_task(rq_i->curr) && rq_i->nr_running ??
> need_sync = true;
> rq_i->core_forceidle = false;
> }
>
> + if (!is_idle_task(rq_i->curr))
> + old_active++;
> +
> if (i != cpu)
> update_rq_clock(rq_i);
> }
> @@ -4463,8 +4456,12 @@ next_class:;
>
> WARN_ON_ONCE(!rq_i->core_pick);
>
> - if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
> - rq_i->core_forceidle = true;
> + if (is_idle_task(rq_i->core_pick)) {
> + if (rq_i->nr_running)
> + rq_i->core_forceidle = true;
> + } else {
> + new_active++;
> + }
>
> if (i == cpu)
> continue;
> @@ -4476,6 +4473,16 @@ next_class:;
> WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
> }
>
> + /* XXX SMT2 only */
> + if (new_active == 1 && old_active > 1) {

There is a case when incompatible task appears but we failed to 'drop
into single-rq mode' per the above condition check. The TLDR is: when
there is a task that sits on the sibling rq with the same cookie as
'max', new_active will be 2 instead of 1 and that would cause us missing
the chance to do a sync of core min_vruntime.

This is how it happens:
1) 2 tasks of the same cgroup with different weight running on 2 siblings,
say cg0_A with weight 1024 bound at cpu0 and cg0_B with weight 2 bound
at cpu1(assume cpu0 and cpu1 are siblings);
2) Since new_active == 2, we didn't trigger min_vruntime sync. For
simplicity, let's assume both siblings' root cfs_rq's min_vruntime and
core_vruntime are all at 0 now;
3) let the two tasks run a while;
4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's
existing task has a very small weight, its cfs_rq's min_vruntime can
be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is
much larger than cg0_A's and the 'max' of the core wide task
selection goes to cg0_A;
5) Now I suppose we should drop into single-rq mode and by doing a sync
of core min_vruntime, cg1_C's turn shall come. But the problem is, our
current selection logic prefer not to waste CPU time so after decides
cg0_A as the 'max', the sibling will also do a cookie_pick() and
get cg0_B to run. This is where problem asises: new_active is 2
instead of the expected 1.
6) Due to we didn't do the sync of core min_vruntime, the newly queued
cg1_C shall wait a long time before cg0_A's vruntime catches up.

One naive way of precisely determine when to drop into single-rq mode is
to track how many tasks of a particular tag exists and use that to
decide if the core is in compatible mode(all tasks belong to the same
cgroup, IOW, have the same core_cookie) or not and act accordingly,
except that: does this sound too complex and inefficient?...

> + /*
> + * We just dropped into single-rq mode, increment the sequence
> + * count to trigger the vruntime sync.
> + */
> + rq->core->core_sync_seq++;
> + }
> + rq->core->core_active = new_active;
> +
> done:
> set_next_task(rq, next);
> return next;