Re: [RFC PATCH 09/13] sched/fair: core wide vruntime comparison

From: Aaron Lu
Date: Tue Apr 14 2020 - 23:34:27 EST


On Tue, Apr 14, 2020 at 03:56:24PM +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:59PM +0000, vpillai wrote:
> > From: Aaron Lu <aaron.lu@xxxxxxxxxxxxxxxxx>
> >
> > This patch provides a vruntime based way to compare two cfs task's
> > priority, be it on the same cpu or different threads of the same core.
> >
> > When the two tasks are on the same CPU, we just need to find a common
> > cfs_rq both sched_entities are on and then do the comparison.
> >
> > When the two tasks are on differen threads of the same core, the root
> > level sched_entities to which the two tasks belong will be used to do
> > the comparison.
> >
> > An ugly illustration for the cross CPU case:
> >
> > cpu0 cpu1
> > / | \ / | \
> > se1 se2 se3 se4 se5 se6
> > / \ / \
> > se21 se22 se61 se62
> >
> > Assume CPU0 and CPU1 are smt siblings and task A's se is se21 while
> > task B's se is se61. To compare priority of task A and B, we compare
> > priority of se2 and se6. Whose vruntime is smaller, who wins.
> >
> > To make this work, the root level se should have a common cfs_rq min
> > vuntime, which I call it the core cfs_rq min vruntime.
> >
> > When we adjust the min_vruntime of rq->core, we need to propgate
> > that down the tree so as to not cause starvation of existing tasks
> > based on previous vruntime.
>
> You forgot the time complexity analysis.

This is a mistake and the adjust should be needed only once when core
scheduling is initially enabled. It is an initialization thing and there
is no reason to do it in every invocation of coresched_adjust_vruntime().

Vineeth,

I think we have talked about this before and you agreed that it is
needed only once:
https://lore.kernel.org/lkml/20191012035503.GA113034@aaronlu/
https://lore.kernel.org/lkml/CANaguZBevMsQ_Zy1ozKn2Z5Uj6WBviC6UU+zpTQOVdDDLK6r2w@xxxxxxxxxxxxxx/

I'll see how to fix it, but feel free to beat me to it.

> > +static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
> > +{
> > + struct sched_entity *se, *next;
> > +
> > + if (!cfs_rq)
> > + return;
> > +
> > + cfs_rq->min_vruntime -= delta;
> > + rbtree_postorder_for_each_entry_safe(se, next,
> > + &cfs_rq->tasks_timeline.rb_root, run_node) {
>
> Which per this ^
>
> > + if (se->vruntime > delta)
> > + se->vruntime -= delta;
> > + if (se->my_q)
> > + coresched_adjust_vruntime(se->my_q, delta);
> > + }
> > +}
>
> > @@ -511,6 +607,7 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
> >
> > /* ensure we never gain time by being placed backwards. */
> > cfs_rq->min_vruntime = max_vruntime(cfs_rq_min_vruntime(cfs_rq), vruntime);
> > + update_core_cfs_rq_min_vruntime(cfs_rq);
> > #ifndef CONFIG_64BIT
> > smp_wmb();
> > cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
>
> as called from here, is exceedingly important.
>
> Worse, I don't think our post-order iteration is even O(n).
>
>
> All of this is exceedingly yuck.