Re: [PATCH 1/4] sched/fair: reorder enqueue/dequeue_task_fair path

From: Vincent Guittot
Date: Wed Feb 12 2020 - 09:47:46 EST


On Wed, 12 Feb 2020 at 14:20, Mel Gorman <mgorman@xxxxxxx> wrote:
>
> On Tue, Feb 11, 2020 at 06:46:48PM +0100, Vincent Guittot wrote:
> > The walk through the cgroup hierarchy during the enqueue/dequeue of a task
> > is split in 2 distinct parts for throttled cfs_rq without any added value
> > but making code less readable.
> >
> > Change the code ordering such that everything related to a cfs_rq
> > (throttled or not) will be done in the same loop.
> >
> > In addition, the same steps ordering is used when updating a cfs_rq:
> > - update_load_avg
> > - update_cfs_group
> > - update *h_nr_running
> >
> > No functional and performance changes are expected and have been noticed
> > during tests.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > ---
> > kernel/sched/fair.c | 42 ++++++++++++++++++++----------------------
> > 1 file changed, 20 insertions(+), 22 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1a0ce83e835a..a1ea02b5362e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5259,32 +5259,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > cfs_rq = cfs_rq_of(se);
> > enqueue_entity(cfs_rq, se, flags);
> >
> > - /*
> > - * end evaluation on encountering a throttled cfs_rq
> > - *
> > - * note: in the case of encountering a throttled cfs_rq we will
> > - * post the final h_nr_running increment below.
> > - */
> > - if (cfs_rq_throttled(cfs_rq))
> > - break;
> > cfs_rq->h_nr_running++;
> > cfs_rq->idle_h_nr_running += idle_h_nr_running;
> >
> > + /* end evaluation on encountering a throttled cfs_rq */
> > + if (cfs_rq_throttled(cfs_rq))
> > + goto enqueue_throttle;
> > +
> > flags = ENQUEUE_WAKEUP;
> > }
> >
> > for_each_sched_entity(se) {
> > cfs_rq = cfs_rq_of(se);
> > - cfs_rq->h_nr_running++;
> > - cfs_rq->idle_h_nr_running += idle_h_nr_running;
> >
> > + /* end evaluation on encountering a throttled cfs_rq */
> > if (cfs_rq_throttled(cfs_rq))
> > - break;
> > + goto enqueue_throttle;
> >
> > update_load_avg(cfs_rq, se, UPDATE_TG);
> > update_cfs_group(se);
> > +
> > + cfs_rq->h_nr_running++;
> > + cfs_rq->idle_h_nr_running += idle_h_nr_running;
> > }
> >
> > +enqueue_throttle:
> > if (!se) {
> > add_nr_running(rq, 1);
> > /*
>
> I'm having trouble reconciling the patch with the description and the
> comments explaining the intent behind the code are unhelpful.
>
> There are two loops before and after your patch -- the first dealing with
> sched entities that are not on a runqueue and the second for the remaining
> entities that are. The intent appears to be to update the load averages
> once the entity is active on a runqueue.
>
> I'm not getting why the changelog says everything related to cfs is
> now done in one loop because there are still two. But even if you do
> get throttled, it's not clear why you jump to the !se check given that
> for_each_sched_entity did not complete. What it *does* appear to do is
> have all the h_nr_running related to entities being enqueued updated in
> one loop and all remaining entities stats updated in the other.

Let's take the example of 2 levels in addition to root, so we have:
root->cfs1->cfs2
Now we enqueue a task T1 on cfs2 but cfs1 is throttled. Before the
patch, we have the sequence:

In 1st for_each_sched_entity loop:
loop 1
enqueue_entity (T1->se, cfs2) which calls update_load_avg(cfs2)
cfs2->h_nr_running++;
loop 2
enqueue_entity (cfs2->gse, cfs1) which calls update_load_avg(cfs1)
break because cfs1 is throttled

In 2nd for_each_sched_entity loop:
loop 1
cfs1->h_nr_running++
break because throttled

Using the 2nd loop only to increment h_nr_running of the throttled
cfs_rq is useless: we can do that directly in the 1st loop and skip
the 2nd loop.

With this patch we have:

In 1st for_each_sched_entity loop:
loop 1
enqueue_entity (T1->se, cfs2) which calls update_load_avg(cfs2)
cfs2->h_nr_running++;
loop 2
enqueue_entity (cfs2->gse, cfs1) which calls update_load_avg(cfs1)
cfs1->h_nr_running++
skip the 2nd for_each_sched_entity entirely
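
To make the two sequences above easier to compare, here is a rough
userspace sketch of the walk for this one scenario (T1 enqueued on cfs2,
cfs1 throttled, nothing queued beforehand). It is not kernel code: the
toy_* types and helpers are hypothetical stand-ins, enqueue_entity() is
reduced to "update load_avg and set on_rq", and update_cfs_group() and
idle_h_nr_running are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for cfs_rq and sched_entity; only what the walk needs. */
struct toy_cfs_rq {
	bool throttled;
	unsigned int h_nr_running;
	unsigned int load_avg_updates;	/* counts modelled update_load_avg() calls */
};

struct toy_se {
	struct toy_se *parent;		/* next level up, NULL above root */
	struct toy_cfs_rq *cfs_rq;	/* queue this entity is enqueued on */
	bool on_rq;
};

struct toy_result {
	unsigned int root_h, cfs1_h, cfs2_h;
	unsigned int cfs1_updates, cfs2_updates;
	bool reached_top;		/* models the "if (!se)" check */
};

/* Models enqueue_entity(): updates load_avg and marks the entity queued. */
static void toy_enqueue_entity(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	cfs_rq->load_avg_updates++;
	se->on_rq = true;
}

/* Walk ordering before the patch. */
static bool toy_enqueue_before(struct toy_se *se)
{
	for (; se; se = se->parent) {
		if (se->on_rq)
			break;
		toy_enqueue_entity(se->cfs_rq, se);
		if (se->cfs_rq->throttled)
			break;		/* increment is left to the 2nd loop */
		se->cfs_rq->h_nr_running++;
	}
	for (; se; se = se->parent) {
		se->cfs_rq->h_nr_running++;
		if (se->cfs_rq->throttled)
			break;
		se->cfs_rq->load_avg_updates++;
	}
	return se == NULL;
}

/* Walk ordering with the patch applied. */
static bool toy_enqueue_after(struct toy_se *se)
{
	for (; se; se = se->parent) {
		if (se->on_rq)
			break;
		toy_enqueue_entity(se->cfs_rq, se);
		se->cfs_rq->h_nr_running++;
		if (se->cfs_rq->throttled)
			goto enqueue_throttle;
	}
	for (; se; se = se->parent) {
		if (se->cfs_rq->throttled)
			goto enqueue_throttle;
		se->cfs_rq->load_avg_updates++;
		se->cfs_rq->h_nr_running++;
	}
enqueue_throttle:
	return se == NULL;
}

/* Builds root->cfs1->cfs2 with cfs1 throttled and enqueues T1 on cfs2. */
static struct toy_result toy_run_example(bool patched)
{
	struct toy_cfs_rq root = { 0 }, cfs1 = { .throttled = true }, cfs2 = { 0 };
	struct toy_se gse1 = { NULL, &root, false };	/* cfs1's group entity */
	struct toy_se gse2 = { &gse1, &cfs1, false };	/* cfs2's group entity */
	struct toy_se t1 = { &gse2, &cfs2, false };	/* the task T1 */
	struct toy_result r;

	r.reached_top = patched ? toy_enqueue_after(&t1) : toy_enqueue_before(&t1);
	r.root_h = root.h_nr_running;
	r.cfs1_h = cfs1.h_nr_running;
	r.cfs2_h = cfs2.h_nr_running;
	r.cfs1_updates = cfs1.load_avg_updates;
	r.cfs2_updates = cfs2.load_avg_updates;
	return r;
}
```

Calling toy_run_example(false) and toy_run_example(true) leaves the same
counters on cfs1 and cfs2 for this scenario, and in both cases se is
still non-NULL at the end, so the "if (!se)" block is skipped while
cfs1 is throttled. The sketch only covers the example above, not every
possible hierarchy state.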

Then the patch also reorders the call to update_load_avg() and the
increment of h_nr_running.

Before the patch, the ordering differed between the two
for_each_sched_entity loops, which is not a problem because there is
currently no relation between the two. But the following patches make
PELT use h_nr_running, so we must have the same ordering to prevent
updating PELT with the wrong h_nr_running value.
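
As a small illustration of why the ordering matters once the load
update reads h_nr_running, here is a toy sketch. Everything in it is
hypothetical: toy_update_load_avg() simply records the h_nr_running it
sees, which is an assumption made for the example and not what the
following patches actually compute.

```c
#include <assert.h>

struct toy_rq {
	unsigned int h_nr_running;
	unsigned int seen_by_pelt;	/* h_nr_running sampled by the last update */
};

/* Hypothetical stand-in for an update_load_avg() that reads h_nr_running. */
static void toy_update_load_avg(struct toy_rq *rq)
{
	rq->seen_by_pelt = rq->h_nr_running;
}

/* Ordering of the 1st loop, and of both loops after the patch. */
static unsigned int toy_update_then_count(unsigned int start)
{
	struct toy_rq rq = { .h_nr_running = start };

	toy_update_load_avg(&rq);
	rq.h_nr_running++;
	return rq.seen_by_pelt;
}

/* Ordering of the 2nd loop before the patch. */
static unsigned int toy_count_then_update(unsigned int start)
{
	struct toy_rq rq = { .h_nr_running = start };

	rq.h_nr_running++;
	toy_update_load_avg(&rq);
	return rq.seen_by_pelt;
}
```

Starting from the same h_nr_running, the two orderings hand a
different value to the update, which is the kind of mismatch the
reordering is meant to avoid.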

>
> Following the accounting is tricky. Before the patch, if throttling
> happened then h_nr_running was updated without updating the corresponding
> nr_running counter in rq. They are out of sync until unthrottle_cfs_rq
> is called at the very least. After your patch, the same is true and while
> the accounting appears to be equivalent, it's not clear it's correct and
> I do not find the code any easier to understand after the patch or how
> it's connected to runnable_load_avg which this series is about :(
>
> I think the patch is functionally ok but I am having trouble figuring
> out the motive. Maybe it'll be obvious after I read the rest of the series.
>
> --
> Mel Gorman
> SUSE Labs