Re: [PATCH] sched/fair: Fix enqueue_task_fair warning some more
From: Phil Auld
Date: Wed May 06 2020 - 14:05:36 EST
Hi Vincent,
Thanks for taking a look. More below...
On Wed, May 06, 2020 at 06:36:45PM +0200 Vincent Guittot wrote:
> Hi Phil,
>
> - reply to all this time
>
> On Wed, 6 May 2020 at 16:18, Phil Auld <pauld@xxxxxxxxxx> wrote:
> >
> > sched/fair: Fix enqueue_task_fair warning some more
> >
> > The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
> > did not fully resolve the issues with the (rq->tmp_alone_branch !=
> > &rq->leaf_cfs_rq_list) warning in enqueue_task_fair. There is a case where
> > the first for_each_sched_entity loop exits due to on_rq, having incompletely
> > updated the list. In this case the second for_each_sched_entity loop can
> > further modify se. The later code to fix up the list management fails to do
>
> But for the 2nd for_each_sched_entity, the cfs_rq should already be
> in the list, isn't it ?
No. In this case we hit the parent not on list case in list_add_leaf_cfs_rq
which sets rq-tmp_alone_branch to cfs_rq->leaf_cfs_rq_list which is not
the same. It returns false expecting the parent to be added later.
But then the parent doens't get there because it's on_rq.
>
> The third for_each_entity loop is there for the throttled case but is
> useless for other case
>
There actually is a throttling involved usually. The second loop breaks out
early because one of the parents is throttled. But not before it advances
se at least once.
Then the 3rd loop doesn't fix the tmp_alone_branch because it doesn't start
with the right se.
> Could you provide us some details about the use case that creates such
> a situation ?
>
I admit I had to add trace_printks to get here. Here's what it showed (sorry
for the long lines...)
1) sh-6271 [044] 1271.322317: bprint: enqueue_task_fair: se 0xffffa085e7e30080 on_rq 0 cfs_rq = 0xffffa085e93da200
2) sh-6271 [044] 1271.322320: bprint: enqueue_entity: Add_leaf_rq: cpu 17: nr_r 2; cfs 0xffffa085e93da200 onlist 0 tmp_a_b = 0xffffa085ef92c868 &rq->l_c_r_l = 0xffffa085ef92c868
3) sh-6271 [044] 1271.322322: bprint: enqueue_entity: Add_leaf_rq: cpu 17: nr_r 2: parent not onlist Set t_a_branch to 0xffffa085e93da340 rq->l_c_r_l = 0xffffa085ef92c868
4) sh-6271 [044] 1271.322323: bprint: enqueue_task_fair: se 0xffffa085e93d8800 on_rq 1 cfs_rq = 0xffffa085dbfaea00
5) sh-6271 [044] 1271.322324: bprint: enqueue_task_fair: Done enqueues, se=0xffffa085e93d8800, pid=3642
6) sh-6271 [044] 1271.322326: bprint: enqueue_task_fair: update: cfs 0xffffa085e48ce000 throttled, se = 0xffffa085dbfafc00
7) sh-6271 [044] 1271.322326: bprint: enqueue_task_fair: current se = 0xffffa085dbfafc00, orig_se = 0xffffa085e7e30080
8) sh-6271 [044] 1271.322327: bprint: enqueue_task_fair: Add_leaf_rq: cpu 17: nr_r 2; cfs 0xffffa085e48ce000 onlist 1 tmp_a_b = 0xffffa085e93da340 &rq->l_c_r_l = 0xffffa085ef92c868
9) sh-6271 [044] 1271.322328: bprint: enqueue_task_fair: Add_leaf_rq: cpu 17: nr_r 0; cfs 0xffffa085ef92bf80 onlist 1 tmp_a_b = 0xffffa085e93da340 &rq->l_c_r_l = 0xffffa085ef92c868
10) sh-6271 [044] 1271.672599: bprint: enqueue_task_fair: cpu 17: rq->tmp_alone_branch = 0xffffa085e93da340 != &rq->leaf_cfs_rq_list = 0xffffa085ef92c868
lines 1 and 4 are from the first loop in enqueue_task_fair. Line 2 and 3 are from the
first call to list_add_leaf_rq with line 2 being at the start and line 3 showing which
of the 3 cases we hit.
Line 5 is right after the first loop.
Line 6 is the second trip through the 2nd loop and is in the if(throttled) condition.
Line 7 is right below the enqueue_throttle label.
Lines 8 and 9 are from the fixup loop and since onlist is set for both of these it doesn't
do anything. But we've left rq->tmp_alone_branch pointing to the cfs_rq->leaf_cfs_rq_list
from the one call to list_add_leaf_rq that did something and so the cleanup doesn't work.
Based on the comment at the clean up, it looked like it expected the se to be what it was
when the first loop broke not whatever it was left at after the second loop. Could have
been NULL there too I guess but I didn't hit that case.
This is 100% reproducible. And completely gone with the fix. I have a trace showing that.
Does that make more sense?
Cheers,
Phil
> > what is needed because se does not point to the sched_entity which broke out
> > of the first loop.
> >
> > Address this issue by saving the se pointer when the first loop exits and
> > resetting it before doing the fix up, if needed.
> >
> > Signed-off-by: Phil Auld <pauld@xxxxxxxxxx>
> > Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> > Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
> > ---
> > kernel/sched/fair.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 02f323b85b6d..719c996317e3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5432,6 +5432,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > {
> > struct cfs_rq *cfs_rq;
> > struct sched_entity *se = &p->se;
> > + struct sched_entity *saved_se = NULL;
> > int idle_h_nr_running = task_has_idle_policy(p);
> >
> > /*
> > @@ -5466,6 +5467,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > flags = ENQUEUE_WAKEUP;
> > }
> >
> > + saved_se = se;
> > for_each_sched_entity(se) {
> > cfs_rq = cfs_rq_of(se);
> >
> > @@ -5510,6 +5512,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > * leaf list maintenance, resulting in triggering the assertion
> > * below.
> > */
> > + if (saved_se)
> > + se = saved_se;
> > for_each_sched_entity(se) {
> > cfs_rq = cfs_rq_of(se);
> >
> > --
> > 2.18.0
> >
> >
> > Cheers,
> > Phil
> >
>
--