Re: [PATCH -tip 10/32] sched: Fix priority inversion of cookied task with sibling
From: Joel Fernandes
Date: Tue Dec 01 2020 - 12:50:22 EST
On Thu, Nov 26, 2020 at 10:05:19AM +1100, Balbir Singh wrote:
> On Tue, Nov 24, 2020 at 01:30:38PM -0500, Joel Fernandes wrote:
> > On Mon, Nov 23, 2020 at 09:41:23AM +1100, Balbir Singh wrote:
> > > On Tue, Nov 17, 2020 at 06:19:40PM -0500, Joel Fernandes (Google) wrote:
> > > > From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > >
> > > > The rationale is as follows. In the core-wide pick logic, even if
> > > > need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> > > > see if they could be running RT.
> > > >
> > > > Say the RQs in a particular core look like this:
> > > > Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task.
> > > >
> > > > rq0 rq1
> > > > CFS1 (tagged) RT1 (not tag)
> > > > CFS2 (tagged)
> > > >
> > > > Say schedule() runs on rq0. Now, it will enter the above loop and
> > > > pick_task(RT) will return NULL for 'p'. It will enter the above if() block
> > > > and see that need_sync == false and will skip RT entirely.
> > > >
> > > > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > > > rq0 rq1
> > > > CFS1 IDLE
> > > >
> > > > When it should have selected:
> > > > rq0 r1
> > > > IDLE RT
> > > >
> > > > Joel saw this issue on real-world usecases in ChromeOS where an RT task
> > > > gets constantly force-idled and breaks RT. Lets cure it.
> > > >
> > > > NOTE: This problem will be fixed differently in a later patch. It just
> > > > kept here for reference purposes about this issue, and to make
> > > > applying later patches easier.
> > > >
> > >
> > > The changelog is hard to read, it refers to above if(), whereas there
> > > is no code snippet in the changelog.
> >
> > Yeah sorry, it comes from this email where I described the issue:
> > http://lore.kernel.org/r/20201023175724.GA3563800@xxxxxxxxxx
> >
> > I corrected the changelog and appended the patch below. Also pushed it to:
> > https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=coresched
> >
> > > Also, from what I can see following
> > > the series, p->core_cookie is not yet set anywhere (unless I missed it),
> > > so fixing it in here did not make sense just reading the series.
> >
> > The interface patches for core_cookie are added later, that's how it is. The
> > infrastructure comes first here. It would also not make sense to add
> > interface first as well so I think the current ordering is fine.
> >
>
> Some comments below to help make the code easier to understand
>
> > ---8<-----------------------
> >
> > From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Subject: [PATCH] sched: Fix priority inversion of cookied task with sibling
> >
> > The rationale is as follows. In the core-wide pick logic, even if
> > need_sync == false, we need to go look at other CPUs (non-local CPUs) to
> > see if they could be running RT.
> >
> > Say the RQs in a particular core look like this:
> > Let CFS1 and CFS2 be 2 tagged CFS tags. Let RT1 be an untagged RT task.
> >
> > rq0 rq1
> > CFS1 (tagged) RT1 (not tag)
> > CFS2 (tagged)
> >
> > The end result of the selection will be (say prio(CFS1) > prio(CFS2)):
> > rq0 rq1
> > CFS1 IDLE
> >
> > When it should have selected:
> > rq0 r1
> > IDLE RT
> >
> > Fix this issue by forcing need_sync and restarting the search if a
> > cookied task was discovered. This will avoid this optimization from
> > making incorrect picks.
> >
> > Joel saw this issue on real-world usecases in ChromeOS where an RT task
> > gets constantly force-idled and breaks RT. Lets cure it.
> >
> > NOTE: This problem will be fixed differently in a later patch. It just
> > kept here for reference purposes about this issue, and to make
> > applying later patches easier.
> >
> > Reported-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> > Signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> > ---
> > kernel/sched/core.c | 25 ++++++++++++++++---------
> > 1 file changed, 16 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4ee4902c2cf5..53af817740c0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5195,6 +5195,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > need_sync = !!rq->core->core_cookie;
> >
> > /* reset state */
> > +reset:
> > rq->core->core_cookie = 0UL;
> > if (rq->core->core_forceidle) {
> > need_sync = true;
> > @@ -5242,14 +5243,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > /*
> > * If there weren't no cookies; we don't need to
> > * bother with the other siblings.
> > - * If the rest of the core is not running a tagged
> > - * task, i.e. need_sync == 0, and the current CPU
> > - * which called into the schedule() loop does not
> > - * have any tasks for this class, skip selecting for
> > - * other siblings since there's no point. We don't skip
> > - * for RT/DL because that could make CFS force-idle RT.
> > */
> > - if (i == cpu && !need_sync && class == &fair_sched_class)
> > + if (i == cpu && !need_sync)
> > goto next_class;
> >
> > continue;
> > @@ -5259,7 +5254,20 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > * Optimize the 'normal' case where there aren't any
> > * cookies and we don't need to sync up.
> > */
> > - if (i == cpu && !need_sync && !p->core_cookie) {
> > + if (i == cpu && !need_sync) {
> > + if (p->core_cookie) {
> > + /*
> > + * This optimization is only valid as
> > + * long as there are no cookies
>
> This is not entirely true, need_sync is a function of core cookies, so I
> think this needs more clarification, it sounds like we enter this when
> the core has no cookies, but the task has a core_cookie? The term cookie
> is quite overloaded when used in the context of core vs task.
>
> Effectively from what I understand this means that p wants to be
> coscheduled, but the core itself is not coscheduling anything at the
> moment, so we need to see if we should do a sync and that sync might
> cause p to get kicked out and a higher priority class to come in?
Yeah so about need_sync, it is basically a flag that says if the HT running
the schedule() loop needs to bother with siblings.
need_sync is true only in following conditions:
- A cookied task is running on any HT on the core.
- Any HT in the core is force idled.
The above code comment you referred to is now reworked. That was for the case
where we discovered during local selection that we found a task with a
cookie so now we have to do a core-wide scan (need_sync = false before but
now it becomes true and we start over). This optimization is done sligtly
differently now, we run ->pick_task() on every class of the local CPU until
we find something, if we find something with a cookie then we do core-wide
selection.
The latest version of this code is now in Peter's branch:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/tree/kernel/sched/core.c?id=6288c0a49631ce6b53eeab7021a43e49c4c4d436
- Joel