Re: [PATCH v4 5/8] sched/fair: Take into account latency priority at wakeup

From: Vincent Guittot
Date: Mon Sep 19 2022 - 11:39:48 EST


On Mon, 19 Sept 2022 at 12:05, Dietmar Eggemann
<dietmar.eggemann@xxxxxxx> wrote:
>
> On 16/09/2022 10:03, Vincent Guittot wrote:
>
> [...]
>
> > @@ -4512,7 +4519,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> > p->prio = current->normal_prio;
> >
> > /* Propagate the parent's latency requirements to the child as well */
> > - p->latency_nice = current->latency_nice;
> > + p->latency_prio = current->latency_prio;
>
> Isn't here a `set_latency_offset(p)` missing here?

Hmm, I think it's the opposite and the line above is a nop from the
beginning (i.e. patch 2).

>
> >
> > uclamp_fork(p);
> >
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e8c1b889dcbb..a20eadb0af97 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4574,6 +4574,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > update_idle_cfs_rq_clock_pelt(cfs_rq);
> > }
> >
> > +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se);
>
> minor: `struct sched_entity *curr` ... doesn't have to be current
> (cfs_rq->curr). Isn't this more like `struct sched_entity *sea, struct
> sched_entity *seb`? Anyway, it's already the case for
> `wakeup_preempt_entity`.
>
> [...]
>
> > @@ -5732,6 +5735,35 @@ static int sched_idle_cpu(int cpu)
> > }
> > #endif
> >
> > +static void set_next_buddy(struct sched_entity *se);
> > +
> > +static void check_preempt_from_others(struct cfs_rq *cfs, struct sched_entity *se)
>
> minor: Why `struct cfs_rq *cfs` and not `struct cfs_rq *cfs_rq` ?
>
> Using `cfs_rq` would make it more consistent when looking for things
> like `cfs_rq->nr_running` for example.
>
> > +{
> > + struct sched_entity *next;
> > +
> > + if (se->latency_offset >= 0)
> > + return;
> > +
> > + if (cfs->nr_running <= 1)
> > + return;
> > + /*
> > + * When waking from idle, we don't need to check to preempt at wakeup
>
> s/idle/others ?

yes, I forgot to update the comment

>
> > + * the idle thread and don't set next buddy as a candidate for being
> > + * picked in priority.
> > + * In case of simultaneous wakeup from idle, the latency sensitive tasks
> > + * lost opportunity to preempt non sensitive tasks which woke up
> > + * simultaneously.
> > + */
>
> The position of this comment block within this function is somehow
> misleading since it describes the reason for the function rather then a
> particular condition within this function. Wouldn't it be more readable
> when it would be a function header comment instead?

I put it after the usual early return tests to put the comment close
to the useful part: the use of next buddy and __pick_first_entity()

>
> [...]
>
> > @@ -7148,6 +7181,22 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > }
> > #endif /* CONFIG_SMP */
> >
> > +static long wakeup_latency_gran(struct sched_entity *curr, struct sched_entity *se)
> > +{
> > + long latency_offset = se->latency_offset;
> > +
> > + /*
> > + * A negative latency weigth means that the sched_entity has latency
>
> s/weigth/latency_offset ?

yes

>
>
> > + * requirement that needs to be evaluated versus other entity.
> > + * Otherwise, use the latency weight to evaluate how much scheduling
> > + * delay is acceptable by se.
> > + */
> > + if ((se->latency_offset < 0) || (curr->latency_offset < 0))
> > + latency_offset -= curr->latency_offset;
>
> I still don't get the rationale behind why when either one (se or curr)
> of the latency_nice values is negative, we use the diff between them but
> if not, we only care about se's value. Why don't you always use the diff
> between se and curr? Since we have a range [-20 ... 19] why shouldn't we
> use the difference between let's say se = 19 and curr = 5?
> You discussed this with Tao Zhou on the v1 but I didn't understand it fully.

Let say that current has a latency nice prio of 19 and a task A with a
latency nice of 10 wakes up. Both tasks don't care about scheduling
latency (current more than task A). If we use the diff, the output of
wakeup_latency_gran() would be negative (-10ms) which reflects the
fact that the waking task is sensitive to the latency and wants to
preempt current even if its vruntime is after. But obviously both
current and task A don't care to preempt at wakeup.

>
> [...]