Re: [RFC PATCH v2 1/2] sched/fair: Introduce UTIL_FITS_CAPACITY feature (v2)

From: Chen Yu
Date: Wed Oct 25 2023 - 03:20:43 EST


On 2023-10-24 at 10:49:37 -0400, Mathieu Desnoyers wrote:
> On 2023-10-24 02:10, Chen Yu wrote:
> > On 2023-10-23 at 11:04:49 -0400, Mathieu Desnoyers wrote:
> > > On 2023-10-23 10:11, Dietmar Eggemann wrote:
> > > > On 19/10/2023 18:05, Mathieu Desnoyers wrote:
> > >
> > > [...]
> > > > > +static unsigned long scale_rt_capacity(int cpu);
> > > > > +
> > > > > +/*
> > > > > + * Returns true if adding the task utilization to the estimated
> > > > > + * utilization of the runnable tasks on @cpu does not exceed the
> > > > > + * capacity of @cpu.
> > > > > + *
> > > > > + * This considers only the utilization of _runnable_ tasks on the @cpu
> > > > > + * runqueue, excluding blocked and sleeping tasks. This is achieved by
> > > > > + * using the runqueue util_est.enqueued.
> > > > > + */
> > > > > +static inline bool task_fits_remaining_cpu_capacity(unsigned long task_util,
> > > > > + int cpu)
> > > >
> > > > Or like find_energy_efficient_cpu() (feec(), used in
> > > > Energy-Aware-Scheduling (EAS)) which uses cpu_util(cpu, p, cpu, 0) to get:
> > > >
> > > > max(util_avg(CPU + p), util_est(CPU + p))
> > >
> > > I've tried using cpu_util(), but unfortunately anything that considers
> > > blocked/sleeping tasks in its utilization total does not work for my
> > > use-case.
> > >
> > > From cpu_util():
> > >
> > > * CPU utilization is the sum of running time of runnable tasks plus the
> > > * recent utilization of currently non-runnable tasks on that CPU.
> > >
> >
> > I thought cpu_util() indicates the utilization decay sum of task that was once
> > "running" on this CPU, but will not sum up the "util/load" of the blocked/sleeping
> > task?
> >
> > accumulate_sum()
> > /* only the running task's util will be sum up */
> > if (running)
> > sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
> >
> > WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
>
> The accumulation into the cfs_rq->avg.util_sum indeed only happens when the task
> is running, which means that the task does not actively contribute to increment
> the util_sum when it is blocked/sleeping.
>
> However, when the task is blocked/sleeping, the task is still attached to the
> runqueue, and therefore its historic util_sum still contributes to the cfs_rq
> util_sum/util_avg. This completely differs from what happens when the task is
> migrated to a different runqueue, in which case its util_sum contribution is
> entirely removed from the cfs_rq util_sum:
>
> static void
> enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> [...]
> update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH)
> [...]
>
> static void
> dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> [...]
> if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> action |= DO_DETACH;
> [...]
>
> static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> [...]
> if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>
> /*
> * DO_ATTACH means we're here from enqueue_entity().
> * !last_update_time means we've passed through
> * migrate_task_rq_fair() indicating we migrated.
> *
> * IOW we're enqueueing a task on a new CPU.
> */
> attach_entity_load_avg(cfs_rq, se);
> update_tg_load_avg(cfs_rq);
>
> } else if (flags & DO_DETACH) {
> /*
> * DO_DETACH means we're here from dequeue_entity()
> * and we are migrating task out of the CPU.
> */
> detach_entity_load_avg(cfs_rq, se);
> update_tg_load_avg(cfs_rq);
> [...]
>
> In comparison, util_est_enqueue()/util_est_dequeue() are called from enqueue_task_fair()
> and dequeue_task_fair(), which include blocked/sleeping tasks scenarios. Therefore, util_est
> only considers runnable tasks in its cfs_rq->avg.util_est.enqueued.
>
> The current rq utilization total used for rq selection should not include historic
> utilization of all blocked/sleeping tasks, because we are taking a decision to bring
> back a recently blocked/sleeping task onto a runqueue at that point. Considering
> the historic util_sum from the set of other blocked/sleeping tasks still attached to that
> runqueue in the current utilization mistakenly makes the rq selection think that the rq is
> busier than it really is.
>

Thanks for the description in detail, it is very helpful! Now I understand that using cpu_util()
could overestimate the busyness of the CPU in UTIL_FITS_CAPACITY.

> I suspect that cpu_util_without() is an half-successful attempt at solving this by removing
> the task p from the considered utilization, but it does not take into account scenarios where many
> other tasks happen to be blocked/sleeping as well.

Agree, those non-migrated tasks could contribute to cfs_rq's util_avg.

thanks,
Chenyu