Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

From: David Dai
Date: Wed Apr 05 2023 - 19:36:26 EST


On Wed, Apr 5, 2023 at 2:43 PM Saravana Kannan <saravanak@xxxxxxxxxx> wrote:
>
> On Wed, Apr 5, 2023 at 3:50 AM Dietmar Eggemann
> <dietmar.eggemann@xxxxxxx> wrote:
> >
> > On 04/04/2023 03:11, David Dai wrote:
> > > On Mon, Apr 3, 2023 at 4:40 AM Dietmar Eggemann
> > > <dietmar.eggemann@xxxxxxx> wrote:
> > >>
> > >> Hi David,
> > > Hi Dietmar, thanks for your comments.
> > >>
> > >> On 31/03/2023 00:43, David Dai wrote:
> >
> > [...]
> >
> > >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > >>> index 6986ea31c984..998649554344 100644
> > >>> --- a/kernel/sched/fair.c
> > >>> +++ b/kernel/sched/fair.c
> > >>> @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
> > >>>
> > >>> static inline unsigned long task_util(struct task_struct *p)
> > >>> {
> > >>> -       return READ_ONCE(p->se.avg.util_avg);
> > >>> +       return max(READ_ONCE(p->se.avg.util_avg),
> > >>> +                  READ_ONCE(p->se.avg.util_guest));
> > >>> }
> > >>>
> > >>> static inline unsigned long _task_util_est(struct task_struct *p)
> > >>> {
> > >>>         struct util_est ue = READ_ONCE(p->se.avg.util_est);
> > >>>
> > >>> -       return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
> > >>> +       return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
> > >>> +                    max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
> > >>> }
> > >>
> > >> I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
> > >> used here instead of p->se.avg.util_guest.
> > > Using p->uclamp_req[UCLAMP_MIN].value would result in folding in
> > > uclamp values into task_util and task_util_est for all tasks that have
> > > uclamp values set. The intent of these patches isn’t to modify
> > > existing uclamp behaviour. Users would also override util values from
> > > the guest when they set uclamp values.
> > >>
> > >> I do understand the issue of inheriting uclamp values at fork but don't
> > >> get the not being `additive` thing. We are at task level here.
> >
> > > Uclamp values are max aggregated with other tasks at the runqueue
> > > level when deciding CPU frequency. For example, a vCPU runqueue may
> > > have an util of 512 that results in setting 512 to uclamp_min on the
> > > vCPU task. This is insufficient to drive a frequency response if it
> > > shares the runqueue with another host task running with util of 512 as
> > > it would result in a clamped util value of 512 at the runqueue (e.g. if
> > > a guest thread had just migrated onto this vCPU).
> >
> > OK, see your point now. You want an accurate per-task boost for this
> > vCPU task on the host run-queue.
> > And a scenario in which a vCPU can ask for 100% in these moments is not
> > sufficient I guess? In this case uclamp_min could work.
>
> Right. vCPU can have whatever utilization and there can be random host
> threads completely unrelated to the VM. And we need to aggregate both
> of their util when deciding CPU freq.
>
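To make the 512 + 512 example above concrete, here is a minimal userspace sketch of the two aggregation schemes. It is not kernel code: `rq_util_with_uclamp_min()` and `rq_util_with_util_guest()` are hypothetical helpers, `CAPACITY_SCALE` stands in for the 1024 full-scale value, and the model assumes the vCPU task's own host-side util is still close to zero just after a guest thread migrated onto the vCPU.

```c
#include <assert.h>

/* Assumption: full-scale utilization of 1024. */
#define CAPACITY_SCALE 1024UL

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long clamp_ul(unsigned long v, unsigned long lo, unsigned long hi)
{
	assert(lo <= hi);
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * uclamp_min model: task utils are summed first, then the runqueue-level
 * clamp is applied. The clamp only raises the floor; it adds nothing on
 * top of the util that is already there.
 */
unsigned long rq_util_with_uclamp_min(unsigned long host_util,
				      unsigned long vcpu_host_util,
				      unsigned long vcpu_uclamp_min)
{
	return clamp_ul(host_util + vcpu_host_util,
			vcpu_uclamp_min, CAPACITY_SCALE);
}

/*
 * util_guest model: the guest value is folded into the vCPU task's own
 * util before summation, so it adds to the host task's util. The result
 * is capped at CAPACITY_SCALE for this sketch.
 */
unsigned long rq_util_with_util_guest(unsigned long host_util,
				      unsigned long vcpu_host_util,
				      unsigned long util_guest)
{
	unsigned long sum = host_util + max_ul(vcpu_host_util, util_guest);

	return sum > CAPACITY_SCALE ? CAPACITY_SCALE : sum;
}
```

With a host task at 512 and a guest request of 512, the uclamp_min model stays at 512 while the util_guest model reaches 1024, which is the gap being discussed.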
> >
> > >> The fact that you have to max util_avg and util_est directly in
> > >> task_util() and _task_util_est() tells me that there are places where
> > >> this helps and uclamp_task_util() is not called there.
> > > Can you clarify on this point a bit more?
> >
> > Sorry, I meant s/util_est/util_guest/.
> >
> > The effect of the change in _task_util_est() you see via:
> >
> >   enqueue_task_fair()
> >     util_est_enqueue()
> >       cfs_rq->avg.util_est.enqueued += _task_util_est(p)
> >
> > so that `sugov_get_util() -> cpu_util_cfs() ->
> > cfs_rq->avg.util_est.enqueued` can see the effect of util_guest?

That sequence looks correct to me.

> >
> > Not sure about the change in task_util() yet.

task_util() provides some signaling in addition to task_util_est() via:

find_energy_efficient_cpu()
    cpu_util_next()
        lsub_positive(&util, task_util(p));
        ...
        util += task_util(p);
        // Can provide a better signal than util_est.

dequeue_task_fair()
    util_est_update()
        ue.enqueued = task_util(p);
        // Updates ue.ewma
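
As a rough illustration of the dequeue path above, the following userspace sketch shows how a task_util() that folds in util_guest would feed both ue.enqueued and ue.ewma. It is a simplified model and not the kernel function: `struct util_est_model`, `task_util_model()` and `util_est_dequeue_model()` are hypothetical names, the 1/4 weighting and the jump-up-immediately behaviour are assumptions about how the averaging works, and the margin checks and the UTIL_AVG_UNCHANGED flag are left out.

```c
/* Hypothetical userspace stand-in for the kernel's struct util_est. */
struct util_est_model {
	unsigned int enqueued;
	unsigned int ewma;
};

/* Assumption: a new sample contributes 1/4 to the moving average. */
#define EWMA_WEIGHT_SHIFT 2

/* Mirrors the patched task_util(): max of util_avg and util_guest. */
static unsigned long task_util_model(unsigned long util_avg,
				     unsigned long util_guest)
{
	return util_avg > util_guest ? util_avg : util_guest;
}

void util_est_dequeue_model(struct util_est_model *ue,
			    unsigned long util_avg,
			    unsigned long util_guest)
{
	ue->enqueued = (unsigned int)task_util_model(util_avg, util_guest);

	/* Ramp up at once when the new sample exceeds the average. */
	if (ue->ewma < ue->enqueued) {
		ue->ewma = ue->enqueued;
		return;
	}

	/* Otherwise decay: ewma -= (ewma - enqueued) / 4, via shifts. */
	ue->ewma = ((ue->ewma << EWMA_WEIGHT_SHIFT) -
		    (ue->ewma - ue->enqueued)) >> EWMA_WEIGHT_SHIFT;
}
```

In this model a large util_guest reaches ue.ewma on the next dequeue even while util_avg is still low, and the average decays gradually once the guest value drops.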

Thanks,
David

> >
> > >> When you say in the cover letter that you tried uclamp_min, how exactly
> > >> did you use it? Did you run the existing mainline or did you use
> > >> uclamp_min as a replacement for util_guest in this patch here?
> >
> > > I called sched_setattr_nocheck() with .sched_flags =
> > > SCHED_FLAG_UTIL_CLAMP when updating uclamp_min; uclamp_max was left
> > > at 1024. Uclamp_min was not aggregated with task_util and
> > > task_util_est during my testing. The only caveat there is that I added
> > > a change to only reset uclamp on fork when testing (I realize there is
> > > specifically a SCHED_FLAG_RESET_ON_FORK, but I didn’t want to reset
> > > other sched attributes).
> >
> > OK, understood. It's essentially a util_est v2 for vCPU tasks on host.
>
> Yup. We initially looked into just overwriting util_est, but didn't
> think that'll land well with the community :) as it was a bit messier
> because we needed to make sure the current util_est update paths don't
> run for vCPU tasks on host (because those values would be wrong).
>
> > >>> static inline unsigned long task_util_est(struct task_struct *p)
> > >>> @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >>>          */
> > >>>         util_est_enqueue(&rq->cfs, p);
> > >>>
> > >>> +       /*
> > >>> +        * The normal code path for host thread enqueue doesn't take into
> > >>> +        * account guest task migrations when updating cpufreq util.
> > >>> +        * So, always update the cpufreq when a vCPU thread has a
> > >>> +        * non-zero util_guest value.
> > >>> +        */
> > >>> +       if (READ_ONCE(p->se.avg.util_guest))
> > >>> +               cpufreq_update_util(rq, 0);
> > >>
> > >>
> > >> This is because enqueue_entity() -> update_load_avg() ->
> > >> attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
> > >> (&rq->cfs == cfs_rq) to call cpufreq_update_util()?
> > > The enqueue_entity() would not call into update_load_avg() due to the
> > > check for !se->avg.last_update_time. se->avg.last_update_time is
> > > non-zero because the vCPU task did not migrate before this enqueue.
> > > This enqueue path is reached when util_guest is updated for the vCPU
> > > task through the sched_setattr_nocheck call where we want to ensure a
> > > frequency update occurs.
> >
> > OK, vCPU tasks are pinned so always !WF_MIGRATED wakeup I guess?
>
> Even if say little-vCPU threads are allowed to migrate within little
> CPUs, this will still be an issue. While a vCPU thread is continuously
> running on a single CPU, a guest thread can migrate into that vCPU and
> cause a huge increase in util_guest. But that won't trigger a cpufreq
> update on the host side because the host doesn't see a task migration.
> That's what David is trying to address.
>
> -Saravana