Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

From: Saravana Kannan
Date: Wed Apr 05 2023 - 17:43:16 EST


On Wed, Apr 5, 2023 at 3:50 AM Dietmar Eggemann
<dietmar.eggemann@xxxxxxx> wrote:
>
> On 04/04/2023 03:11, David Dai wrote:
> > On Mon, Apr 3, 2023 at 4:40 AM Dietmar Eggemann
> > <dietmar.eggemann@xxxxxxx> wrote:
> >>
> >> Hi David,
> > Hi Dietmar, thanks for your comments.
> >>
> >> On 31/03/2023 00:43, David Dai wrote:
>
> [...]
>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 6986ea31c984..998649554344 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
> >>>
> >>> static inline unsigned long task_util(struct task_struct *p)
> >>> {
> >>> - return READ_ONCE(p->se.avg.util_avg);
> >>> + return max(READ_ONCE(p->se.avg.util_avg),
> >>> + READ_ONCE(p->se.avg.util_guest));
> >>> }
> >>>
> >>> static inline unsigned long _task_util_est(struct task_struct *p)
> >>> {
> >>> struct util_est ue = READ_ONCE(p->se.avg.util_est);
> >>>
> >>> - return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
> >>> + return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
> >>> + max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
> >>> }
> >>
> >> I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
> >> used here instead p->se.avg.util_guest.
> > Using p->uclamp_req[UCLAMP_MIN].value would result in folding in
> > uclamp values into task_util and task_util_est for all tasks that have
> > uclamp values set. The intent of these patches isn’t to modify
> > existing uclamp behaviour. Users would also override util values from
> > the guest when they set uclamp values.
> >>
> >> I do understand the issue of inheriting uclamp values at fork but don't
> >> get the not being `additive` thing. We are at task level here.
>
> > Uclamp values are max aggregated with other tasks at the runqueue
> > level when deciding CPU frequency. For example, a vCPU runqueue may
> > have an util of 512 that results in setting 512 to uclamp_min on the
> > vCPU task. This is insufficient to drive a frequency response if it
> > shares the runqueue with another host task running with util of 512 as
> > it would result in a clamped util value of 512 at the runqueue(Ex. If
> > a guest thread had just migrated onto this vCPU).
>
> OK, see your point now. You want an accurate per-task boost for this
> vCPU task on the host run-queue.
> And a scenario in which a vCPU can ask for 100% in these moments is not
> sufficient I guess? In this case uclamp_min could work.

Right. vCPU can have whatever utilization and there can be random host
threads completely unrelated to the VM. And we need to aggregate both
of their util when deciding CPU freq.

>
> >> The fact that you have to max util_avg and util_est directly in
> >> task_util() and _task_util_est() tells me that there are places where
> >> this helps and uclamp_task_util() is not called there.
> > Can you clarify on this point a bit more?
>
> Sorry, I meant s/util_est/util_guest/.
>
> The effect of the change in _task_util_est() you see via:
>
> enqueue_task_fair()
> util_est_enqueue()
> cfs_rq->avg.util_est.enqueued += _task_util_est(p)
>
> so that `sugov_get_util() -> cpu_util_cfs() ->
> cfs_rq->avg.util_est.enqueued` can see the effect of util_guest?
>
> Not sure about the change in task_util() yet.
>
> >> When you say in the cover letter that you tried uclamp_min, how exactly
> >> did you use it? Did you run the existing mainline or did you use
> >> uclamp_min as a replacement for util_guest in this patch here?
>
> > I called sched_setattr_nocheck() with .sched_flags =
> > SCHED_FLAG_UTIL_CLAMP when updating uclamp_min and clamp_max is left
> > at 1024. Uclamp_min was not aggregated with task_util and
> > task_util_est during my testing. The only caveat there is that I added
> > a change to only reset uclamp on fork when testing(I realize there is
> > specifically a SCHED_FLAG_RESET_ON_FORK, but I didn’t want to reset
> > other sched attributes).
>
> OK, understood. It's essentially a util_est v2 for vCPU tasks on host.

Yup. We initially looked into just overwriting util_est, but didn't
think that'll land well with the community :) as it was a bit messier
because we needed to make sure the current util_est update paths don't
run for vCPU tasks on host (because those values would be wrong).

> >>> static inline unsigned long task_util_est(struct task_struct *p)
> >>> @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>> */
> >>> util_est_enqueue(&rq->cfs, p);
> >>>
> >>> + /*
> >>> + * The normal code path for host thread enqueue doesn't take into
> >>> + * account guest task migrations when updating cpufreq util.
> >>> + * So, always update the cpufreq when a vCPU thread has a
> >>> + * non-zero util_guest value.
> >>> + */
> >>> + if (READ_ONCE(p->se.avg.util_guest))
> >>> + cpufreq_update_util(rq, 0);
> >>
> >>
> >> This is because enqueue_entity() -> update_load_avg() ->
> >> attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
> >> (&rq->cfs == cfs_rq) to call cpufreq_update_util()?
> > The enqueue_entity() would not call into update_load_avg() due to the
> > check for !se->avg.last_update_time. se->avg.last_update_time is
> > non-zero because the vCPU task did not migrate before this enqueue.
> > This enqueue path is reached when util_guest is updated for the vCPU
> > task through the sched_setattr_nocheck call where we want to ensure a
> > frequency update occurs.
>
> OK, vCPU tasks are pinned so always !WF_MIGRATED wakeup I guess?

Even if say little-vCPU threads are allowed to migrate within little
CPUs, this will still be an issue. While a vCPU thread is continuously
running on a single CPU, a guest thread can migrate into that vCPU and
cause a huge increase in util_guest. But that won't trigger an cpufreq
update on the host side because the host doesn't see a task migration.
That's what David is trying to address.

-Saravana