Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

From: David Dai
Date: Mon Apr 03 2023 - 21:12:05 EST


On Mon, Apr 3, 2023 at 4:40 AM Dietmar Eggemann
<dietmar.eggemann@xxxxxxx> wrote:
>
> Hi David,
Hi Dietmar, thanks for your comments.
>
> On 31/03/2023 00:43, David Dai wrote:
> > For virtualization usecases, util_est and util_avg currently tracked
> > on the host aren't sufficient to accurately represent the workload on
> > vCPU threads, which results in poor frequency selection and performance.
> > For example, when a large workload migrates from a busy vCPU thread to
> > an idle vCPU thread, it incurs additional DVFS ramp-up latencies
> > as util accumulates.
> >
> > Introduce a new "util_guest" member as an additional PELT signal that's
> > independently updated by the guest. When used, it's max aggregated to
> > provide a boost to both task_util and task_util_est.
> >
> > Updating task_util and task_util_est will ensure:
> > -Better task placement decisions for vCPU threads on the host
> > -Correctly updating util_est.ewma during dequeue
> > -Additive util with other threads on the same runqueue for more
> > accurate frequency responses
> >
> > Co-developed-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > Signed-off-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > Signed-off-by: David Dai <davidai@xxxxxxxxxx>
> > ---
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6986ea31c984..998649554344 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
> >
> > static inline unsigned long task_util(struct task_struct *p)
> > {
> > - return READ_ONCE(p->se.avg.util_avg);
> > + return max(READ_ONCE(p->se.avg.util_avg),
> > + READ_ONCE(p->se.avg.util_guest));
> > }
> >
> > static inline unsigned long _task_util_est(struct task_struct *p)
> > {
> > struct util_est ue = READ_ONCE(p->se.avg.util_est);
> >
> > - return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
> > + return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
> > + max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
> > }
>
> I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
> used here instead of p->se.avg.util_guest.
Using p->uclamp_req[UCLAMP_MIN].value would fold uclamp values into
task_util and task_util_est for every task that has uclamp values set,
not just vCPU threads, and the intent of these patches isn't to modify
existing uclamp behaviour. It would also mean that userspace setting
uclamp values on a vCPU thread overrides the util values reported by
the guest.
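To make that concrete, here is a rough sketch of what folding the
uclamp request in would look like (my illustration, not code from this
series; assumes CONFIG_UCLAMP_TASK):

static inline unsigned long task_util_uclamp_folded(struct task_struct *p)
{
	/*
	 * Hypothetical variant: every task with a non-zero uclamp_min,
	 * not just vCPU threads, would see its reported util boosted.
	 */
	return max_t(unsigned long, READ_ONCE(p->se.avg.util_avg),
		     (unsigned long)p->uclamp_req[UCLAMP_MIN].value);
}

That is exactly the kind of uclamp semantics change we want to avoid.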
>
> I do understand the issue of inheriting uclamp values at fork but don't
> get the not being `additive` thing. We are at task level here.
Uclamp values are max-aggregated with other tasks at the runqueue
level when deciding CPU frequency. For example, when a guest thread
has just migrated onto a vCPU, the guest-side runqueue for that vCPU
may show a util of 512, which translates into setting uclamp_min to
512 on the vCPU thread on the host. If that vCPU thread shares the
host runqueue with another host task running at a util of 512, the
clamped runqueue util is still only 512, which is insufficient to
drive the frequency response needed for both.
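To put illustrative numbers on it: the vCPU thread's host-side
util_avg is still near 0 right after the guest thread migrates in, so
the runqueue sum is ~512 from the co-running host task alone, and
clamping that to a max-aggregated uclamp_min of 512 still gives 512.
With util_guest, the vCPU thread itself contributes
max(util_avg ~0, util_guest 512) = 512 on top of the host task's 512,
so the runqueue sees ~1024 and the frequency request covers both.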
>
> The fact that you have to max util_avg and util_est directly in
> task_util() and _task_util_est() tells me that there are places where
> this helps and uclamp_task_util() is not called there.
Could you clarify this point a bit more?
>
> When you say in the cover letter that you tried uclamp_min, how exactly
> did you use it? Did you run the existing mainline or did you use
> uclamp_min as a replacement for util_guest in this patch here?
I called sched_setattr_nocheck() with .sched_flags =
SCHED_FLAG_UTIL_CLAMP when updating uclamp_min, and left uclamp_max at
1024. uclamp_min was not aggregated with task_util and task_util_est
during my testing. The only caveat is that, for testing, I added a
change to reset only uclamp on fork (I realize there is a dedicated
SCHED_FLAG_RESET_ON_FORK, but I didn't want to reset other sched
attributes).
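Roughly, the call looked like the sketch below (simplified from
memory; vcpu_task and guest_util are placeholder names for the vCPU's
host thread and the util value reported by the guest):

	struct sched_attr attr = {
		.sched_flags	= SCHED_FLAG_UTIL_CLAMP,
		.sched_util_min	= guest_util,	/* util reported by the guest */
		.sched_util_max	= 1024,
	};

	sched_setattr_nocheck(vcpu_task, &attr);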
>
> > static inline unsigned long task_util_est(struct task_struct *p)
> > @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > */
> > util_est_enqueue(&rq->cfs, p);
> >
> > + /*
> > + * The normal code path for host thread enqueue doesn't take into
> > + * account guest task migrations when updating cpufreq util.
> > + * So, always update the cpufreq when a vCPU thread has a
> > + * non-zero util_guest value.
> > + */
> > + if (READ_ONCE(p->se.avg.util_guest))
> > + cpufreq_update_util(rq, 0);
>
>
> This is because enqueue_entity() -> update_load_avg() ->
> attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
> (&rq->cfs == cfs_rq) to call cpufreq_update_util()?
enqueue_entity() does call update_load_avg() here, but the
attach_entity_load_avg() -> cfs_rq_util_change() path is skipped by
the check on !se->avg.last_update_time: se->avg.last_update_time is
non-zero because the vCPU task did not migrate before this enqueue.
This enqueue path is reached when util_guest is updated for the vCPU
task through the sched_setattr_nocheck() call, where we want to ensure
a frequency update occurs.
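To spell out the flow I mean, a simplified paraphrase of the mainline
enqueue path (not verbatim kernel code):

	/* inside update_load_avg(), roughly: */
	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
		/* enqueue after a migration: attach and nudge cpufreq */
		attach_entity_load_avg(cfs_rq, se);	/* -> cfs_rq_util_change() */
	}
	/*
	 * No migration (last_update_time != 0): no attach, so this enqueue
	 * is not guaranteed to reach cpufreq_update_util(), hence the
	 * explicit call in the hunk quoted above.
	 */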
>
> [...]