Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

From: Dietmar Eggemann
Date: Mon Apr 03 2023 - 07:40:41 EST


Hi David,

On 31/03/2023 00:43, David Dai wrote:
> For virtualization usecases, util_est and util_avg currently tracked
> on the host aren't sufficient to accurately represent the workload on
> vCPU threads, which results in poor frequency selection and performance.
> For example, when a large workload migrates from a busy vCPU thread to
> an idle vCPU thread, it incurs additional DVFS ramp-up latencies
> as util accumulates.
>
> Introduce a new "util_guest" member as an additional PELT signal that's
> independently updated by the guest. When used, it's max aggregated to
> provide a boost to both task_util and task_util_est.
>
> Updating task_util and task_util_est will ensure:
> -Better task placement decisions for vCPU threads on the host
> -Correctly updating util_est.ewma during dequeue
> -Additive util with other threads on the same runqueue for more
> accurate frequency responses
>
> Co-developed-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> Signed-off-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> Signed-off-by: David Dai <davidai@xxxxxxxxxx>
> ---

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6986ea31c984..998649554344 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
>
> static inline unsigned long task_util(struct task_struct *p)
> {
> -	return READ_ONCE(p->se.avg.util_avg);
> +	return max(READ_ONCE(p->se.avg.util_avg),
> +		   READ_ONCE(p->se.avg.util_guest));
> }
>
> static inline unsigned long _task_util_est(struct task_struct *p)
> {
> 	struct util_est ue = READ_ONCE(p->se.avg.util_est);
>
> -	return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
> +	return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
> +		     max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
> }

I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
used here instead of p->se.avg.util_guest.

I do understand the issue of inheriting uclamp values at fork, but I
don't follow the argument that uclamp is not `additive`. We are at task
level here.

The fact that you have to max util_avg and util_est directly in
task_util() and _task_util_est() tells me that there are places where
this helps and uclamp_task_util() is not called there.

When you say in the cover letter that you tried uclamp_min, how exactly
did you use it? Did you run the existing mainline or did you use
uclamp_min as a replacement for util_guest in this patch here?

> static inline unsigned long task_util_est(struct task_struct *p)
> @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> 	 */
> 	util_est_enqueue(&rq->cfs, p);
>
> +	/*
> +	 * The normal code path for host thread enqueue doesn't take into
> +	 * account guest task migrations when updating cpufreq util.
> +	 * So, always update the cpufreq when a vCPU thread has a
> +	 * non-zero util_guest value.
> +	 */
> +	if (READ_ONCE(p->se.avg.util_guest))
> +		cpufreq_update_util(rq, 0);


This is because enqueue_entity() -> update_load_avg() ->
attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
(&rq->cfs == cfs_rq) to call cpufreq_update_util()?

[...]