Re: [RFC PATCH 1/6] sched/fair: Add util_guest for tasks

From: Dietmar Eggemann
Date: Wed Apr 05 2023 - 06:51:02 EST


On 04/04/2023 03:11, David Dai wrote:
> On Mon, Apr 3, 2023 at 4:40 AM Dietmar Eggemann
> <dietmar.eggemann@xxxxxxx> wrote:
>>
>> Hi David,
> Hi Dietmar, thanks for your comments.
>>
>> On 31/03/2023 00:43, David Dai wrote:

[...]

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 6986ea31c984..998649554344 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4276,14 +4276,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf);
>>>
>>> static inline unsigned long task_util(struct task_struct *p)
>>> {
>>> - return READ_ONCE(p->se.avg.util_avg);
>>> + return max(READ_ONCE(p->se.avg.util_avg),
>>> + READ_ONCE(p->se.avg.util_guest));
>>> }
>>>
>>> static inline unsigned long _task_util_est(struct task_struct *p)
>>> {
>>> struct util_est ue = READ_ONCE(p->se.avg.util_est);
>>>
>>> - return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
>>> + return max_t(unsigned long, READ_ONCE(p->se.avg.util_guest),
>>> + max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED)));
>>> }
>>
>> I can't see why the existing p->uclamp_req[UCLAMP_MIN].value can't be
>> used here instead p->se.avg.util_guest.
> Using p->uclamp_req[UCLAMP_MIN].value would result in folding
> uclamp values into task_util and task_util_est for all tasks that
> have uclamp values set. The intent of these patches isn’t to modify
> existing uclamp behaviour. Setting uclamp values would then also
> override the util values coming from the guest.
>>
>> I do understand the issue of inheriting uclamp values at fork but don't
>> get the not being `additive` thing. We are at task level here.

> Uclamp values are max-aggregated with other tasks at the runqueue
> level when deciding CPU frequency. For example, a vCPU runqueue may
> have a util of 512 that results in setting uclamp_min to 512 on the
> vCPU task. This is insufficient to drive a frequency response if it
> shares the runqueue with another host task running with a util of
> 512, as it would result in a clamped util value of 512 at the
> runqueue (e.g. if a guest thread had just migrated onto this vCPU).

OK, I see your point now. You want an accurate per-task boost for this
vCPU task on the host run-queue.
And a scenario in which a vCPU can ask for 100% in these moments is not
sufficient, I guess? In that case uclamp_min could work.

>> The fact that you have to max util_avg and util_est directly in
>> task_util() and _task_util_est() tells me that there are places where
>> this helps and uclamp_task_util() is not called there.
> Can you clarify on this point a bit more?

Sorry, I meant s/util_est/util_guest/.

The effect of the change in _task_util_est() you see via:

enqueue_task_fair()
util_est_enqueue()
cfs_rq->avg.util_est.enqueued += _task_util_est(p)

so that `sugov_get_util() -> cpu_util_cfs() ->
cfs_rq->avg.util_est.enqueued` can see the effect of util_guest?

Not sure about the change in task_util() yet.

>> When you say in the cover letter that you tried uclamp_min, how exactly
>> did you use it? Did you run the existing mainline or did you use
>> uclamp_min as a replacement for util_guest in this patch here?

> I called sched_setattr_nocheck() with .sched_flags =
> SCHED_FLAG_UTIL_CLAMP when updating uclamp_min, and uclamp_max is
> left at 1024. uclamp_min was not aggregated with task_util and
> task_util_est during my testing. The only caveat there is that I
> added a change to only reset uclamp on fork when testing (I realize
> there is specifically a SCHED_FLAG_RESET_ON_FORK, but I didn’t want
> to reset other sched attributes).

OK, understood. It's essentially a util_est v2 for vCPU tasks on host.

>>> static inline unsigned long task_util_est(struct task_struct *p)
>>> @@ -6242,6 +6244,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>> */
>>> util_est_enqueue(&rq->cfs, p);
>>>
>>> + /*
>>> + * The normal code path for host thread enqueue doesn't take into
>>> + * account guest task migrations when updating cpufreq util.
>>> + * So, always update the cpufreq when a vCPU thread has a
>>> + * non-zero util_guest value.
>>> + */
>>> + if (READ_ONCE(p->se.avg.util_guest))
>>> + cpufreq_update_util(rq, 0);
>>
>>
>> This is because enqueue_entity() -> update_load_avg() ->
>> attach_entity_load_avg() -> cfs_rq_util_change() requires root run-queue
>> (&rq->cfs == cfs_rq) to call cpufreq_update_util()?
> The enqueue_entity() would not reach attach_entity_load_avg() (and
> hence cfs_rq_util_change()) because of the check for
> !se->avg.last_update_time: se->avg.last_update_time is non-zero
> because the vCPU task did not migrate before this enqueue. This
> enqueue path is reached when util_guest is updated for the vCPU
> task through the sched_setattr_nocheck() call, where we want to
> ensure a frequency update occurs.

OK, vCPU tasks are pinned so always !WF_MIGRATED wakeup I guess?