Re: [PATCH v6] sched: Consolidate cpufreq updates
From: Qais Yousef
Date: Wed Jul 24 2024 - 17:23:09 EST
On 07/09/24 09:48, Vincent Guittot wrote:
> On Wed, 19 Jun 2024 at 22:14, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >
> > Improve the interaction with cpufreq governors by making the
> > cpufreq_update_util() calls more intentional.
> >
> > At the moment we send them when load is updated for CFS, bandwidth for
> > DL and at enqueue/dequeue for RT. But this can lead to too many updates
> > being sent in a short period of time, which may be ignored at a critical
> > moment due to the rate_limit_us in schedutil.
> >
> > For example, on a simultaneous enqueue of two tasks on a CPU where the
> > 2nd task is bigger and requires a higher freq, the cpufreq_update_util()
> > call triggered by the first task will lead to the 2nd request being
> > dropped until the tick, or until another CPU in the same policy triggers
> > a freq update shortly after.
> >
> > Updates at enqueue for RT are not strictly required, though they do help
> > to reduce the delay in switching the frequency and the potential
> > observation of a lower frequency during this delay. But the current
> > logic doesn't intentionally (at least to my understanding) try to speed
> > up the request.
> >
> > To help reduce the amount of cpufreq updates and make them more
> > purposeful, consolidate them into these locations:
> >
> > 1. context_switch()
> > 2. task_tick_fair()
> > 3. update_blocked_averages()
> > 4. on syscall that changes policy or uclamp values
> > 5. on check_preempt_wakeup_fair() if wakeup preemption failed
>
> So this above is the new thing to take care of enqueues that generate
> sudden updates of util_est and could require a frequency change, isn't
> it ?
Yes.
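Roughly, the idea when wakeup preemption fails is something along these lines
(sketch only, not the exact hunk; the flag is the one introduced in the
schedutil hunk further down):

	/*
	 * Wakeup preemption failed: the newly enqueued task will sit on
	 * the runqueue until the next context switch or tick, so poke
	 * cpufreq now in case its util_est demands a higher frequency
	 * than what current is running at.
	 */
	cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED);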
>
> >
> > The update at context switch should help guarantee that DL and RT get
> > the right frequency straightaway when they're RUNNING. As mentioned,
> > though, the update will happen slightly after enqueue_task(); in an
> > ideal world these tasks should be RUNNING ASAP and this additional
>
> That's probably ok for RT, although it means possibly running at the
> lowest frequency up to the switch, but I suppose it's not a big concern
> as it is already the case if the RT task ends up on a different CPU
> than the local one.
>
> I'm more concerned about DL tasks. cpu_bw_dl() reflects the min
> bandwidth/capacity to run all enqueued DL tasks in time. The dl
> bandwidth is updated when a DL task is enqueued and this
> bandwidth/capacity should be applied immediately and not at the next
> context switch, otherwise you will not have enough bandwidth if the
> newly enqueued DL task does not preempt the current DL task.
Hmm. Yes.
I moved the DL updates to __add_running_bw() with the new FORCE_UPDATE flag to
ensure the rate limit doesn't accidentally block it. There's no need for an
update at dequeue though, as the context switch will handle that.
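Roughly, the dl.c side ends up looking like this (sketch only; the exact
spelling of the force flag below is illustrative):

static inline void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
{
	u64 old = dl_rq->running_bw;

	lockdep_assert_rq_held(rq_of_dl_rq(dl_rq));
	dl_rq->running_bw += dl_bw;
	SCHED_WARN_ON(dl_rq->running_bw < old);	/* overflow */
	SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
	/*
	 * New DL bandwidth must be reflected immediately; don't let
	 * schedutil's rate limit drop this request (flag name is
	 * illustrative).
	 */
	cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_FORCE_UPDATE);
}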
> > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
> > + unsigned int flags)
> > {
> > s64 delta_ns;
> >
> > + delta_ns = time - sg_policy->last_freq_update_time;
> > +
> > + /*
> > + * We want to update cpufreq at context switch, but on systems with
> > + * long TICK values, this can happen after a long time while more tasks
> > + * could have been added in the meantime, leaving us potentially running
> > + * at an inadequate frequency for an extended period of time.
> > + *
> > + * This logic should only apply when a new fair task was added to the
> > + * CPU. We'd want to defer to the context switch as much as possible,
> > + * but to avoid the potential delays mentioned above, let's check if
> > + * this additional task warrants sending an update sooner.
> > + *
> > + * We want to ensure there's at least an update every
> > + * sysctl_sched_base_slice.
> > + */
> > + if (likely(flags & SCHED_CPUFREQ_TASK_ENQUEUED)) {
> > + if (delta_ns < sysctl_sched_base_slice)
>
> I'm not sure that this is the right condition. This value seems quite
> long to me and not that much different from a 4ms tick. Should we use
> the 1024us of the pelt period instead?
I thought about using 1ms, but opted for this. Comparing against NSEC_PER_MSEC
now. I think our base_slice should be 1ms by default, but maybe I am trying to
do too much in one go :)
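I.e. the check above becomes roughly the below (sketch; the surrounding
return logic is abbreviated and not the exact hunk):

	if (likely(flags & SCHED_CPUFREQ_TASK_ENQUEUED)) {
		/*
		 * Defer to the context switch when we can, but don't let
		 * a newly enqueued fair task wait longer than ~1ms for an
		 * update on systems with a long TICK.
		 */
		if (delta_ns < NSEC_PER_MSEC)
			return false;
		return true;
	}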
>
> Also, I ran the use case that I ran previously and I have cases where
> we wait more than 4.5 ms between the enqueue of the big task and the
> freq update (with a 1ms tick period), so there are probably some corner
> cases which are not correctly handled.
>
> My use case is 2 tasks running on the same cpu. 1 short task A running
> 500us with a period of 19457us and 1 long task B running 19000 us with
> a period of 139777us. The periods are set to never be in sync with the
> tick, which is 1 ms. There are cases when task B wakes up while task A
> is already running and doesn't preempt A, and the freq update happens
> only 4.5ms after task B wakes up and task A has gone back to sleep,
> whereas we should switch immediately from 760MHz to 1958MHz.
Thanks for that. From what I see we have two problems:
1. rq->cfs.decayed is not set to true after the new enqueue.
2. rate_limit_us can still block the update and there's nothing we can do
about this here.
I could reproduce the problem without my patch by the way.
When we fail to preempt, I now always force rq->cfs.decayed to be true. I'm not
sure if we're hitting another bug or whether decayed can legitimately be false
sometimes after an enqueue.
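The fix amounts to something like this in the failed-preemption path (sketch
only, not the exact hunk; decayed is the cfs_rq flag this series uses to gate
cpufreq updates):

	/*
	 * Preemption failed: current keeps running, but the freshly
	 * enqueued task may have bumped util_est. Force decayed so the
	 * cpufreq update isn't skipped just because the PELT signal
	 * itself didn't change on this enqueue.
	 */
	rq->cfs.decayed = true;
	cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED);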
> > @@ -614,6 +619,7 @@ int __sched_setscheduler(struct task_struct *p,
> > int retval, oldprio, newprio, queued, running;
> > const struct sched_class *prev_class;
> > struct balance_callback *head;
> > + bool update_cpufreq;
> > struct rq_flags rf;
> > int reset_on_fork;
> > int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
> > @@ -796,7 +802,8 @@ int __sched_setscheduler(struct task_struct *p,
> > __setscheduler_params(p, attr);
> > __setscheduler_prio(p, newprio);
> > }
> > - __setscheduler_uclamp(p, attr);
> > +
> > + update_cpufreq = __setscheduler_uclamp(p, attr);
> >
> > if (queued) {
> > /*
> > @@ -811,7 +818,14 @@ int __sched_setscheduler(struct task_struct *p,
> > if (running)
> > set_next_task(rq, p);
> >
> > - check_class_changed(rq, p, prev_class, oldprio);
> > + update_cpufreq |= check_class_changed(rq, p, prev_class, oldprio);
> > +
> > + /*
> > + * Changing class or uclamp value implies requiring to send cpufreq
> > + * update.
> > + */
> > + if (update_cpufreq && running)
>
> Why running? It should be queued, as we are max aggregating.
The new logic is to ignore changes at enqueue and defer them to the context
switch when the task becomes RUNNING.
That said, we do now actually send an update at enqueue if we fail to preempt.
I will do something similar here for a queued task to match the new logic.
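Something along the lines of the below (sketch only; update_cpufreq_current()
is the helper from the hunk above, and the exact form of the queued branch
isn't final):

	/*
	 * Changing class or uclamp value implies a cpufreq update. For a
	 * RUNNING task update the CPU it's running on; for a task that is
	 * only queued, poke its rq so max aggregation picks up the new
	 * constraints without waiting for the next context switch.
	 */
	if (update_cpufreq) {
		if (running)
			update_cpufreq_current(rq);
		else if (queued)
			cpufreq_update_util(rq, 0);
	}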
Thanks!
--
Qais Yousef
>
> > + update_cpufreq_current(rq);
> >
> > /* Avoid rq from going away on us: */
> > preempt_disable();
> > --
> > 2.34.1
> >