Re: [PATCH v8] sched: Consolidate cpufreq updates

From: Ingo Molnar
Date: Sun Feb 23 2025 - 04:14:21 EST



* Qais Yousef <qyousef@xxxxxxxxxxx> wrote:

> On 02/21/25 16:47, Ingo Molnar wrote:
> >
> > * Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >
> > > ---
> > > include/linux/sched/cpufreq.h | 4 +-
> > > kernel/sched/core.c | 116 +++++++++++++++++++++++++++--
> > > kernel/sched/cpufreq_schedutil.c | 122 +++++++++++++++++++------------
> > > kernel/sched/deadline.c | 10 ++-
> > > kernel/sched/fair.c | 84 +++++++++------------
> > > kernel/sched/rt.c | 8 +-
> > > kernel/sched/sched.h | 9 ++-
> > > kernel/sched/syscalls.c | 30 ++++++--
> > > 8 files changed, 266 insertions(+), 117 deletions(-)
> >
> > The changelog is rather long, and the diffstat is non-trivial.
> >
> > Could you please split this up into multiple patches?
>
> Sure. I did consider that but what stopped me is that I couldn't see
> how I could break them into independent patches. A lot of corner
> cases needed to be addressed and if I moved them to their own patches
> I'd potentially break bisectability of this code. If this is not a
> problem then I can see how I can do a better split. If it is a
> problem, I'll still try to think it over but it might require a bit
> of stretching. But I admit I didn't try to think it over that hard.

Yeah, so bisectability should definitely be preserved.

I had a quick look, and these changes look fairly easy to split up to
reduce size/complexity of individual patches. The following split looks
pretty natural:

# ============{ Preparatory changes with no change in functionality: }=========>

[PATCH 1/9] Extend check_class_changed() with the 'class_changed' return bool
# But don't use it at call sites yet

[PATCH 2/9] Introduce and maintain the sugov_cpu::last_iowait_update metric
# But don't use it yet

[PATCH 3/9] Extend sugov_iowait_apply() with a 'flags' parameter
# But don't use it yet internally

[PATCH 4/9] Extend sugov_next_freq_shared() with the 'flags' parameter
# But don't use it yet internally

[PATCH 5/9] Clean up the enqueue_task_fair() control flow to make it easier to extend
# This adds the goto restructuring but doesn't change functionality

[PATCH 6/9] Introduce and maintain the cfs_rq::decayed flag
# But don't use it yet

[PATCH 7/8] Extend __setscheduler_uclamp() with the 'update_cpufreq' return bool
# But don't use it yet

# ============{ Behavioral changes: }==========>

[PATCH 8/9] Change sugov_iowait_apply() behavior
[PATCH 9/9] Change sugov_next_freq_shared() bahavior

... etc.

This is only a quick stab at the most trivial split-ups, it's not a
complete list, and I saw other opportunities for split-up too. Please
make these changes as finegrained as possible, as it changes behavior
and there is a fair chance of behavioral regressions down the road -
especially as the patch itself notes that even the new logic isn't
perfect yet.

If the behavioral changes can be split into further steps, that would
be preferable too.

Also:

- Please make the rq->cfs.decayed logic unconditional on UP too, even
if it's not used. This reduces some of the ugly #ifdeffery AFAICS.

- Please don't add prototypes for internal static functions like
__update_cpufreq_ctx_switch(), define the functions in the right
order instead.

- Also, please read your comments and fix typos:

+ * This logic relied on PELT signal decays happening once every
+ * 1ms. But due to changes to how updates are done now, we can
+ * end up with more request coming up leading to iowait boost
+ * to be prematurely reduced. Make the assumption explicit
+ * until we improve the iowait boost logic to be better in
+ * general as it is due for an overhaul.

s/request
/requests

+ * We want to update cpufreq at context switch, but on systems with
+ * long TICK values, this can happen after a long time while more tasks
+ * would have been added meanwhile leaving us potentially running at
+ * inadequate frequency for extended period of time.

Either 'an inadequate frequency' or 'inadequate frequencies'.

+ * This logic should only apply when new fair task was added to the
+ * CPU, we'd want to defer to context switch as much as possible, but
+ * to avoid the potential delays mentioned above, let's check if this
+ * additional tasks warrants sending an update sooner.

s/when new fair task
/when a new fair task

s/this additional tasks
/this additional task

(I haven't checked the spelling exhaustively, there might be more.)

Thanks,

Ingo