Re: [PATCH v4] sched: Consolidate cpufreq updates

From: Hongyan Xia
Date: Fri May 17 2024 - 07:07:05 EST

Next message: Gao Xiang: "Re: [PATCH] erofs: clean up erofs_show_options()"
Previous message: Krzysztof Wilczyński: "Re: [PATCH v2 1/2] PCI: endpoint: pci-epf-test: Make use of cached 'epc_features' in pci_epf_test_core_init()"
In reply to: Qais Yousef: "[PATCH v4] sched: Consolidate cpufreq updates"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 16/05/2024 21:48, Qais Yousef wrote:

Improve the interaction with cpufreq governors by making the
cpufreq_update_util() calls more intentional.

At the moment we send them when load is updated for CFS, bandwidth for
DL and at enqueue/dequeue for RT. But this can lead to too many updates
sent in a short period of time and potentially be ignored at a critical
moment due to the rate_limit_us in schedutil.

For example, simultaneous task enqueue on the CPU where 2nd task is
bigger and requires higher freq. The trigger to cpufreq_update_util() by
the first task will lead to dropping the 2nd request until tick. Or
another CPU in the same policy triggers a freq update shortly after.

Updates at enqueue for RT are not strictly required. Though they do help
to reduce the delay for switching the frequency and the potential
observation of lower frequency during this delay. But current logic
doesn't intentionally (at least to my understanding) try to speed up the
request.

To help reduce the amount of cpufreq updates and make them more
purposeful, consolidate them into these locations:

1. context_switch()
2. task_tick_fair()
3. update_blocked_averages()
4. on syscall that changes policy or uclamp values

The update at context switch should help guarantee that DL and RT get
the right frequency straightaway when they're RUNNING. As mentioned
though the update will happen slightly after enqueue_task(); though in
an ideal world these tasks should be RUNNING ASAP and this additional
delay should be negligible. For fair tasks we need to make sure we send
a single update for every decay for the root cfs_rq. Any changes to the
rq will be deferred until the next task is ready to run, or we hit TICK.
But we are guaranteed the task is running at a level that meets its
requirements after enqueue.

To guarantee RT and DL tasks updates are never missed, we add a new
SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
already running at the right freq, the governor will end up doing
nothing, but we eliminate the risk of the task ending up accidentally
running at the wrong freq due to rate_limit_us.

Similarly for iowait boost, we ignore rate limits. We also handle a case
of a boost reset prematurely by adding a guard in sugov_iowait_apply()
to reduce the boost after 1ms which seems iowait boost mechanism relied
on rate_limit_us and cfs_rq.decay preventing any updates to happen soon
after iowait boost.

The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
time stamps otherwise we can end up delaying updates for normal
requests.

As a simple optimization, we avoid sending cpufreq updates when
switching from RT to another RT as RT tasks run at max freq by default.
If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if
uclamp_min is different to avoid unnecessary cpufreq update as most RT
tasks are likely to be running at the same performance level, so we can
avoid unnecessary overhead of forced updates when there's nothing to do.

We also ensure to ignore cpufreq udpates for sugov workers at context
switch. It doesn't make sense for the kworker that applies the frequency
update (which is a DL task) to trigger a frequency update itself.

The update at task_tick_fair will guarantee that the governor will
follow any updates to load for tasks/CPU or due to new enqueues/dequeues
to the rq. Since DL and RT always run at constant frequencies and have
no load tracking, this is only required for fair tasks.

The update at update_blocked_averages() will ensure we decay frequency
as the CPU becomes idle for long enough.

If the currently running task changes its policy or uclamp values, we
ensure we follow up with cpufreq update to ensure we follow up with any
potential new perf requirements based on the new change.

[...]

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index bdd31ab93bc5..2d0a45aba16f 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -8,7 +8,8 @@
* Interface between cpufreq drivers and the scheduler:
*/
-#define SCHED_CPUFREQ_IOWAIT (1U << 0)
+#define SCHED_CPUFREQ_IOWAIT (1U << 0)
+#define SCHED_CPUFREQ_FORCE_UPDATE (1U << 1) /* ignore transition_delay_us */
#ifdef CONFIG_CPU_FREQ
struct cpufreq_policy;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1a914388144a..d0c97a66627a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -152,6 +152,9 @@ const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
__read_mostly int scheduler_running;
+static __always_inline void
+update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev);
+
#ifdef CONFIG_SCHED_CORE
DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
@@ -1958,7 +1961,7 @@ static bool uclamp_reset(const struct sched_attr *attr,
return false;
}
-static void __setscheduler_uclamp(struct task_struct *p,
+static void __setscheduler_uclamp(struct rq *rq, struct task_struct *p,
const struct sched_attr *attr)
{
enum uclamp_id clamp_id;
@@ -1980,7 +1983,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
value = uclamp_none(clamp_id);
uclamp_se_set(uc_se, value, false);
-
}
if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
@@ -1997,6 +1999,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
attr->sched_util_max, true);
}
+
+ /*
+ * Updating uclamp values has impact on freq, ensure it is taken into
+ * account.
+ */
+ if (task_current(rq, p))
+ update_cpufreq_ctx_switch(rq, NULL);

Do we care about updating the frequency here? p is dequeued during the __setscheduler_uclamp() call, so I think it's better to do this after the uclamp() call and after enqueue_task(), so that uclamp_rq_inc() comes into effect.

Also, do we want to limit the update to task_current()? Updating a uclamp_min of a task on this rq (even though it is not current) should raise the minimum OPP for the whole rq. An example is that if a uclamp_min task gets enqueued, the uclamp_min should kick in even if this task isn't immediately run and the current task isn't this task.

}
static void uclamp_fork(struct task_struct *p)
[...]

Next message: Gao Xiang: "Re: [PATCH] erofs: clean up erofs_show_options()"
Previous message: Krzysztof Wilczyński: "Re: [PATCH v2 1/2] PCI: endpoint: pci-epf-test: Make use of cached 'epc_features' in pci_epf_test_core_init()"
In reply to: Qais Yousef: "[PATCH v4] sched: Consolidate cpufreq updates"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]