[PATCH] sched/topology: Allow EAS without schedutil for artificial Energy Models
From: Lucas de Lima Nóbrega
Date: Mon Jun 29 2026 - 04:36:27 EST
The scheduler currently refuses to enable EAS on a root
domain unless schedutil is the active CPUFreq governor for all of its
CPUs (cpufreq_ready_for_eas()). This requirement exists to protect the
accuracy of the energy estimate: EAS predicts the OPP a CPU will run
at from its utilization, which is only meaningful if the active
governor actually requests OPPs that way, and schedutil is the only
one that does.
That rationale does not apply to artificial Energy Models
(EM_PERF_DOMAIN_ARTIFICIAL). An artificial EM is built from a
get_cost() callback instead of real power numbers, and only encodes a
cost ranking between CPUs (e.g. P-cores cost more than E-cores at a
given utilization). It never claims to predict real energy use at any
specific OPP, so there is no per-OPP accuracy for the governor
requirement to protect, regardless of which governor is in control or
whether it tracks utilization at all.
intel_pstate registers exactly this kind of artificial EM for hybrid
(P/E-core) systems without SMT, regardless of whether it operates in
active or passive mode. In active mode it never uses schedutil, since
HWP picks frequency autonomously, so on these systems EAS never
engages even though SD_ASYM_CPUCAPACITY, frequency invariance and the
EM are all in place: find_energy_efficient_cpu() is never reached
because is_rd_overutilized() unconditionally returns true whenever
sched_energy_enabled() is false. cppc_cpufreq registers the same kind
of ranking-only artificial EM and is affected the same way with any
non-schedutil governor.
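The gating described above can be modeled in userspace roughly as
follows. This is a simplified sketch and not the kernel code:
eas_enabled, rd_overutilized, model_is_rd_overutilized() and
wakeup_uses_feec() are stand-in names chosen for illustration, and the
real wakeup path checks further conditions before calling
find_energy_efficient_cpu().

```c
#include <stdbool.h>

/* Stand-in for the sched_energy_enabled() static key (illustrative only). */
static bool eas_enabled;

/* Stand-in for the root domain's overutilized state. */
static bool rd_overutilized;

/* Models is_rd_overutilized(): with EAS disabled it always reports true. */
static bool model_is_rd_overutilized(void)
{
	return !eas_enabled || rd_overutilized;
}

/*
 * Models the wakeup-path decision: the energy-aware placement is only
 * attempted while the root domain is not considered overutilized.
 */
static bool wakeup_uses_feec(void)
{
	return !model_is_rd_overutilized();
}
```

In this model, wakeup_uses_feec() can only return true once eas_enabled
is set, which is why the governor check ends up deciding whether the
energy-aware path is ever taken on these systems.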
Allow EAS to be enabled when the EM of every CPU in the root domain is
artificial, even if schedutil is not the active governor.
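As a rough illustration of the relaxed condition, the following
userspace sketch models it with plain booleans. struct cpu_em_model,
all_ems_artificial() and governor_check_passes() are hypothetical names
used only here; the kernel change itself uses em_cpu_get() and
em_is_artificial() on the CPUs of the root domain, as shown in the diff
below.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU summary of the Energy Model state. */
struct cpu_em_model {
	bool has_em;     /* false models em_cpu_get() returning NULL */
	bool artificial; /* models EM_PERF_DOMAIN_ARTIFICIAL being set */
};

/* True only if every CPU has an EM and that EM is artificial. */
static bool all_ems_artificial(const struct cpu_em_model *cpus, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!cpus[i].has_em || !cpus[i].artificial)
			return false;
	}
	return true;
}

/*
 * Models the relaxed governor check: it passes if schedutil drives all
 * CPUs, or if all of their EMs are artificial.
 */
static bool governor_check_passes(bool schedutil_everywhere,
				  const struct cpu_em_model *cpus, size_t n)
{
	return schedutil_everywhere || all_ems_artificial(cpus, n);
}
```

Note that a single CPU with a real power-based EM, or with no EM at all,
is enough to keep the schedutil requirement in force for the whole root
domain.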
Tested on a Raptor Lake-P laptop with nosmt=force and intel_pstate in
active/HWP mode: find_energy_efficient_cpu() was never called before
this change (confirmed via the sched_overutilized_tp tracepoint and
ftrace) and is exercised as expected afterwards.
Signed-off-by: Lucas de Lima Nóbrega <lucaslnobrega38@xxxxxxxxx>
---
Documentation/admin-guide/pm/intel_pstate.rst | 9 ++++--
Documentation/scheduler/sched-energy.rst | 7 ++++-
kernel/sched/topology.c | 28 +++++++++++++++++--
3 files changed, 38 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index 25fe5d88f..c8fef1e60 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -409,13 +409,16 @@ Energy-Aware Scheduling Support
If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and
``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling
:ref:`CAS` it registers an Energy Model for the processor. This allows the
-Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if
-``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate``
-to operate in the :ref:`passive mode <passive_mode>`.
+Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler.
The Energy Model registered by ``intel_pstate`` is artificial (that is, it is
based on abstract cost values and it does not include any real power numbers)
and it is relatively simple to avoid unnecessary computations in the scheduler.
+Because the model is artificial, EAS does not require ``schedutil`` to be used
+as the ``CPUFreq`` governor in this case: the cost ranking it relies on does not
+depend on the governor tracking utilization when requesting frequencies, so
+EAS works the same way regardless of whether ``intel_pstate`` operates in the
+active or in the :ref:`passive mode <passive_mode>`.
There is a performance domain in it for every CPU in the system and the cost
values for these performance domains have been chosen so that running a task on
a less performant (small) CPU appears to be always cheaper than running that
diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst
index 4e47aaf10..c23ca226d 100644
--- a/Documentation/scheduler/sched-energy.rst
+++ b/Documentation/scheduler/sched-energy.rst
@@ -379,7 +379,12 @@ Consequently, the only sane governor to use together with EAS is schedutil,
because it is the only one providing some degree of consistency between
frequency requests and energy predictions.
-Using EAS with any other governor than schedutil is not supported.
+Using EAS with any other governor than schedutil is not supported, unless the
+EM in use is artificial (see EM_PERF_DOMAIN_ARTIFICIAL). An artificial EM only
+encodes a cost ranking between CPUs/OPPs instead of a real power table, so it
+does not make any claim about energy use at a specific OPP and its conclusions
+do not depend on the governor actually tracking utilization when requesting
+frequencies.
6.5 Scale-invariant utilization signals
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9..124a4bb4d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -212,6 +212,27 @@ static unsigned int sysctl_sched_energy_aware = 1;
static DEFINE_MUTEX(sched_energy_mutex);
static bool sched_energy_update;
+/*
+ * An artificial EM (see EM_PERF_DOMAIN_ARTIFICIAL) only encodes a cost
+ * ranking between CPUs and does not claim to predict energy use at any
+ * particular OPP. Unlike a real power-based EM, its conclusions do not
+ * rely on the active governor tracking utilization when selecting
+ * frequencies, so the schedutil requirement below does not apply to it.
+ */
+static bool perf_domains_are_artificial(const struct cpumask *cpu_mask)
+{
+ int i;
+
+ for_each_cpu(i, cpu_mask) {
+ struct em_perf_domain *pd = em_cpu_get(i);
+
+ if (!pd || !em_is_artificial(pd))
+ return false;
+ }
+
+ return true;
+}
+
static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
{
bool any_asym_capacity = false;
@@ -249,7 +270,8 @@ static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
return false;
}
- if (!cpufreq_ready_for_eas(cpu_mask)) {
+ if (!cpufreq_ready_for_eas(cpu_mask) &&
+ !perf_domains_are_artificial(cpu_mask)) {
if (sched_debug()) {
pr_info("rd %*pbl: Checking EAS: cpufreq is not ready\n",
cpumask_pr_args(cpu_mask));
@@ -403,7 +425,9 @@ static void sched_energy_set(bool has_eas)
* 1. an Energy Model (EM) is available;
* 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
* 3. no SMT is detected.
- * 4. schedutil is driving the frequency of all CPUs of the rd;
+ * 4. schedutil is driving the frequency of all CPUs of the rd, or the EM
+ * of all of them is artificial (i.e. a cost ranking rather than a
+ * real power table, see EM_PERF_DOMAIN_ARTIFICIAL);
* 5. frequency invariance support is present;
*/
static bool build_perf_domains(const struct cpumask *cpu_map)
--
2.54.0