Re: [PATCH] sched/topology: Allow EAS without schedutil for artificial Energy Models

From: Rafael J. Wysocki (Intel)

Date: Mon Jun 29 2026 - 11:22:51 EST


On Mon, Jun 29, 2026 at 10:36 AM Lucas de Lima Nóbrega
<lucaslnobrega38@xxxxxxxxx> wrote:
>
> EAS currently refuses to enable energy-aware scheduling on a root
> domain unless schedutil is the active CPUFreq governor for all of its
> CPUs (cpufreq_ready_for_eas()). This requirement exists to protect the
> accuracy of the energy estimate: EAS predicts the OPP a CPU will run
> at from its utilization, which is only meaningful if the active
> governor actually requests OPPs that way, and schedutil is the only
> one that does.
>
> That requirement does not apply to artificial Energy Models
> (EM_PERF_DOMAIN_ARTIFICIAL). An artificial EM is built from a
> get_cost() callback instead of real power numbers, and only encodes a
> cost ranking between CPUs (e.g. P-cores cost more than E-cores at a
> given utilization). It never claims to predict real energy use at any
> specific OPP, so there is no per-OPP accuracy for the governor
> requirement to protect, regardless of which governor is in control or
> whether it tracks utilization at all.

But it is still about comparing the cost of running on different CPUs
at different performance levels.

For instance, say the scale-invariant utilization of a task is 256 and
it can run either by itself on a P-core or on an E-core together with
another task whose utilization is 128, and say the capacities of the
P-core and the E-core are 1024 and 512, respectively.

Say the cost function tells EAS that running a P-core at 1/4 of its
capacity (256/1024) is cheaper than running an E-core at 3/4 of its
capacity (384/512), so EAS picks the P-core to run that task. If
cpufreq then ramps up the frequency of the P-core to the max when the
task gets to it, running it there may actually turn out to be more
expensive.

This means that EAS still has an expectation regarding cpufreq, which
is that it will generally tend to run tasks at the performance level
corresponding, at least roughly, to the sum of their scale-invariant
utilization.
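
To make the arithmetic concrete, here is a toy sketch of the comparison
in plain C. It is not kernel code: the toy_* names and the two cost
coefficients are made up for illustration, and the linear cost function
is only an assumption standing in for whatever get_cost() returns.

```c
#include <assert.h>

/* Capacity scale for scale-invariant utilization (1024 == CPU maximum). */
#define TOY_CAPACITY_SCALE 1024UL

/* Hypothetical per-unit cost coefficients, made up for this example only. */
#define TOY_P_CORE_COEFF 2UL
#define TOY_E_CORE_COEFF 1UL

/*
 * Performance level a CPU needs in order to serve util_sum, assuming
 * cpufreq picks a level proportional to utilization.
 */
static unsigned long toy_perf_level(unsigned long util_sum, unsigned long capacity)
{
	assert(capacity > 0);
	return util_sum * TOY_CAPACITY_SCALE / capacity;
}

/* Toy cost: grows linearly with the performance level actually used. */
static unsigned long toy_cost(unsigned long perf_level, unsigned long coeff)
{
	return perf_level * coeff;
}
```

With these made-up numbers, the P-core at level 256 costs 512 and the
E-core at level 768 costs 768, so the P-core looks cheaper. If the
P-core actually runs at level 1024, though, its cost becomes 2048 and
the prediction no longer holds.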

IIUC this actually has nothing to do with whether or not the energy
model used by EAS is artificial. The schedutil requirement is about
choosing a performance level proportional to the utilization (which
schedutil generally tends to do by design).

> intel_pstate registers exactly this kind of artificial EM for hybrid
> (P/E-core) systems without SMT, regardless of whether it operates in
> active or passive mode. In active mode it never uses schedutil, since
> HWP picks frequency autonomously, so on these systems EAS never
> engages even though SD_ASYM_CPUCAPACITY, frequency invariance and the
> EM are all in place: find_energy_efficient_cpu() is never reached
> because is_rd_overutilized() is hardcoded to true whenever
> sched_energy_enabled() is false. cppc_cpufreq registers the same kind
> of ranking-only artificial EM and is affected the same way with any
> non-schedutil governor.
>
> Allow EAS to be enabled when every CPU's EM in the root domain is
> artificial, even when schedutil is not the active governor.
>
> Tested on a Raptor Lake-P laptop with nosmt=force and intel_pstate in
> active/HWP mode: find_energy_efficient_cpu() was never called before
> this change (confirmed via the sched_overutilized_tp tracepoint and
> ftrace) and is exercised as expected afterwards.

If this is about allowing EAS to work with intel_pstate running in the
active mode, you may argue that what the processor firmware is doing
in that mode is not much different from what schedutil would do. In
that case, a driver implementing an internal governor (that is, using
the .setpolicy() callback) would need to declare that its internal
governor is as good as schedutil from EAS' perspective, so that it
passes the "cpufreq readiness" check.
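
As a rough illustration of that idea only, the readiness check could be
modeled like the standalone sketch below. Everything in it is
hypothetical: struct toy_policy and the internal_gov_eas_ready field
are invented for this example and are not existing cpufreq interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cpufreq policy; not the kernel's struct cpufreq_policy. */
struct toy_policy {
	bool uses_schedutil;		/* schedutil is the active governor */
	bool internal_gov_eas_ready;	/* hypothetical driver declaration */
};

/* A policy is ready if schedutil drives it or the driver vouches for itself. */
static bool toy_policy_ready_for_eas(const struct toy_policy *p)
{
	return p->uses_schedutil || p->internal_gov_eas_ready;
}

/* All policies covering the root domain have to pass the check. */
static bool toy_cpufreq_ready_for_eas(const struct toy_policy *policies, size_t n)
{
	assert(policies != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!toy_policy_ready_for_eas(&policies[i]))
			return false;
	}

	return true;
}
```

The point of this shape is that the decision stays tied to how
performance levels get selected, not to the kind of energy model in use.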

> Signed-off-by: Lucas de Lima Nóbrega <lucaslnobrega38@xxxxxxxxx>
> ---
> Documentation/admin-guide/pm/intel_pstate.rst | 9 ++++--
> Documentation/scheduler/sched-energy.rst | 7 ++++-
> kernel/sched/topology.c | 28 +++++++++++++++++--
> 3 files changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
> index 25fe5d88f..c8fef1e60 100644
> --- a/Documentation/admin-guide/pm/intel_pstate.rst
> +++ b/Documentation/admin-guide/pm/intel_pstate.rst
> @@ -409,13 +409,16 @@ Energy-Aware Scheduling Support
> If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and
> ``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling
> :ref:`CAS` it registers an Energy Model for the processor. This allows the
> -Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if
> -``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate``
> -to operate in the :ref:`passive mode <passive_mode>`.
> +Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler.
>
> The Energy Model registered by ``intel_pstate`` is artificial (that is, it is
> based on abstract cost values and it does not include any real power numbers)
> and it is relatively simple to avoid unnecessary computations in the scheduler.
> +Because of that, EAS does not require ``schedutil`` to be used as the
> +``CPUFreq`` governor in this case: the cost ranking it relies on does not
> +depend on the governor tracking utilization when requesting frequencies, so
> +EAS works the same way regardless of whether ``intel_pstate`` operates in the
> +active or in the :ref:`passive mode <passive_mode>`.
> There is a performance domain in it for every CPU in the system and the cost
> values for these performance domains have been chosen so that running a task on
> a less performant (small) CPU appears to be always cheaper than running that
> diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst
> index 4e47aaf10..c23ca226d 100644
> --- a/Documentation/scheduler/sched-energy.rst
> +++ b/Documentation/scheduler/sched-energy.rst
> @@ -379,7 +379,12 @@ Consequently, the only sane governor to use together with EAS is schedutil,
> because it is the only one providing some degree of consistency between
> frequency requests and energy predictions.
>
> -Using EAS with any other governor than schedutil is not supported.
> +Using EAS with any other governor than schedutil is not supported, unless the
> +EM in use is artificial (see EM_PERF_DOMAIN_ARTIFICIAL). An artificial EM only
> +encodes a cost ranking between CPUs/OPPs instead of a real power table, so it
> +does not make any claim about energy use at a specific OPP and its conclusions
> +do not depend on the governor actually tracking utilization when requesting
> +frequencies.
>
>
> 6.5 Scale-invariant utilization signals
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9..124a4bb4d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -212,6 +212,27 @@ static unsigned int sysctl_sched_energy_aware = 1;
> static DEFINE_MUTEX(sched_energy_mutex);
> static bool sched_energy_update;
>
> +/*
> + * An artificial EM (see EM_PERF_DOMAIN_ARTIFICIAL) only encodes a cost
> + * ranking between CPUs and does not claim to predict energy use at any
> + * particular OPP. Unlike a real power-based EM, its conclusions do not
> + * rely on the active governor tracking utilization when selecting
> + * frequencies, so the schedutil requirement below does not apply to it.
> + */
> +static bool perf_domains_are_artificial(const struct cpumask *cpu_mask)
> +{
> + int i;
> +
> + for_each_cpu(i, cpu_mask) {
> + struct em_perf_domain *pd = em_cpu_get(i);

I would do

	if (!pd)
		continue;

here because the CPUs without a PD simply don't matter.

Also, is any synchronization needed for this?

And should it go into the EM code?
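
For reference, the loop shape with the NULL check split out can be
sketched as the standalone model below. It is not the patch itself:
struct toy_pd and the array of pointers are assumptions standing in for
struct em_perf_domain and em_cpu_get(), so it can be compiled outside
the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a performance domain; not struct em_perf_domain. */
struct toy_pd {
	bool artificial;
};

/* pds[i] is the domain of CPU i, or NULL if that CPU has none. */
static bool toy_perf_domains_are_artificial(const struct toy_pd *const *pds, size_t n)
{
	assert(pds != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct toy_pd *pd = pds[i];

		/* CPUs without a PD do not matter for this check. */
		if (!pd)
			continue;

		if (!pd->artificial)
			return false;
	}

	return true;
}
```

Note that in this form a mask in which no CPU has a PD yields true, so
the caller would have to establish separately that an EM exists at all.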

> +
> + if (!pd || !em_is_artificial(pd))
> + return false;
> + }
> +
> + return true;
> +}
> +
> static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> {
> bool any_asym_capacity = false;
> @@ -249,7 +270,8 @@ static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> return false;
> }
>
> - if (!cpufreq_ready_for_eas(cpu_mask)) {
> + if (!cpufreq_ready_for_eas(cpu_mask) &&
> + !perf_domains_are_artificial(cpu_mask)) {
> if (sched_debug()) {
> pr_info("rd %*pbl: Checking EAS: cpufreq is not ready\n",
> cpumask_pr_args(cpu_mask));
> @@ -403,7 +425,9 @@ static void sched_energy_set(bool has_eas)
> * 1. an Energy Model (EM) is available;
> * 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
> * 3. no SMT is detected.
> - * 4. schedutil is driving the frequency of all CPUs of the rd;
> + * 4. schedutil is driving the frequency of all CPUs of the rd, or the EM
> + * of all of them is artificial (i.e. a cost ranking rather than a
> + * real power table, see EM_PERF_DOMAIN_ARTIFICIAL);
> * 5. frequency invariance support is present;
> */
> static bool build_perf_domains(const struct cpumask *cpu_map)
> --
> 2.54.0
>