Re: [RFC PATCH v3 03/10] PM: Introduce an Energy Model management framework

From: Dietmar Eggemann
Date: Wed Jun 06 2018 - 09:12:28 EST


On 05/21/2018 04:24 PM, Quentin Perret wrote:
> Several subsystems in the kernel (scheduler and/or thermal at the time
> of writing) can benefit from knowing about the energy consumed by CPUs.
> Yet, this information can come from different sources (DT or firmware for
> example), in different formats, hence making it hard to exploit without
> a standard API.
>
> This patch attempts to solve this issue by introducing a centralized
> Energy Model (EM) framework which can be used to interface the data
> providers with the client subsystems. This framework standardizes the
> API to expose power costs, and to access them from multiple locations.
>
> The current design assumes that all CPUs in a frequency domain share the
> same micro-architecture. As such, the EM data is structured in a
> per-frequency-domain fashion. Drivers aware of frequency domains
> (typically, but not limited to, CPUFreq drivers) are expected to register
> data in the EM framework using the em_register_freq_domain() API. To do
> so, the drivers must provide a callback function that will be called by
> the EM framework to populate the tables. As of today, only the active
> power of the CPUs is considered. For each frequency domain, the EM
> includes a list of <frequency, power, capacity> tuples for the capacity
> states of the domain alongside a cpumask covering the involved CPUs.
>
> The EM framework also provides an API to re-scale the capacity values
> of the model asynchronously, after it has been created. This is required
> for architectures where the capacity scale factor of CPUs can change at
> run-time. This is the case for Arm/Arm64 for example where the
> arch_topology driver recomputes the capacity scale factors of the CPUs
> after the maximum frequency of all CPUs has been discovered. Although
> complex, the process of creating and re-scaling the EM has to be kept in
> two separate steps to fulfill the needs of the different users. The thermal
> subsystem doesn't use the capacity values and shouldn't have dependencies
> on subsystems providing them. On the other hand, the task scheduler needs
> the capacity values, and it will benefit from seeing them up-to-date when
> applicable.
>
> Because of this need for asynchronous update, the capacity state table
> of each frequency domain is protected by RCU, hence guaranteeing a safe
> modification of the table and a fast access to readers in latency-sensitive
> code paths.
>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: "Rafael J. Wysocki" <rjw@xxxxxxxxxxxxx>
> Signed-off-by: Quentin Perret <quentin.perret@xxxxxxx>

[...]

> +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> +{
> +	unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> +	int max_cap_state = cs_table->nr_cap_states - 1;
> +	unsigned long fmax = cs_table->state[max_cap_state].frequency;
> +	int i;
> +
> +	for (i = 0; i < cs_table->nr_cap_states; i++)
> +		cs_table->state[i].capacity = cmax *
> +					cs_table->state[i].frequency / fmax;
> +}

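This has issues on a 32-bit system. The intermediate product
cmax * cs_table->state[i].frequency (unsigned long) overflows before the
division when the frequency values are stored in Hz, so the value written
to cs_table->state[i].capacity is wrong.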

Maybe something like this to cure it:

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 6ad53f1cf7e6..c13b3eb8bf35 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -144,9 +144,11 @@ static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
 	unsigned long fmax = cs_table->state[max_cap_state].frequency;
 	int i;
 
-	for (i = 0; i < cs_table->nr_cap_states; i++)
-		cs_table->state[i].capacity = cmax *
-					cs_table->state[i].frequency / fmax;
+	for (i = 0; i < cs_table->nr_cap_states; i++) {
+		u64 val = (u64)cmax * cs_table->state[i].frequency;
+		do_div(val, fmax);
+		cs_table->state[i].capacity = (unsigned long)val;
+	}
 }
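
To illustrate the wrap-around outside the kernel, here is a minimal
user-space sketch. It is not the patch itself: scale_cap_32() and
scale_cap_64() are hypothetical helper names, uint32_t stands in for a
32-bit unsigned long, and a plain 64-bit division stands in for do_div().
The sample values (cmax = 1024, frequencies of 1 GHz and 2 GHz in Hz) are
assumptions picked for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: mimics the original computation as it would run
 * with a 32-bit unsigned long. The product cmax * freq is reduced
 * modulo 2^32 before the division takes place.
 */
static uint32_t scale_cap_32(uint32_t cmax, uint32_t freq, uint32_t fmax)
{
	return cmax * freq / fmax;
}

/*
 * Hypothetical helper: mimics the proposed fix. The product is widened
 * to 64 bits first, so it cannot wrap for 32-bit inputs.
 */
static uint32_t scale_cap_64(uint32_t cmax, uint32_t freq, uint32_t fmax)
{
	uint64_t val = (uint64_t)cmax * freq;

	return (uint32_t)(val / fmax);
}
```

With cmax = 1024, freq = 1000000000 and fmax = 2000000000, the 64-bit
variant gives the expected 512, whereas the 32-bit variant does not,
since 1024 * 10^9 is far above 2^32. With the same ratio expressed in
smaller numbers (freq = 1000000, fmax = 2000000) the product fits in
32 bits and both variants agree.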

This brings me to another question. Let's say there are multiple users of
the Energy Model in the system. Shouldn't the units of frequency and power
be standardized, maybe MHz and mW?
The task scheduler doesn't care since it is only interested in power diffs,
but other users might.

[...]