Re: [RFC PATCH v3 03/10] PM: Introduce an Energy Model management framework

From: Quentin Perret
Date: Thu Jun 07 2018 - 11:20:12 EST

Next message: Ben Hutchings: "[PATCH 3.16 065/410] perf evlist: Introduce perf_evlist__new_dummy constructor"
Previous message: Ben Hutchings: "[PATCH 3.16 051/410] KVM: nVMX: mark vmcs12 pages dirty on L2 exit"
In reply to: Juri Lelli: "Re: [RFC PATCH v3 03/10] PM: Introduce an Energy Model management framework"
Next in thread: Dietmar Eggemann: "Re: [RFC PATCH v3 03/10] PM: Introduce an Energy Model management framework"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Juri,

On Thursday 07 Jun 2018 at 16:44:09 (+0200), Juri Lelli wrote:
> On 21/05/18 15:24, Quentin Perret wrote:
> > Several subsystems in the kernel (scheduler and/or thermal at the time
> > of writing) can benefit from knowing about the energy consumed by CPUs.
> > Yet, this information can come from different sources (DT or firmware for
> > example), in different formats, hence making it hard to exploit without
> > a standard API.
> >
> > This patch attempts to solve this issue by introducing a centralized
> > Energy Model (EM) framework which can be used to interface the data
> > providers with the client subsystems. This framework standardizes the
> > API to expose power costs, and to access them from multiple locations.
> >
> > The current design assumes that all CPUs in a frequency domain share the
> > same micro-architecture. As such, the EM data is structured in a
> > per-frequency-domain fashion. Drivers aware of frequency domains
> > (typically, but not limited to, CPUFreq drivers) are expected to register
> > data in the EM framework using the em_register_freq_domain() API. To do
> > so, the drivers must provide a callback function that will be called by
> > the EM framework to populate the tables. As of today, only the active
> > power of the CPUs is considered. For each frequency domain, the EM
> > includes a list of <frequency, power, capacity> tuples for the capacity
> > states of the domain alongside a cpumask covering the involved CPUs.
> >
> > The EM framework also provides an API to re-scale the capacity values
> > of the model asynchronously, after it has been created. This is required
> > for architectures where the capacity scale factor of CPUs can change at
> > run-time. This is the case for Arm/Arm64 for example where the
> > arch_topology driver recomputes the capacity scale factors of the CPUs
> > after the maximum frequency of all CPUs has been discovered. Although
> > complex, the process of creating and re-scaling the EM has to be kept in
> > two separate steps to fulfill the needs of the different users. The thermal
> > subsystem doesn't use the capacity values and shouldn't have dependencies
> > on subsystems providing them. On the other hand, the task scheduler needs
> > the capacity values, and it will benefit from seeing them up-to-date when
> > applicable.
> >
> > Because of this need for asynchronous update, the capacity state table
> > of each frequency domain is protected by RCU, hence guaranteeing a safe
> > modification of the table and a fast access to readers in latency-sensitive
> > code paths.
> >
> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Cc: "Rafael J. Wysocki" <rjw@xxxxxxxxxxxxx>
> > Signed-off-by: Quentin Perret <quentin.perret@xxxxxxx>
> > ---
>
> OK, I think I'll start with a few comments while I get more into
> understanding the set. :)

:-)

>
> > +static void fd_update_cs_table(struct em_cs_table *cs_table, int cpu)
> > +{
> > + unsigned long cmax = arch_scale_cpu_capacity(NULL, cpu);
> > + int max_cap_state = cs_table->nr_cap_states - 1;
> ^
> You don't need this on the stack, right?

Oh, why not ?

>
> > + unsigned long fmax = cs_table->state[max_cap_state].frequency;
> > + int i;
> > +
> > + for (i = 0; i < cs_table->nr_cap_states; i++)
> > + cs_table->state[i].capacity = cmax *
> > + cs_table->state[i].frequency / fmax;
> > +}
> > +
> > +static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
> > + struct em_data_callback *cb)
> > +{
> > + unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
> > + int i, ret, cpu = cpumask_first(span);
> > + struct em_freq_domain *fd;
> > + unsigned long power, freq;
> > +
> > + if (!cb->active_power)
> > + return NULL;
> > +
> > + fd = kzalloc(sizeof(*fd), GFP_KERNEL);
> > + if (!fd)
> > + return NULL;
> > +
> > + fd->cs_table = alloc_cs_table(nr_states);
>
> Mmm, don't you need to rcu_assign_pointer this first one as well?

Hmmmm, nobody can be using this at this point, but yes, it'd be better
to keep that consistent I suppose ...

>
> > + if (!fd->cs_table)
> > + goto free_fd;
> > +
> > + /* Copy the span of the frequency domain */
> > + cpumask_copy(&fd->cpus, span);
> > +
> > + /* Build the list of capacity states for this freq domain */
> > + for (i = 0, freq = 0; i < nr_states; i++, freq++) {
> ^ ^
> The fact that this relies on active_power() to use ceil OPP for a given
> freq might deserve a comment. Also, is this behaviour of active_power()
> standardized?

Right, this can get confusing pretty quickly. There is a comment in
include/linux/energy_model.h where the expected behaviour of
active_power is explained, but a reminder above this function shouldn't
hurt.

>
> > + ret = cb->active_power(&power, &freq, cpu);
> > + if (ret)
> > + goto free_cs_table;
> > +
> > + fd->cs_table->state[i].power = power;
> > + fd->cs_table->state[i].frequency = freq;
> > +
> > + /*
> > + * The hertz/watts efficiency ratio should decrease as the
> > + * frequency grows on sane platforms. If not, warn the user
> > + * that some high OPPs are more power efficient than some
> > + * of the lower ones.
> > + */
> > + opp_eff = freq / power;
> > + if (opp_eff >= prev_opp_eff)
> > + pr_warn("%*pbl: hz/watt efficiency: OPP %d >= OPP%d\n",
> > + cpumask_pr_args(span), i, i - 1);
> > + prev_opp_eff = opp_eff;
> > + }
> > + fd_update_cs_table(fd->cs_table, cpu);
> > +
> > + return fd;
> > +
> > +free_cs_table:
> > + free_cs_table(fd->cs_table);
> > +free_fd:
> > + kfree(fd);
> > +
> > + return NULL;
> > +}
> > +
> > +static void rcu_free_cs_table(struct rcu_head *rp)
> > +{
> > + struct em_cs_table *table;
> > +
> > + table = container_of(rp, struct em_cs_table, rcu);
> > + free_cs_table(table);
> > +}
> > +
> > +/**
> > + * em_rescale_cpu_capacity() - Re-scale capacity values of the Energy Model
> > + *
> > + * This re-scales the capacity values for all capacity states of all frequency
> > + * domains of the Energy Model. This should be used when the capacity values
> > + * of the CPUs are updated at run-time, after the EM was registered.
> > + */
> > +void em_rescale_cpu_capacity(void)
>
> So, is this thought to be called eventually also after thermal capping
> events and such?

The true reason is that the frequency domains will typically be
registered in the EM framework _before_ the arch_topology driver kicks
in on arm/arm64. That means that the EM tables are created, and only
after, the cpu capacities are updated. So we basically need to update
those capacities to be up-to-date.

The reason we need to keep those two steps separate (registering the
freq domains and re-scaling the capacities) in the EM framework is
because thermal doesn't care about the cpu capacities. It is a perfectly
acceptable configuration to use IPA without having dmips-capacity-mhz
values in the DT for ex.

Now, since we have a RCU protection on the EM tables, we might decide in
the future to use the opportunity to modify the tables at run-time for
other reasons. Thermal capping could be one I guess.

>
> > +{
> > + struct em_cs_table *old_table, *new_table;
> > + struct em_freq_domain *fd;
> > + unsigned long flags;
> > + int nr_states, cpu;
> > +
> > + read_lock_irqsave(&em_data_lock, flags);
>
> Don't you need write_lock_ here, since you are going to exchange the
> em tables?

This lock protects the per_cpu() variable itself. Here we only read
pointers from that per_cpu variable, and we modify one attribute in
the pointed structure. We don't modify the per_cpu table itself. Does
that make sense ?

>
> > + for_each_cpu(cpu, cpu_possible_mask) {
> > + fd = per_cpu(em_data, cpu);
> > + if (!fd || cpu != cpumask_first(&fd->cpus))
> > + continue;
> > +
> > + /* Copy the existing table. */
> > + old_table = rcu_dereference(fd->cs_table);
> > + nr_states = old_table->nr_cap_states;
> > + new_table = alloc_cs_table(nr_states);
> > + if (!new_table) {
> > + read_unlock_irqrestore(&em_data_lock, flags);
> > + return;
> > + }
> > + memcpy(new_table->state, old_table->state,
> > + nr_states * sizeof(*new_table->state));
> > +
> > + /* Re-scale the capacity values on the copy. */
> > + fd_update_cs_table(new_table, cpumask_first(&fd->cpus));
> > +
> > + /* Replace the table with the rescaled version. */
> > + rcu_assign_pointer(fd->cs_table, new_table);
> > + call_rcu(&old_table->rcu, rcu_free_cs_table);
> > + }
> > + read_unlock_irqrestore(&em_data_lock, flags);
> > + pr_debug("Re-scaled CPU capacities\n");
> > +}
> > +EXPORT_SYMBOL_GPL(em_rescale_cpu_capacity);
> > +
> > +/**
> > + * em_cpu_get() - Return the frequency domain for a CPU
> > + * @cpu : CPU to find the frequency domain for
> > + *
> > + * Return: the frequency domain to which 'cpu' belongs, or NULL if it doesn't
> > + * exist.
> > + */
> > +struct em_freq_domain *em_cpu_get(int cpu)
> > +{
> > + struct em_freq_domain *fd;
> > + unsigned long flags;
> > +
> > + read_lock_irqsave(&em_data_lock, flags);
> > + fd = per_cpu(em_data, cpu);
> > + read_unlock_irqrestore(&em_data_lock, flags);
> > +
> > + return fd;
> > +}
> > +EXPORT_SYMBOL_GPL(em_cpu_get);
>
> Mmm, this gets complicated pretty fast eh? :)

Yeah, hopefully I'll be able to explain/clarify that :-).

>
> I had to go back and forth between patches to start understanding the
> different data structures and how they are use, and I'm not sure yet
> I've got the full picture. I guess some nice diagram (cover letter or
> documentation patch) would help a lot.

Right, so I'd like very much to write a nice documentation patch once we
are more or less OK with the overall design of this framework, but I
felt like it was a little bit early for that. If we finally decide that
what I did is totally stupid and that it'd be better to do things
completely differently, my nice documentation patch would be a lot of
efforts for nothing.

But I agree that at the same time all this complex code has to be
explained. Hopefully the existing comments can help with that.
Otherwise, I'm more than happy to answer all questions :-)

>
> Locking of such data structures is pretty involved as well, adding
> comments/docs shouldn't harm. :)

Message received. If I do need to come-up with a brand new
design/implementation for v4, I'll make sure to add more comments.

> Best,
>
> - Juri

Thanks !
Quentin

Next message: Ben Hutchings: "[PATCH 3.16 065/410] perf evlist: Introduce perf_evlist__new_dummy constructor"
Previous message: Ben Hutchings: "[PATCH 3.16 051/410] KVM: nVMX: mark vmcs12 pages dirty on L2 exit"
In reply to: Juri Lelli: "Re: [RFC PATCH v3 03/10] PM: Introduce an Energy Model management framework"
Next in thread: Dietmar Eggemann: "Re: [RFC PATCH v3 03/10] PM: Introduce an Energy Model management framework"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]