Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group

From: Morten Rasmussen
Date: Mon Mar 16 2015 - 10:15:34 EST


On Fri, Mar 13, 2015 at 10:54:25PM +0000, Sai Gurrappadi wrote:
> On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> > For energy-aware load-balancing decisions it is necessary to know the
> > energy consumption estimates of groups of cpus. This patch introduces a
> > basic function, sched_group_energy(), which estimates the energy
> > consumption of the cpus in the group and any resources shared by the
> > members of the group.
> >
> > NOTE: The function has five levels of indentation and breaks the 80
> > character limit. Refactoring is necessary.
> >
> > cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> > ---
> > kernel/sched/fair.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 143 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 872ae0e..d12aa63 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4609,6 +4609,149 @@ static inline bool energy_aware(void)
> > return sched_feat(ENERGY_AWARE);
> > }
> >
> > +/*
> > + * cpu_norm_usage() returns the cpu usage relative to its current capacity,
> > + * i.e. its busy ratio, in the range [0..SCHED_LOAD_SCALE], which is useful for
> > + * energy calculations. Using the scale-invariant usage returned by
> > + * get_cpu_usage() and approximating scale-invariant usage by:
> > + *
> > + * usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
> > + *
> > + * the normalized usage can be found using capacity_curr.
> > + *
> > + * capacity_curr = capacity_orig * curr_freq/max_freq
> > + *
> > + * norm_usage = running_time/time ~ usage/capacity_curr
> > + */
> > +static inline unsigned long cpu_norm_usage(int cpu)
> > +{
> > + unsigned long capacity_curr = capacity_curr_of(cpu);
> > +
> > + return (get_cpu_usage(cpu) << SCHED_CAPACITY_SHIFT)/capacity_curr;
> > +}
> > +
> > +static unsigned group_max_usage(struct sched_group *sg)
> > +{
> > + int i;
> > + int max_usage = 0;
> > +
> > + for_each_cpu(i, sched_group_cpus(sg))
> > + max_usage = max(max_usage, get_cpu_usage(i));
> > +
> > + return max_usage;
> > +}
> > +
> > +/*
> > + * group_norm_usage() returns the approximated group usage relative to its
> > + * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
> > + * energy calculations. Since task executions may or may not overlap in time in
> > + * the group, the true normalized usage lies between max(cpu_norm_usage(i)) and
> > + * sum(cpu_norm_usage(i)), where i iterates over all cpus in the group. The
> > + * latter is used as the estimate as it leads to a more pessimistic energy
> > + * estimate (more busy).
> > + */
> > +static unsigned group_norm_usage(struct sched_group *sg)
> > +{
> > + int i;
> > + unsigned long usage_sum = 0;
> > +
> > + for_each_cpu(i, sched_group_cpus(sg))
> > + usage_sum += cpu_norm_usage(i);
> > +
> > + if (usage_sum > SCHED_CAPACITY_SCALE)
> > + return SCHED_CAPACITY_SCALE;
> > + return usage_sum;
> > +}
> > +
> > +static int find_new_capacity(struct sched_group *sg,
> > + struct sched_group_energy *sge)
> > +{
> > + int idx;
> > + unsigned long util = group_max_usage(sg);
> > +
> > + for (idx = 0; idx < sge->nr_cap_states; idx++) {
> > + if (sge->cap_states[idx].cap >= util)
> > + return idx;
> > + }
> > +
> > + return idx;
> > +}
> > +
> > +/*
> > + * sched_group_energy(): Returns the absolute energy consumption of the cpus
> > + * belonging to the sched_group, including resources shared only by members of
> > + * the group. Iterates over all cpus in the hierarchy below the sched_group,
> > + * starting from the bottom and working its way up before going to the next cpu
> > + * until all cpus are covered at all levels. The current implementation is
> > + * likely to gather the same usage statistics multiple times. This could
> > + * probably be done in a faster but more complex way.
> > + */
> > +static unsigned int sched_group_energy(struct sched_group *sg_top)
> > +{
> > + struct sched_domain *sd;
> > + int cpu, total_energy = 0;
> > + struct cpumask visit_cpus;
> > + struct sched_group *sg;
> > +
> > + WARN_ON(!sg_top->sge);
> > +
> > + cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> > +
> > + while (!cpumask_empty(&visit_cpus)) {
> > + struct sched_group *sg_shared_cap = NULL;
> > +
> > + cpu = cpumask_first(&visit_cpus);
> > +
> > + /*
> > + * Is the group utilization affected by cpus outside this
> > + * sched_group?
> > + */
> > + sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> > + if (sd && sd->parent)
> > + sg_shared_cap = sd->parent->groups;
>
> The above bit looks like it avoids supporting SD_SHARE_CAP_STATES for
> the top level sd (!sd->parent). Is it because there is no group that
> spans all the CPUs spanned by this sd? It seems like sg_cap is just
> being used as a proxy for the cpumask of CPUs to check for max_usage.

You are absolutely right. The current code is broken for system
topologies where all cpus share the same clock source. To be honest, it
is actually worse than that, and you have already pointed out the
reason: we don't have a way of representing top level contributions to
power consumption in RFCv3, as we don't have a sched_group spanning all
cpus in a single-cluster system. For example, we can't represent L2
cache and interconnect power consumption on such systems.

In RFCv2 we had a system-wide sched_group dangling by itself for that
purpose. We chose to remove it in this rewrite as it led to messy code.
In my opinion, a more elegant solution is to introduce an additional
sched_domain above the current top level which has a single sched_group
spanning all cpus in the system. That should fix the
SD_SHARE_CAP_STATES problem and allow us to attach power data for the
top level.
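
For illustration, something along these lines could give us that extra
level (untested sketch; cpu_sys_mask(), cpu_sys_flags() and the SYS name
are made up for the example):

	static inline const struct cpumask *cpu_sys_mask(int cpu)
	{
		/* one group spanning all cpus in the system */
		return cpu_possible_mask;
	}

	static inline int cpu_sys_flags(void)
	{
		return SD_SHARE_CAP_STATES;
	}

	static struct sched_domain_topology_level example_topology[] = {
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		/*
		 * New top level with a single group spanning all cpus,
		 * where system-wide (L2/interconnect) energy data can
		 * be attached.
		 */
		{ cpu_sys_mask, cpu_sys_flags, SD_INIT_NAME(SYS) },
		{ NULL, },
	};

The degenerate domain handling in build_sched_domains() would probably
need attention so the new level doesn't simply get collapsed into DIE
on single-cluster systems.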

It is on the todo list to add that extra sched_domain/group. In the
meantime, a workaround could be to use the domain mask
(sched_domain_span()) instead when checking max_usage, along the lines
of the sketch below.
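
A rough idea of what I mean (untested, and mask_max_usage() is just a
hypothetical cpumask-based variant of group_max_usage()):

	static unsigned mask_max_usage(const struct cpumask *mask)
	{
		int i;
		int max_usage = 0;

		for_each_cpu(i, mask)
			max_usage = max(max_usage, get_cpu_usage(i));

		return max_usage;
	}

and in sched_group_energy():

	const struct cpumask *cap_span = NULL;

	sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
	if (sd) {
		if (sd->parent)
			cap_span = sched_group_cpus(sd->parent->groups);
		else
			/*
			 * Top level: no parent group exists, so fall
			 * back to the span of the domain itself.
			 */
			cap_span = sched_domain_span(sd);
	}

find_new_capacity() would then take the cpumask instead of the
sched_group when looking up the capacity state.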

Thanks,
Morten
