Re: [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model

From: Morten Rasmussen
Date: Thu Jun 05 2014 - 07:35:52 EST


On Thu, Jun 05, 2014 at 09:49:35AM +0100, Vincent Guittot wrote:
> Hi Morten,
>
> On 23 May 2014 20:16, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
> > This documentation patch provides a brief overview of the experimental
> > scheduler energy costing model and associated data structures.
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> > ---
> > Documentation/scheduler/sched-energy.txt | 66 ++++++++++++++++++++++++++++++
> > 1 file changed, 66 insertions(+)
> > create mode 100644 Documentation/scheduler/sched-energy.txt
> >
> > diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
> > new file mode 100644
> > index 0000000..c6896c0
> > --- /dev/null
> > +++ b/Documentation/scheduler/sched-energy.txt
> > @@ -0,0 +1,66 @@
> > +Energy cost model for energy-aware scheduling (EXPERIMENTAL)
> > +
> > +Introduction
> > +=============
> > +The basic energy model uses platform energy data stored in sched_energy data
> > +structures attached to the sched_groups in the sched_domain hierarchy. The
> > +energy cost model offers two functions that can be used to guide scheduling
> > +decisions:
> > +
> > +1. energy_diff_util(cpu, util, wakeups)
>
> Could you give us more details of what util and wakeups are?
> Is util an absolute value or a delta?
> Is wakeups a boolean or does wakeups define a number of tasks/cpus
> that wake up ?

Good point... It is not clear at all. Improving the documentation is at
the top of my todo list.

cpu: The cpu in question.

util: Is a signed utilization delta. That is, the amount of utilization
we want to add to or remove from the cpu. We don't have a good metric for
utilization yet (I assume you have followed the thread on that topic
that started from your recent patch posting), so for now I have used
load_avg_contrib. energy_diff_task() just passes the task's
load_avg_contrib as the utilization to energy_diff_util().

wakeups: Is the number of wakeups (task enqueues, not idle exits) caused
by the utilization we are about to add or remove from the cpu. We need
to pick some period to measure the wakeups over. For that I have
introduced task wakeup tracking, very similar to the existing load tracking.
The wakeup tracking gives us an indication of how often a task will
cause an idle exit if it ran alone on a cpu. For short but frequently
running tasks, the wakeup cost may be where the majority of the energy
is spent.
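
To make "very similar to the existing load tracking" a bit more
concrete, here is a minimal sketch of a geometrically decayed per-period
wakeup count. This is not the code from the patch set: the struct, the
function name, the 1024 scaling and the reuse of the load-tracking decay
(y^32 ~= 0.5, approximated as 1002/1024) are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed decay factor y ~= 1002/1024, chosen so that y^32 ~= 0.5. */
#define DECAY_NUM	1002ULL
#define DECAY_SHIFT	10

/* Hypothetical per-task tracking state, not a structure from the patch. */
struct wakeup_avg_sketch {
	uint64_t sum;	/* decayed wakeup count, scaled by 1024 */
};

/*
 * Close one tracking period: age the history by one step, then add the
 * wakeups (task enqueues) seen in the period that just ended.
 */
static void wakeup_avg_update(struct wakeup_avg_sketch *wa,
			      unsigned int wakeups)
{
	assert(wa != NULL);
	wa->sum = (wa->sum * DECAY_NUM) >> DECAY_SHIFT;
	wa->sum += (uint64_t)wakeups << DECAY_SHIFT;
}
```

With this kind of decay, a task that stops waking up sees its tracked
value drop to roughly half after 32 periods, so recent behaviour
dominates the estimate.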

>
> > +2. energy_diff_task(cpu, task)
> > +
> > +Both return the energy cost delta caused by adding/removing utilization or a
> > +task to/from a specific cpu.
> > +
> > +CONFIG_SCHED_ENERGY needs to be defined in Kconfig to enable the energy cost
> > +model and associated data structures.
> > +
> > +The basic algorithm
> > +====================
> > +The basic idea is to determine the energy cost at each level in sched_domain
> > +hierarchy based on utilization:
> > +
> > +	for_each_domain(cpu, sd) {
> > +		sg = sched_group_of(cpu)
> > +		energy_before = curr_util(sg) * busy_power(sg)
> > +				+ (1-curr_util(sg)) * idle_power(sg)
> > +		energy_after = new_util(sg) * busy_power(sg)
> > +				+ (1-new_util(sg)) * idle_power(sg)
> > +				+ new_util(sg) * task_wakeups
> > +					* wakeup_energy(sg)
> > +		energy_diff += energy_before - energy_after
> > +	}
> > +
> > + return energy_diff
>
> So this is the algorithm used in energy_diff_util and energy_diff_task ?

It is. energy_diff_task() is basically just a wrapper for
energy_diff_util().

> It's not straightforward for me to map the algorithm variables to the
> function arguments

The pseudo-code above is very simplified. It is an attempt to show that
the algorithm goes up the sched_domain hierarchy and estimates the
energy impact of adding/removing 'util' amount of utilization to/from
the cpu.

{curr, new}_util is the cpu utilization at the lowest level and
the overall non-idle time for the entire group at higher levels.
Utilization is in the range 0.0 to 1.0.

busy_power is the power consumption of the group while busy (for TC2,
cpu at the lowest level, cluster at the next).

idle_power is the power consumption of the group while idle (for TC2,
WFI at the lowest level, cluster power down at cluster level).

task_wakeups (should have been just 'wakeups' in the general case) is the
number of wakeups caused by the utilization we are adding/removing. To
predict how many of the wakeups cause idle exits, we scale the
number by the utilization (assuming that wakeups are uniformly
distributed). wakeup_energy is the energy consumed by an idle
exit/entry cycle for the group (for TC2, WFI at the lowest level, cluster
power down at cluster level).

At each level we need to compute the energy before and after the change
to find the energy delta.
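
As a rough illustration of the per-level computation, here is a
self-contained sketch in plain C. It is not the patch code: struct
level_energy, energy_diff_sketch() and UTIL_SCALE are made-up names, the
hierarchy walk is replaced by a flat array with one entry per level, and
utilization is expressed in fixed point (0..1024 instead of 0.0..1.0) to
avoid floating point. It reads the idle term as (1 - util) * idle_power
and keeps the before-minus-after sign of the pseudo-code. The result is
scaled by UTIL_SCALE.

```c
#include <assert.h>
#include <stddef.h>

#define UTIL_SCALE 1024L	/* assumed fixed-point unit: 1.0 == 1024 */

/* Hypothetical per-level energy data, one entry per sched_domain level. */
struct level_energy {
	long busy_power;	/* power of the group while busy */
	long idle_power;	/* power of the group while idle */
	long wakeup_energy;	/* energy of one idle exit/entry cycle */
};

/*
 * Sum the energy delta over all levels. curr_util[i] and new_util[i] are
 * the group utilization at level i before and after the change.
 */
static long energy_diff_sketch(const struct level_energy *levels,
			       const long *curr_util, const long *new_util,
			       size_t nr_levels, long wakeups)
{
	long diff = 0;

	for (size_t i = 0; i < nr_levels; i++) {
		const struct level_energy *e = &levels[i];

		assert(curr_util[i] >= 0 && curr_util[i] <= UTIL_SCALE);
		assert(new_util[i] >= 0 && new_util[i] <= UTIL_SCALE);

		long before = curr_util[i] * e->busy_power
			    + (UTIL_SCALE - curr_util[i]) * e->idle_power;
		long after = new_util[i] * e->busy_power
			   + (UTIL_SCALE - new_util[i]) * e->idle_power
			   + new_util[i] * wakeups * e->wakeup_energy;

		diff += before - after;
	}

	return diff;
}
```

For a TC2-like setup, levels[0] would describe the cpu (WFI as idle
state) and levels[1] the cluster (cluster power down as idle state).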

Does that answer your question?

>
> > +
> > +Platform energy data
> > +=====================
> > +struct sched_energy has the following members:
> > +
> > +cap_states:
> > + List of struct capacity_state representing the supported capacity states
> > + (P-states). struct capacity_state has two members: cap and power, which
> > + represent the compute capacity and the busy power of the state. The
> > + list must be ordered by capacity low->high.
> > +
> > +nr_cap_states:
> > + Number of capacity states in cap_states.
> > +
> > +max_capacity:
> > + The highest capacity supported by any of the capacity states in
> > + cap_states.
>
> can't you directly use cap_states[nr_cap_states-1].cap as the array is ordered ?

Yes, indeed. max_capacity can be removed.
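
Something along these lines would do instead. This is a sketch only: the
member names follow the documentation text above, but the types and the
helper name max_capacity_of() are assumptions. Since the list is ordered
low->high, the highest capacity is the last entry, i.e. index
nr_cap_states - 1.

```c
#include <assert.h>

/* Member names taken from the documentation text; types are assumed. */
struct capacity_state {
	unsigned long cap;	/* compute capacity */
	unsigned long power;	/* busy power */
};

struct sched_energy {
	unsigned int nr_cap_states;
	const struct capacity_state *cap_states; /* ordered by cap, low->high */
};

/* Hypothetical helper replacing the max_capacity member. */
static unsigned long max_capacity_of(const struct sched_energy *se)
{
	assert(se->nr_cap_states > 0);
	return se->cap_states[se->nr_cap_states - 1].cap;
}
```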

Morten
