Re: [PATCH RFC 6/7] sched: cfs: cpu frequency scaling based on task placement

From: Dietmar Eggemann
Date: Mon Oct 27 2014 - 13:42:10 EST


Hi Mike,

On 22/10/14 07:07, Mike Turquette wrote:
> {en,de}queue_task_fair are updated to track which cpus will have changed
> utilization values as function of task queueing.

This sentence is a little misleading. In {en,de}queue_task_fair we update
the se utilization contribution and the cfs_rq utilization for a specific
se and a specific cpu = rq_of(cfs_rq_of(se))->cpu .

> The affected cpus are
> passed on to arch_eval_cpu_freq for further machine-specific processing
> based on a selectable policy.

I'm not sure if separating the evaluation and the setting of the cpu
frequency makes sense. You could evaluate and possibly set the cpu
frequency in one go. Right now you evaluate whether the cfs_rq
utilization exceeds the thresholds for the current index every time a
task is enqueued or dequeued, but that's not necessary, since you only
try to set the cpu frequency in the softirq. The history (and the
future, if we consider blocked utilization) is already captured in the
cfs_rq utilization signal itself.

>
> arch_scale_cpu_freq is called from run_rebalance_domains as a way to
> kick off the scaling process (via wake_up_process), so as to prevent
> re-entering the {en,de}queue code.

The name is misleading from the viewpoint of the CFS sched class. The
original scaling functions of the CFS scheduler
(arch_scale_{freq,smt/cpu,rt}_capacity) scale capacity based on
frequency, uarch or rt. So your function should be called
arch_scale_util_cpu_freq, or even better arch_set_cpu_freq.

>
> All of the call sites in this patch are up for discussion. Does it make
> sense to track which cpus have updated statistics in enqueue_fair_task?

Not really, because the cfs_rq utilization tracks the history (and
possibly the future) of cpu utilization, and you can evaluate the signal
whenever you want to set the cpu frequency.

> I chose this because I wanted to gather statistics for all cpus affected
> in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the
> next version of this patch will focus on the simpler case of not using
> scheduler cgroups, which should remove a good chunk of this code,
> including the cpumask stuff.

I don't understand why you should care about task groups at all. The
task groups' contribution to the utilization of a cpu should already be
accounted for in the appropriate cpu's cfs_rq utilization signal.

But I can see a dependency on the fact that there is a difference
between systems with per-cluster (package) and per-cpu frequency
scaling. There is no SD_SHARE_FREQDOMAIN (sched domain flag) today
which, applied to the SD level MC, could tell us that we are dealing
with per-cluster frequency scaling.
On systems with per-cpu frequency scaling you can set the frequency for
individual cpus by hooking into the scheduler, but on systems with
per-cluster frequency scaling you would have to respect the maximum cpu
utilization over all cpus in the cluster.

A similar problem occurs with hardware threads (SMT sd level).

But I don't know right now how the sd topology hierarchy could come in
handy here.

>
> Also discussed at LPC14 is that fact that load_balance is a very
> interesting place to do this as frequency can be considered in concert
> with task placement. Please put forth any ideas on a sensible way to do
> this.
>
> Is run_rebalance_domains a logical place to change cpu frequency? What
> other call sites make sense?

At least it's a good place to test this feature for now.

>
> Even for platforms that can target a cpu frequency without sleeping
> (x86, some ARM platforms with PM microcontrollers) it is currently
> necessary to always kick the frequency target work out into a kthread.
> This is because of the rw_sem usage in the cpufreq core which might
> sleep. Replacing that lock type is probably a good idea.
>
> Not-signed-off-by: Mike Turquette <mturquette@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1af6f6d..3619f63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
> + struct cpumask update_cpus;
> +
> + cpumask_clear(&update_cpus);
>
> for_each_sched_entity(se) {
> if (se->on_rq)
> @@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> update_cfs_shares(cfs_rq);
> update_entity_load_avg(se, 1);
> + /* track cpus that need to be re-evaluated */
> + cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
> }
>
> + /* !CONFIG_FAIR_GROUP_SCHED */
> if (!se) {
> update_rq_runnable_avg(rq, rq->nr_running);
> add_nr_running(rq, 1);
> +
> + /*
> + * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
> + * typedef update_cpus into an int and skip all of the cpumask
> + * stuff
> + */
> + cpumask_set_cpu(cpu_of(rq), &update_cpus);
> }
> +
> + if (energy_aware())
> + if (!cpumask_empty(&update_cpus))
> + arch_eval_cpu_freq(&update_cpus);
> +
> hrtick_update(rq);
> }
>
> @@ -4049,6 +4067,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
> int task_sleep = flags & DEQUEUE_SLEEP;
> + struct cpumask update_cpus;
> +
> + cpumask_clear(&update_cpus);
>
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> @@ -4089,12 +4110,27 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
> update_cfs_shares(cfs_rq);
> update_entity_load_avg(se, 1);
> + /* track runqueues/cpus that need to be re-evaluated */
> + cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
> }
>
> + /* !CONFIG_FAIR_GROUP_SCHED */
> if (!se) {
> sub_nr_running(rq, 1);
> update_rq_runnable_avg(rq, 1);
> +
> + /*
> + * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
> + * typedef update_cpus into an int and skip all of the cpumask
> + * stuff
> + */
> + cpumask_set_cpu(cpu_of(rq), &update_cpus);
> }
> +
> + if (energy_aware())
> + if (!cpumask_empty(&update_cpus))
> + arch_eval_cpu_freq(&update_cpus);
> +
> hrtick_update(rq);
> }
>
> @@ -7536,6 +7572,9 @@ static void run_rebalance_domains(struct softirq_action *h)
> * stopped.
> */
> nohz_idle_balance(this_rq, idle);
> +
> + if (energy_aware())
> + arch_scale_cpu_freq();
> }
>
> /*
>

