Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement

From: Morten Rasmussen
Date: Mon Mar 16 2015 - 10:47:22 EST


On Fri, Mar 13, 2015 at 10:47:16PM +0000, Sai Gurrappadi wrote:
> On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> > +static int energy_aware_wake_cpu(struct task_struct *p)
> > +{
> > + struct sched_domain *sd;
> > + struct sched_group *sg, *sg_target;
> > + int target_max_cap = SCHED_CAPACITY_SCALE;
> > + int target_cpu = task_cpu(p);
> > + int i;
> > +
> > + sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> > +
> > + if (!sd)
> > + return -1;
> > +
> > + sg = sd->groups;
> > + sg_target = sg;
> > + /* Find group with sufficient capacity */
> > + do {
> > + int sg_max_capacity = group_max_capacity(sg);
> > +
> > + if (sg_max_capacity >= task_utilization(p) &&
> > + sg_max_capacity <= target_max_cap) {
> > + sg_target = sg;
> > + target_max_cap = sg_max_capacity;
> > + }
> > + } while (sg = sg->next, sg != sd->groups);
>
> If a 'small' task suddenly becomes 'big' i.e close to 100% util, the
> above loop would still pick the little/small cluster because
> task_utilization(p) is upper-bounded by the arch-invariant capacity of
> the little CPU/group right?

Yes. Usually the 'big thread stuck on little cpu' problem gets
eventually resolved by load_balance() called from periodic load-balance,
idle-balance, or nohz idle-balance. But you are right, there should be
some margin added to that comparison (capacity > task_utilization(p) +
margin).

We want to have a sort of bias tunable that will affect decisions like
this one such that if you are willing to sacrifice some energy you can
put tasks on big cpus even if they could be left on a little cpu. I
think this margin above could be linked to that tunable if not the
tunable itself.

Another problem with the sched_group selection above is that we don't
consider the current utilization when choosing the group. Small tasks
end up in the 'little' group even if it is already fully utilized and
the 'big' group might have spare capacity. We could have additional
checks here, but it would add to latency and it would eventually get
sorted out by periodic/idle balance anyway. The task would have to live
with the suboptimal placement for a while though which isn't ideal.

>
> Also, this heuristic for determining sg_target is a big little
> assumption. I don't think it is necessarily correct to assume that this
> is true for all platforms. This heuristic should be derived from the
> energy model for the platform instead.

I have had the same thought, but I ended up making that assumption since
I holds for the for the few platforms I have data for and it simplified
the code a lot. I think we can potentially remove this assumption later.

>
> > +
> > + /* Find cpu with sufficient capacity */
> > + for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> > + int new_usage = get_cpu_usage(i) + task_utilization(p);
>
> Isn't this double accounting the task's usage in case task_cpu(p)
> belongs to sg_target?

Again you are right. We could make the + task_utilization(p) conditional
on i != task_cpu(p). One argument against doing that is that in
select_task_rq_fair() task_utilization(p) hasn't been decayed yet while
it blocked load on the previous cpu (rq) has. If the task has been gone
for a long time, its blocked contribution may have decayed to zero and
therefore be a poor estimate of the utilization increase caused by
putting the task back on the previous cpu. Particularly if we still use
the non-decayed task_utilization(p) to estimate the utilization increase
on other cpus (!task_cpu(p)). In the interest of responsiveness and not
trying to squeeze tasks back onto the previous cpu which might soon run
out of capacity when utilization increases we could leave it as a sort
of performance bias.

In any case it deserves a comment in the code I think.

Thanks,
Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/