Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
From: Sai Gurrappadi
Date: Wed Mar 18 2015 - 16:21:38 EST
On 03/16/2015 07:47 AM, Morten Rasmussen wrote:
> On Fri, Mar 13, 2015 at 10:47:16PM +0000, Sai Gurrappadi wrote:
>> On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
>>> +static int energy_aware_wake_cpu(struct task_struct *p)
>>> +{
>>> + struct sched_domain *sd;
>>> + struct sched_group *sg, *sg_target;
>>> + int target_max_cap = SCHED_CAPACITY_SCALE;
>>> + int target_cpu = task_cpu(p);
>>> + int i;
>>> +
>>> + sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
>>> +
>>> + if (!sd)
>>> + return -1;
>>> +
>>> + sg = sd->groups;
>>> + sg_target = sg;
>>> + /* Find group with sufficient capacity */
>>> + do {
>>> + int sg_max_capacity = group_max_capacity(sg);
>>> +
>>> + if (sg_max_capacity >= task_utilization(p) &&
>>> + sg_max_capacity <= target_max_cap) {
>>> + sg_target = sg;
>>> + target_max_cap = sg_max_capacity;
>>> + }
>>> + } while (sg = sg->next, sg != sd->groups);
>>
>> If a 'small' task suddenly becomes 'big' i.e close to 100% util, the
>> above loop would still pick the little/small cluster because
>> task_utilization(p) is upper-bounded by the arch-invariant capacity of
>> the little CPU/group right?
>
> Yes. Usually the 'big thread stuck on little cpu' problem gets
> eventually resolved by load_balance() called from periodic load-balance,
> idle-balance, or nohz idle-balance. But you are right, there should be
> some margin added to that comparison (capacity > task_utilization(p) +
> margin).
>
> We want to have a sort of bias tunable that will affect decisions like
> this one such that if you are willing to sacrifice some energy you can
> put tasks on big cpus even if they could be left on a little cpu. I
> think this margin above could be linked to that tunable if not the
> tunable itself.
>
> Another problem with the sched_group selection above is that we don't
> consider the current utilization when choosing the group. Small tasks
> end up in the 'little' group even if it is already fully utilized and
> the 'big' group might have spare capacity. We could have additional
> checks here, but it would add to latency and it would eventually get
> sorted out by periodic/idle balance anyway. The task would have to live
> with the suboptimal placement for a while though which isn't ideal.
>
>>
>> Also, this heuristic for determining sg_target is a big little
>> assumption. I don't think it is necessarily correct to assume that this
>> is true for all platforms. This heuristic should be derived from the
>> energy model for the platform instead.
>
> I have had the same thought, but I ended up making that assumption since
> I holds for the for the few platforms I have data for and it simplified
> the code a lot. I think we can potentially remove this assumption later.
>
>>
>>> +
>>> + /* Find cpu with sufficient capacity */
>>> + for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
>>> + int new_usage = get_cpu_usage(i) + task_utilization(p);
>>
>> Isn't this double accounting the task's usage in case task_cpu(p)
>> belongs to sg_target?
>
> Again you are right. We could make the + task_utilization(p) conditional
> on i != task_cpu(p). One argument against doing that is that in
> select_task_rq_fair() task_utilization(p) hasn't been decayed yet while
> it blocked load on the previous cpu (rq) has. If the task has been gone
> for a long time, its blocked contribution may have decayed to zero and
> therefore be a poor estimate of the utilization increase caused by
> putting the task back on the previous cpu. Particularly if we still use
> the non-decayed task_utilization(p) to estimate the utilization increase
> on other cpus (!task_cpu(p)). In the interest of responsiveness and not
> trying to squeeze tasks back onto the previous cpu which might soon run
> out of capacity when utilization increases we could leave it as a sort
> of performance bias.
>
> In any case it deserves a comment in the code I think.
I think it makes sense to use the non-decayed value of the the task's
contrib. on wake but I am not sure if we should do this 2x accounting
all the time.
Another slightly related issue is that NOHZ could cause blocked rq sums
to remain stale for long periods if there aren't frequent enough
idle/nohz-idle-balances. This would cause the above bit and
energy_diff() to compute incorrect values.
-Sai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/