Re: [RFC 3/6] sched: pack small tasks

From: Vincent Guittot
Date: Mon Nov 12 2012 - 08:50:52 EST


On 9 November 2012 18:13, Morten Rasmussen <Morten.Rasmussen@xxxxxxx> wrote:
> Hi Vincent,
>
> I have experienced suboptimal buddy selection on a dual cluster setup
> (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at
> CPU level. This seems to be the correct flag settings for a system with
> only cluster level power gating.
>
> To me it looks like update_packing_domain() is not doing the right
> thing. See inline comments below.

Hi Morten,

Thanks for testing the patches.

It seems that I have too optimized the loop and remove some use cases.

>
> On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:
>> During sched_domain creation, we define a pack buddy CPU if available.
>>
>> On a system that share the powerline at all level, the buddy is set to -1
>>
>> On a dual clusters / dual cores system which can powergate each core and
>> cluster independantly, the buddy configuration will be :
>> | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>
>> Small tasks tend to slip out of the periodic load balance.
>> The best place to choose to migrate them is at their wake up.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>> ---
>> kernel/sched/core.c | 1 +
>> kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/sched.h | 1 +
>> 3 files changed, 111 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index dab7908..70cadbe 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
>> rcu_assign_pointer(rq->sd, sd);
>> destroy_sched_domains(tmp, cpu);
>>
>> + update_packing_domain(cpu);
>> update_top_cache_domain(cpu);
>> }
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4f4a4f6..8c9d3ed 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -157,6 +157,63 @@ void sched_init_granularity(void)
>> update_sysctl();
>> }
>>
>> +
>> +/*
>> + * Save the id of the optimal CPU that should be used to pack small tasks
>> + * The value -1 is used when no buddy has been found
>> + */
>> +DEFINE_PER_CPU(int, sd_pack_buddy);
>> +
>> +/* Look for the best buddy CPU that can be used to pack small tasks
>> + * We make the assumption that it doesn't wort to pack on CPU that share the
>> + * same powerline. We looks for the 1st sched_domain without the
>> + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest
>> + * power per core based on the assumption that their power efficiency is
>> + * better */
>> +void update_packing_domain(int cpu)
>> +{
>> + struct sched_domain *sd;
>> + int id = -1;
>> +
>> + sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
>> + if (!sd)
>> + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
>> + else
>> + sd = sd->parent;
> sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched
> groups of the parent level would represent the power domains. If get it
> right, we want to pack inside the cluster first and only let first cpu

You probably wanted to use sched_group instead of cluster because
cluster is only a special use case, didn't you ?

> of the cluster do packing on another cluster. So all cpus - except the
> first one - in the current sched domain should find its buddy within the
> domain and only the first one should go to the parent sched domain to
> find its buddy.

We don't want to pack in the current sched_domain because it shares
power domain. We want to pack at the parent level

>
> I propose the following fix:
>
> - sd = sd->parent;
> + if (cpumask_first(sched_domain_span(sd)) == cpu
> + || !sd->parent)
> + sd = sd->parent;

We always look for the buddy in the parent level whatever the cpu
position in the mask is.

>
>
>> +
>> + while (sd) {
>> + struct sched_group *sg = sd->groups;
>> + struct sched_group *pack = sg;
>> + struct sched_group *tmp = sg->next;
>> +
>> + /* 1st CPU of the sched domain is a good candidate */
>> + if (id == -1)
>> + id = cpumask_first(sched_domain_span(sd));
>
> There is no guarantee that id is in the sched group pointed to by
> sd->groups, which is implicitly assumed later in the search loop. We
> need to find the sched group that contains id and point sg to that
> instead. I haven't found an elegant way to find that group, but the fix
> below should at least give the right result.
>
> + /* Find sched group of candidate */
> + tmp = sd->groups;
> + do {
> + if (cpumask_test_cpu(id, sched_group_cpus(tmp)))
> + {
> + sg = tmp;
> + break;
> + }
> + } while (tmp = tmp->next, tmp != sd->groups);
> +
> + pack = sg;
> + tmp = sg->next;


I have a new loop which solves your issue and others. I will use it
for the next version

+ while (sd) {
+ struct sched_group *sg = sd->groups;
+ struct sched_group *pack = sg;
+ struct sched_group *tmp;
+
+ /* The 1st CPU of the local group is a good candidate */
+ id = cpumask_first(sched_group_cpus(pack));
+
+ /* loop the sched groups to find the best one */
+ for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+ if (tmp->sgp->power * pack->group_weight >
+ pack->sgp->power * tmp->group_weight)
+ continue;
+
+ if ((tmp->sgp->power * pack->group_weight ==
+ pack->sgp->power * tmp->group_weight)
+ && (cpumask_first(sched_group_cpus(tmp)) >= id))
+ continue;
+
+ /* we have found a better group */
+ pack = tmp;
+
+ /* Take the 1st CPU of the new group */
+ id = cpumask_first(sched_group_cpus(pack));
+ }
+
+ /* Look for another CPU than itself */
+ if ((id != cpu)
+ || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))
+ break;
+
+ sd = sd->parent;
+ }

Regards,
Vincent

>
> Regards,
> Morten
>
>> +
>> + /* loop the sched groups to find the best one */
>> + while (tmp != sg) {
>> + if (tmp->sgp->power * sg->group_weight <
>> + sg->sgp->power * tmp->group_weight)
>> + pack = tmp;
>> + tmp = tmp->next;
>> + }
>> +
>> + /* we have found a better group */
>> + if (pack != sg)
>> + id = cpumask_first(sched_group_cpus(pack));
>> +
>> + /* Look for another CPU than itself */
>> + if ((id != cpu)
>> + || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))
>> + break;
>> +
>> + sd = sd->parent;
>> + }
>> +
>> + pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
>> + per_cpu(sd_pack_buddy, cpu) = id;
>> +}
>> +
>> #if BITS_PER_LONG == 32
>> # define WMULT_CONST (~0UL)
>> #else
>> @@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target)
>> return target;
>> }
>>
>> +static inline bool is_buddy_busy(int cpu)
>> +{
>> + struct rq *rq = cpu_rq(cpu);
>> +
>> + /*
>> + * A busy buddy is a CPU with a high load or a small load with a lot of
>> + * running tasks.
>> + */
>> + return ((rq->avg.usage_avg_sum << rq->nr_running) >
>> + rq->avg.runnable_avg_period);
>> +}
>> +
>> +static inline bool is_light_task(struct task_struct *p)
>> +{
>> + /* A light task runs less than 25% in average */
>> + return ((p->se.avg.usage_avg_sum << 2) < p->se.avg.runnable_avg_period);
>> +}
>> +
>> +static int check_pack_buddy(int cpu, struct task_struct *p)
>> +{
>> + int buddy = per_cpu(sd_pack_buddy, cpu);
>> +
>> + /* No pack buddy for this CPU */
>> + if (buddy == -1)
>> + return false;
>> +
>> + /*
>> + * If a task is waiting for running on the CPU which is its own buddy,
>> + * let the default behavior to look for a better CPU if available
>> + * The threshold has been set to 37.5%
>> + */
>> + if ((buddy == cpu)
>> + && ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum * 5)))
>> + return false;
>> +
>> + /* buddy is not an allowed CPU */
>> + if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
>> + return false;
>> +
>> + /*
>> + * If the task is a small one and the buddy is not overloaded,
>> + * we use buddy cpu
>> + */
>> + if (!is_light_task(p) || is_buddy_busy(buddy))
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> /*
>> * sched_balance_self: balance the current task (running on cpu) in domains
>> * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
>> @@ -3098,6 +3204,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>> if (p->nr_cpus_allowed == 1)
>> return prev_cpu;
>>
>> + if (check_pack_buddy(cpu, p))
>> + return per_cpu(sd_pack_buddy, cpu);
>> +
>> if (sd_flag & SD_BALANCE_WAKE) {
>> if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>> want_affine = 1;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index a95d5c1..086d8bf 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -875,6 +875,7 @@ static inline void idle_balance(int cpu, struct rq *rq)
>>
>> extern void sysrq_sched_debug_show(void);
>> extern void sched_init_granularity(void);
>> +extern void update_packing_domain(int cpu);
>> extern void update_max_interval(void);
>> extern void update_group_power(struct sched_domain *sd, int cpu);
>> extern int update_runtime(struct notifier_block *nfb, unsigned long action, void *hcpu);
>> --
>> 1.7.9.5
>>
>>
>> _______________________________________________
>> linaro-dev mailing list
>> linaro-dev@xxxxxxxxxxxxxxxx
>> http://lists.linaro.org/mailman/listinfo/linaro-dev
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/