Re: [PATCH] sched: balance_cpu to consider other cpus in its groupas target of (pinned) task migration

From: Peter Zijlstra
Date: Mon Jun 04 2012 - 05:01:20 EST


On Mon, 2012-06-04 at 11:27 +0530, Prashanth Nageshappa wrote:
> Based on the description in
> http://marc.info/?l=linux-kernel&m=133108682113018&w=2 , I was able to recreate
> a problem where in a SCHED_OTHER thread never gets runtime, even though there is
> one allowed CPU where it can run and make progress.

Do not use external references in changelogs if you don't absolutely
need to.

> On a dual socket box (4 cores per socket, 2 threads per core) with following
> config:
> 0 8 1 9 4 12 5 13
> 2 10 3 11 6 14 7 15
> |__________| |__________|
> socket 1 socket 2
>
> If we have following 4 tasks (2 SCHED_FIFO and 2 SCHED_OTHER) started in the
> following order:
> 1> SCHED_FIFO cpu hogging task bound to cpu 1
> 2> SCHED_OTHER cpu hogging task bound to cpus 3 & 9 - running on cpu 3
> sleeps and wakes up after all other tasks are started
> 3> SCHED_FIFO cpu hogging task bound to cpu 3
> 4> SCHED_OTHER cpu hogging task bound to cpu 9
>
> Once all the 4 tasks are started, we observe that 2nd task is starved of CPU
> after waking up. When it wakes up, it wakes up on its prev_cpu (3) where
> a FIFO task is now hogging the cpu. To prevent starvation, 2nd task
> needs to be pulled to cpu 9. However, between cpus 1, 9, cpu1 is the chosen
> cpu that attempts pulling tasks towards its core. When it tries pulling
> 2nd tasks towards its core, it is unable to do so as cpu1 is not in 2nd
> task's cpus_allowed mask. Ideally cpu1 should note that the task can be
> moved to its sibling and trigger that movement.
>
> In this patch, we try to identify if load balancing goal was not achieved
> completely because of destination cpu not being in cpus_allowed mask of target
> task(s) and retry load balancing to try and move tasks to other cpus in the
> same sched_group as that of destination cpu.

Either expand a little more on the implementation or preferably add some
comments.

> Tested on tip commit cca44889.

This is not a sane addition to the changelog.

> Signed-off-by: Srivatsa Vaddagiri <vatsa@xxxxxxxxxxxxxxxxxx>

Did vatsa write this patch? If so, you forgot a From header, if not,
wtf!?

> Signed-off-by: Prashanth Nageshappa <prashanth@xxxxxxxxxxxxxxxxxx>
>
> ----
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index de49ed5..da275d8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3098,6 +3098,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
>
> #define LBF_ALL_PINNED 0x01
> #define LBF_NEED_BREAK 0x02
> +#define LBF_NEW_DST_CPU 0x04
>
> struct lb_env {
> struct sched_domain *sd;
> @@ -3108,6 +3109,8 @@ struct lb_env {
> int dst_cpu;
> struct rq *dst_rq;
>
> + struct cpumask *dst_grpmask;
> + int new_dst_cpu;
> enum cpu_idle_type idle;
> long imbalance;
> unsigned int flags;
> @@ -3198,7 +3201,25 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> * 3) are cache-hot on their current CPU.
> */
> if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
> - schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + int new_dst_cpu;
> +
> + if (!env->dst_grpmask) {
> + schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + return 0;
> + }
> + /*
> + * check if cpus_allowed has any cpus in the same sched_group
> + * as that of dst_cpu and set LBF_NEW_DST_CPU and new_dst_cpu
> + * accordingly
> + */
> + new_dst_cpu = cpumask_first_and(env->dst_grpmask,
> + tsk_cpus_allowed(p));
> + if (new_dst_cpu >= nr_cpu_ids) {
> + schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
> + } else {
> + env->flags |= LBF_NEW_DST_CPU;
> + env->new_dst_cpu = new_dst_cpu;
> + }

Only a flimsy comment on what the code does -- I can read C thank you.
Not a single comment on why it does this.

> return 0;
> }
> env->flags &= ~LBF_ALL_PINNED;
> @@ -4418,7 +4439,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> struct sched_domain *sd, enum cpu_idle_type idle,
> int *balance)
> {
> - int ld_moved, active_balance = 0;
> + int ld_moved, old_ld_moved, active_balance = 0;
> struct sched_group *group;
> struct rq *busiest;
> unsigned long flags;
> @@ -4428,6 +4449,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> .sd = sd,
> .dst_cpu = this_cpu,
> .dst_rq = this_rq,
> + .dst_grpmask = sched_group_cpus(sd->groups),
> .idle = idle,
> .loop_break = sched_nr_migrate_break,
> .find_busiest_queue = find_busiest_queue,
> @@ -4461,6 +4483,7 @@ redo:
> schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>
> ld_moved = 0;
> + old_ld_moved = 0;
> if (busiest->nr_running > 1) {
> /*
> * Attempt to move tasks. If find_busiest_group has found
> @@ -4488,12 +4511,27 @@ more_balance:
> env.flags &= ~LBF_NEED_BREAK;
> goto more_balance;
> }
> -
> /*
> * some other cpu did the load balance for us.
> */
> - if (ld_moved && this_cpu != smp_processor_id())
> - resched_cpu(this_cpu);
> + if ((ld_moved != old_ld_moved) &&
> + (env.dst_cpu != smp_processor_id()))
> + resched_cpu(env.dst_cpu);

The whole old_ld_moved stuff sucks, and preferably needs a rename, or at
least a comment.

> + if ((env.flags & LBF_NEW_DST_CPU) && (env.imbalance > 0)) {
> + /*
> + * we could not balance completely as some tasks
> + * were not allowed to move to the dst_cpu, so try
> + * again with new_dst_cpu.
> + */
> + this_rq = cpu_rq(env.new_dst_cpu);
> + env.dst_rq = this_rq;
> + env.dst_cpu = env.new_dst_cpu;
> + env.flags &= ~LBF_NEW_DST_CPU;
> + env.loop = 0;
> + old_ld_moved = ld_moved;
> + goto more_balance;
> + }
>

OK, so previously we only pulled to ourselves, now you make cpu x move
from cpu y to cpu z. This changes the dynamic of the load-balancer, not
a single word on that and its impact/ramifications.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/