[PATCH] sched: new feature to spread tasks inside cpu-groups

From: Michael Wang
Date: Mon Jun 30 2014 - 03:43:44 EST


Recent testing shows that the cpu-cgroup fails to manage a mixed workload
of dbench and stress. The setup was:

mkdir /cgroup/cpu/l1/
mkdir /cgroup/cpu/l1/A
mkdir /cgroup/cpu/l1/B
mkdir /cgroup/cpu/l1/C

echo $$ > /cgroup/cpu/l1/A/tasks ; dbench 6
echo $$ > /cgroup/cpu/l1/B/tasks ; stress 6
echo $$ > /cgroup/cpu/l1/C/tasks ; stress 6

Although the cpu-shares ratio was 1:1:1 (A:B:C), the CPU% ratio was around
1:5:5.

Now by doing:

echo 102400 > /cgroup/cpu/l1/A/cpu.shares

the cpu-shares ratio becomes 100:1:1; however, the CPU% ratio was still
around 1:5:5.

This test can be extended to 10000:1:1 on cpu-shares or even further, and
the CPU% ratio still stays around 1:5:5.
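
As a rough illustration (not part of the patch), the split one would
naively expect from the cpu-shares values alone can be computed as below.
This is only a sketch: expected_pct() is a hypothetical helper, B and C are
assumed to stay at the default cpu.shares of 1024, and it ignores that a
group can only consume its share while its tasks are actually runnable.

```c
#include <assert.h>

/*
 * Hypothetical helper: percentage of CPU time group 'idx' would receive
 * if time were divided strictly in proportion to cpu.shares.
 */
static double expected_pct(const unsigned long *shares, int n, int idx)
{
	unsigned long total = 0;
	int i;

	assert(n > 0 && idx >= 0 && idx < n);

	for (i = 0; i < n; i++)
		total += shares[i];

	assert(total > 0);

	return 100.0 * (double)shares[idx] / (double)total;
}
```

With shares of { 102400, 1024, 1024 } this gives roughly 98% for A and 1%
each for B and C, which is far from the 1:5:5 split observed in the test.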

We used to think this happened because dbench only needs that much CPU%,
but that is not true: after we bound each instance to different CPUs, the
CPU% ratio became 3:4:4 with only 10:1:1 on cpu-shares.

However, binding tasks to individual CPUs is definitely not a good solution;
we need a feature that can spread tasks inside a group while still following
the current scheduler logic.

This patch introduces a new feature that meets these requirements: it
locates an idle cfs_rq inside the cpu-cgroup when, and only when, we are
about to give up searching for an idle CPU. This makes tasks spread more
actively inside the cpu-cgroup than usual.
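
The search described above can be modelled in user space roughly as
follows. This is a simplified sketch, not the kernel code: all toy_* names
are hypothetical, CPU sets are plain bitmasks, and the sched_domain
hierarchy is flattened into a single list of CPU groups.

```c
#include <assert.h>

#define TOY_NR_CPUS 8

/* Toy stand-in for the per-CPU cfs_rq state of one task group */
struct toy_tg {
	unsigned int nr_running[TOY_NR_CPUS];
};

/* A CPU is "idle from the group's view" if the group runs nothing there */
static int toy_tg_idle_cpu(const struct toy_tg *tg, int cpu)
{
	return tg->nr_running[cpu] == 0;
}

/*
 * groups[] holds one CPU bitmask per group; 'allowed' is the task's
 * affinity mask. Returns a CPU from the first group whose CPUs are all
 * idle for this task group, or 'target' if no such group exists.
 */
static int toy_tg_idle_sibling(const struct toy_tg *tg,
			       const unsigned long *groups, int nr_groups,
			       unsigned long allowed, int target)
{
	int g, cpu;

	assert(target >= 0 && target < TOY_NR_CPUS);

	if (toy_tg_idle_cpu(tg, target))
		return target;

	for (g = 0; g < nr_groups; g++) {
		int all_idle = 1;

		/* Skip groups the task is not allowed to run in */
		if (!(groups[g] & allowed))
			continue;

		/* Every CPU in the group must be idle for this task group */
		for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
			if (!(groups[g] & (1UL << cpu)))
				continue;
			if (cpu == target || !toy_tg_idle_cpu(tg, cpu)) {
				all_idle = 0;
				break;
			}
		}
		if (!all_idle)
			continue;

		/* Pick the first allowed CPU of that group */
		for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			if (groups[g] & allowed & (1UL << cpu))
				return cpu;
	}

	return target;
}
```

Note that a CPU busy with another group's tasks still counts as idle here,
which is what lets the group's own tasks spread out.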

Now by doing:

echo SPREAD_INSIDE_GROUP > /sys/kernel/debug/sched_features

10:1:1 on cpu-shares now leads to 3:4:4 on CPU%, and the throughput of
dbench rises as well, so we finally have a way to help dbench (a transaction
workload) compete with stress (a CPU-intensive workload).

CC: Ingo Molnar <mingo@xxxxxxxxxx>
CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Signed-off-by: Michael Wang <wangyun@xxxxxxxxxxxxxxxxxx>
---
 kernel/sched/fair.c     |   63 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |    8 ++++++
 2 files changed, 71 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..0e3022c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4409,6 +4409,51 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }

+static inline int tg_idle_cpu(struct task_group *tg, int cpu)
+{
+	return !tg->cfs_rq[cpu]->nr_running;
+}
+
+/*
+ * Try and locate an idle CPU in the sched_domain from tg's view.
+ */
+static int tg_idle_sibling(struct task_struct *p, int target)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+	int i = task_cpu(p);
+	struct task_group *tg = task_group(p);
+
+	if (tg_idle_cpu(tg, target))
+		goto done;
+
+	sd = rcu_dereference(per_cpu(sd_llc, target));
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (i == target || !tg_idle_cpu(tg, i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
+	}
+
+done:
+
+	return target;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -4417,6 +4462,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
+	struct sched_entity *se = task_group(p)->se[i];

 	if (idle_cpu(target))
 		return target;
@@ -4451,6 +4497,23 @@ next:
 		} while (sg != sd->groups);
 	}
 done:
+
+	if (!idle_cpu(target) && sched_feat(SPREAD_INSIDE_GROUP)) {
+		/*
+		 * Before we arbitrarily return the target, try to locate an
+		 * idle cfs_rq inside the task's group using the same logic.
+		 *
+		 * This tries to prevent tasks from gathering, especially
+		 * those that wake-affine rapidly while rarely being balanced;
+		 * for them, wakeup is the only chance to spread out.
+		 *
+		 * We only need to take care of tasks that flip frequently;
+		 * the load-balance routine will take care of the others.
+		 */
+		if (p->wakee_flips > this_cpu_read(sd_llc_size))
+			return tg_idle_sibling(p, target);
+	}
+
 	return target;
 }

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d1..532d6e9 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -6,6 +6,14 @@
 SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)

 /*
+ * Adopt the logic of select_idle_sibling() to pick an idle cfs_rq
+ * inside the task's cpu-group; this helps to spread the group's
+ * tasks internally and benefits those who prefer balancing over
+ * gathering.
+ */
+SCHED_FEAT(SPREAD_INSIDE_GROUP, false)
+
+/*
  * Place new tasks ahead so that they do not starve already running
  * tasks
  */
--
1.7.9.5
