Re: [RFC PATCH 0/2] sched: simplify the select_task_rq_fair()

From: Michael Wang
Date: Mon Jan 21 2013 - 22:43:19 EST


On 01/21/2013 05:44 PM, Mike Galbraith wrote:
> On Mon, 2013-01-21 at 17:22 +0800, Michael Wang wrote:
>> On 01/21/2013 05:09 PM, Mike Galbraith wrote:
>>> On Mon, 2013-01-21 at 15:45 +0800, Michael Wang wrote:
>>>> On 01/21/2013 03:09 PM, Mike Galbraith wrote:
>>>>> On Mon, 2013-01-21 at 07:42 +0100, Mike Galbraith wrote:
>>>>>> On Mon, 2013-01-21 at 13:07 +0800, Michael Wang wrote:
>>>>>
>>>>>>> May be we could try change this back to the old way later, after the aim
>>>>>>> 7 test on my server.
>>>>>>
>>>>>> Yeah, something funny is going on.
>>>>>
>>>>> Never entering balance path kills the collapse. Asking wake_affine()
>>>>> wrt the pull as before, but allowing us to continue should no idle cpu
>>>>> be found, still collapsed. So the source of funny behavior is indeed in
>>>>> balance_path.
>>>>
>>>> Below patch based on the patch set could help to avoid enter balance path
>>>> if affine_sd could be found, just like the old logical, would you like to
>>>> take a try and see whether it could help fix the collapse?
>>>
>>> No, it does not.
>>
>> Hmm...what have changed now compared to the old logical?
>
> What I did earlier to confirm the collapse originates in balance_path is
> below. I just retested to confirm.
>
> Tasks jobs/min jti jobs/min/task real cpu
> 1 435.34 100 435.3448 13.92 3.76 Mon Jan 21 10:24:00 2013
> 1 440.09 100 440.0871 13.77 3.76 Mon Jan 21 10:24:22 2013
> 1 440.41 100 440.4070 13.76 3.75 Mon Jan 21 10:24:45 2013
> 5 2467.43 99 493.4853 12.28 10.71 Mon Jan 21 10:24:59 2013
> 5 2445.52 99 489.1041 12.39 10.98 Mon Jan 21 10:25:14 2013
> 5 2475.49 99 495.0980 12.24 10.59 Mon Jan 21 10:25:27 2013
> 10 4963.14 99 496.3145 12.21 20.64 Mon Jan 21 10:25:41 2013
> 10 4959.08 99 495.9083 12.22 21.26 Mon Jan 21 10:25:54 2013
> 10 5415.55 99 541.5550 11.19 11.54 Mon Jan 21 10:26:06 2013
> 20 9934.43 96 496.7213 12.20 33.52 Mon Jan 21 10:26:18 2013
> 20 9950.74 98 497.5369 12.18 36.52 Mon Jan 21 10:26:31 2013
> 20 9893.88 96 494.6939 12.25 34.39 Mon Jan 21 10:26:43 2013
> 40 18937.50 98 473.4375 12.80 84.74 Mon Jan 21 10:26:56 2013
> 40 18996.87 98 474.9216 12.76 88.64 Mon Jan 21 10:27:09 2013
> 40 19146.92 98 478.6730 12.66 89.98 Mon Jan 21 10:27:22 2013
> 80 37610.55 98 470.1319 12.89 112.01 Mon Jan 21 10:27:35 2013
> 80 37321.02 98 466.5127 12.99 114.21 Mon Jan 21 10:27:48 2013
> 80 37610.55 98 470.1319 12.89 111.77 Mon Jan 21 10:28:01 2013
> 160 69109.05 98 431.9316 14.03 156.81 Mon Jan 21 10:28:15 2013
> 160 69505.38 98 434.4086 13.95 155.33 Mon Jan 21 10:28:29 2013
> 160 69207.71 98 432.5482 14.01 155.79 Mon Jan 21 10:28:43 2013
> 320 108033.43 98 337.6045 17.95 314.01 Mon Jan 21 10:29:01 2013
> 320 108577.83 98 339.3057 17.86 311.79 Mon Jan 21 10:29:19 2013
> 320 108395.75 98 338.7367 17.89 312.55 Mon Jan 21 10:29:37 2013
> 640 151440.84 98 236.6263 25.61 620.37 Mon Jan 21 10:30:03 2013
> 640 151440.84 97 236.6263 25.61 621.23 Mon Jan 21 10:30:29 2013
> 640 151145.75 98 236.1652 25.66 622.35 Mon Jan 21 10:30:55 2013
> 1280 190117.65 98 148.5294 40.80 1228.40 Mon Jan 21 10:31:36 2013
> 1280 189977.96 98 148.4203 40.83 1229.91 Mon Jan 21 10:32:17 2013
> 1280 189560.12 98 148.0938 40.92 1231.71 Mon Jan 21 10:32:58 2013
> 2560 217857.04 98 85.1004 71.21 2441.61 Mon Jan 21 10:34:09 2013
> 2560 217338.19 98 84.8977 71.38 2448.76 Mon Jan 21 10:35:21 2013
> 2560 217795.87 97 85.0765 71.23 2443.12 Mon Jan 21 10:36:32 2013
>
> That was with your change backed out, and the q/d below applied.

So that change will help to solve the issue? good to know :)

But it will invoke wake_affine() with out any delay, the benefit
of the patch set will be reduced a lot...

I think this change help to solve the issue because it avoid jump
into balance path when wakeup for any cases, I think we can do
some change like below to achieve this and meanwhile gain benefit
from delay wake_affine().

Since the issue could not been reproduced on my side, I don't know
whether the patch benefit or not, so if you are willing to send out
a formal patch, I would be glad to include it in my patch set ;-)

And another patch below below is a debug one, which will print out
all the sbm info, so we could check whether it was initialized
correctly, just use command "dmesg | grep WYT" to show the map.

Regards,
Michael Wang

---
kernel/sched/fair.c | 42 +++++++++++++++++++++++++-----------------
1 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2aa26c1..4361333 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3250,7 +3250,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
}

/*
- * Try and locate an idle CPU in the sched_domain.
+ * Try and locate an idle CPU in the sched_domain, return -1 if failed.
*/
static int select_idle_sibling(struct task_struct *p, int target)
{
@@ -3292,13 +3292,13 @@ static int select_idle_sibling(struct task_struct *p, int target)

target = cpumask_first_and(sched_group_cpus(sg),
tsk_cpus_allowed(p));
- goto done;
+ return target;
next:
sg = sg->next;
} while (sg != sd->groups);
}
-done:
- return target;
+
+ return -1;
}

/*
@@ -3342,40 +3342,48 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
* may has already been cached on prev_cpu, and usually
* they require low latency.
*
- * So firstly try to locate an idle cpu shared the cache
+ * Therefor, balance path in such case will cause damage
+ * and bring benefit synchronously, wakeup on prev_cpu
+ * may better than wakeup on a new lower load cpu for the
+ * cached memory, and we never know.
+ *
+ * So the principle is, try to find an idle cpu as close to
+ * prev_cpu as possible, if failed, just take prev_cpu.
+ *
+ * Firstly try to locate an idle cpu shared the cache
* with prev_cpu, it has the chance to break the load
* balance, fortunately, select_idle_sibling() will search
* from top to bottom, which help to reduce the chance in
* some cases.
*/
new_cpu = select_idle_sibling(p, prev_cpu);
- if (idle_cpu(new_cpu))
+ if (new_cpu != -1)
goto unlock;

/*
* No idle cpu could be found in the topology of prev_cpu,
- * before jump into the slow balance_path, try search again
- * in the topology of current cpu if it is the affine of
- * prev_cpu.
+ * before take the prev_cpu, try search again in the
+ * topology of current cpu if it is the affine of prev_cpu.
*/
- if (cpu == prev_cpu ||
- !sbm->affine_map[prev_cpu] ||
+ if (cpu == prev_cpu || !sbm->affine_map[prev_cpu] ||
!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
- goto balance_path;
+ goto take_prev;

new_cpu = select_idle_sibling(p, cpu);
- if (!idle_cpu(new_cpu))
- goto balance_path;
-
/*
* Invoke wake_affine() finally since it is no doubt a
* performance killer.
*/
- if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
+ if ((new_cpu != -1) &&
+ wake_affine(sbm->affine_map[prev_cpu], p, sync))
goto unlock;
+
+take_prev:
+ new_cpu = prev_cpu;
+ goto unlock;
}

-balance_path:
+ /* Balance path. */
new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
sd = sbm->sd[type][sbm->top_level[type]];

--
1.7.4.1

DEBUG PATCH:

---
kernel/sched/core.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c63303..f251f29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5578,6 +5578,35 @@ static void update_top_cache_domain(int cpu)
static int sbm_max_level;
DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_balance_map, sbm_array);

+static void debug_sched_balance_map(int cpu)
+{
+ int i, type, level = 0;
+ struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
+
+ printk("WYT: sbm of cpu %d\n", cpu);
+
+ for (type = 0; type < SBM_MAX_TYPE; type++) {
+ if (type == SBM_EXEC_TYPE)
+ printk("WYT: \t exec map\n");
+ else if (type == SBM_FORK_TYPE)
+ printk("WYT: \t fork map\n");
+ else if (type == SBM_WAKE_TYPE)
+ printk("WYT: \t wake map\n");
+
+ for (level = 0; level < sbm_max_level; level++) {
+ if (sbm->sd[type][level])
+ printk("WYT: \t\t sd %x, idx %d, level %d, weight %d\n", sbm->sd[type][level], level, sbm->sd[type][level]->level, sbm->sd[type][level]->span_weight);
+ }
+ }
+
+ printk("WYT: \t affine map\n");
+
+ for_each_possible_cpu(i) {
+ if (sbm->affine_map[i])
+ printk("WYT: \t\t affine with cpu %x in sd %x, weight %d\n", i, sbm->affine_map[i], sbm->affine_map[i]->span_weight);
+ }
+}
+
static void build_sched_balance_map(int cpu)
{
struct sched_balance_map *sbm = &per_cpu(sbm_array, cpu);
@@ -5688,6 +5717,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
* destroy_sched_domains() already do the work.
*/
build_sched_balance_map(cpu);
+ debug_sched_balance_map(cpu);
rcu_assign_pointer(rq->sbm, sbm);
}

--
1.7.4.1

>
> ---
> kernel/sched/fair.c | 27 ++++++---------------------
> 1 file changed, 6 insertions(+), 21 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3337,6 +3337,8 @@ select_task_rq_fair(struct task_struct *
> goto unlock;
>
> if (sd_flag & SD_BALANCE_WAKE) {
> + new_cpu = prev_cpu;
> +
> /*
> * Tasks to be waked is special, memory it relied on
> * may has already been cached on prev_cpu, and usually
> @@ -3348,33 +3350,16 @@ select_task_rq_fair(struct task_struct *
> * from top to bottom, which help to reduce the chance in
> * some cases.
> */
> - new_cpu = select_idle_sibling(p, prev_cpu);
> + new_cpu = select_idle_sibling(p, new_cpu);
> if (idle_cpu(new_cpu))
> goto unlock;
>
> - /*
> - * No idle cpu could be found in the topology of prev_cpu,
> - * before jump into the slow balance_path, try search again
> - * in the topology of current cpu if it is the affine of
> - * prev_cpu.
> - */
> - if (!sbm->affine_map[prev_cpu] ||
> - !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
> - goto balance_path;
> -
> - new_cpu = select_idle_sibling(p, cpu);
> - if (!idle_cpu(new_cpu))
> - goto balance_path;
> + if (wake_affine(sbm->affine_map[cpu], p, sync))
> + new_cpu = select_idle_sibling(p, cpu);
>
> - /*
> - * Invoke wake_affine() finally since it is no doubt a
> - * performance killer.
> - */
> - if (wake_affine(sbm->affine_map[prev_cpu], p, sync))
> - goto unlock;
> + goto unlock;
> }
>
> -balance_path:
> new_cpu = (sd_flag & SD_BALANCE_WAKE) ? prev_cpu : cpu;
> sd = sbm->sd[type][sbm->top_level[type]];
>
>
>
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/