[PATCH] sched: make sure busiest group and run queue are pullable

From: Peter Williams
Date: Fri Mar 24 2006 - 22:41:28 EST


Peter Williams wrote:
Peter Williams wrote:
Siddha, Suresh B wrote:
more issues with smpnice patch...

a) Consider a 4-way system (a simple SMP system with no HT or multiple cores) scenario
where a high priority task (nice -20) is running on P0 and two normal
priority tasks are running on P1. Load balancing with the smp nice code
will never be able to detect an imbalance and hence will never move one of the normal priority tasks on P1 to the idle CPUs P2 or P3.

Why?

OK, I think I know why. The load balancing code will always decide that P0 is the busiest CPU, right?
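
To make the arithmetic concrete, here is a tiny userspace sketch of the comparison (the weight values are invented for illustration and are not the actual smpnice weights). The point is only the shape of the problem: one nice -20 task can carry more weighted load than two nice 0 tasks combined, so the weighted-load comparison picks P0 as "busiest" even though P0 has nothing that can be pulled.

#include <stdio.h>

int main(void)
{
	/* Invented weights, for illustration only. */
	unsigned long nice_0_weight = 128;		/* a normal (nice 0) task */
	unsigned long nice_m20_weight = 128 * 8;	/* a nice -20 task */

	unsigned long p0_load = nice_m20_weight;	/* P0: one nice -20 task */
	unsigned long p1_load = 2 * nice_0_weight;	/* P1: two nice 0 tasks */

	/*
	 * The balancer compares weighted load, so P0 "wins" despite
	 * having only one runnable task, which can never be moved.
	 */
	printf("P0: load %lu, 1 task\n", p0_load);
	printf("P1: load %lu, 2 tasks\n", p1_load);
	printf("busiest by weighted load: %s\n",
	       p0_load > p1_load ? "P0" : "P1");
	return 0;
}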

Attached is a patch that addresses this problem. The strategies employed are:

1. For find_busiest_group(), only consider groups that have at least one CPU with more than one running task as candidates for "busiest", and

2. For find_busiest_queue(), only consider run queues that have more than one running task as candidates for "busiest" (the shared test is sketched below).
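
The test shared by both strategies can be written as a small predicate. The helper below is purely illustrative and is not part of the patch (the patch open-codes the check): a run queue is only a useful source of pullable tasks if it has more than one running task, because move_tasks() can never migrate a CPU's sole running task.

/* Illustrative only -- not part of the patch below. */
static inline int rq_is_pullable(const runqueue_t *rq)
{
	return rq->nr_running > 1;
}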

I think that the savings from abandoning load balancing attempts earlier (attempts that would, most probably, have been abandoned anyway; see the next paragraph) will compensate for the extra overhead introduced in these functions.

For a system where all tasks have nice==0, I think the only likely behavioural change is this: without these checks, a "busiest" queue that has only one runnable task when the tests are made (and from which move_tasks() would eventually move nothing) has a small chance of acquiring extra runnable tasks before the locks are taken in preparation for calling move_tasks(), so load balancing may actually take place. I think that this effect can be safely ignored.

Signed-off-by: Peter Williams <pwil3058@xxxxxxxxxxxxxx>

Peter
--
Peter Williams pwil3058@xxxxxxxxxxxxxx

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Index: MM-2.6.X/kernel/sched.c
===================================================================
--- MM-2.6.X.orig/kernel/sched.c 2006-03-25 13:43:06.000000000 +1100
+++ MM-2.6.X/kernel/sched.c 2006-03-25 13:56:37.000000000 +1100
@@ -2115,6 +2115,7 @@ find_busiest_group(struct sched_domain *
 		int local_group;
 		int i;
 		unsigned long sum_nr_running, sum_weighted_load;
+		unsigned int nr_loaded_cpus = 0; /* where nr_running > 1 */
 
 		local_group = cpu_isset(this_cpu, group->cpumask);
 
@@ -2135,6 +2136,8 @@ find_busiest_group(struct sched_domain *
 
 			avg_load += load;
 			sum_nr_running += rq->nr_running;
+			if (rq->nr_running > 1)
+				++nr_loaded_cpus;
 			sum_weighted_load += rq->raw_weighted_load;
 		}
 
@@ -2149,7 +2152,7 @@ find_busiest_group(struct sched_domain *
 			this = group;
 			this_nr_running = sum_nr_running;
 			this_load_per_task = sum_weighted_load;
-		} else if (avg_load > max_load) {
+		} else if (nr_loaded_cpus && avg_load > max_load) {
 			max_load = avg_load;
 			busiest = group;
 			busiest_nr_running = sum_nr_running;
@@ -2258,16 +2261,16 @@ out_balanced:
 static runqueue_t *find_busiest_queue(struct sched_group *group,
 	enum idle_type idle)
 {
-	unsigned long load, max_load = 0;
-	runqueue_t *busiest = NULL;
+	unsigned long max_load = 0;
+	runqueue_t *busiest = NULL, *rqi;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
-		load = weighted_cpuload(i);
+		rqi = cpu_rq(i);
 
-		if (load > max_load) {
-			max_load = load;
-			busiest = cpu_rq(i);
+		if (rqi->nr_running > 1 && rqi->raw_weighted_load > max_load) {
+			max_load = rqi->raw_weighted_load;
+			busiest = rqi;
 		}
 	}