Re: [PATCH 0/3][RFC] Improve load balancing when tasks have large weight differential

From: Nikhil Rao
Date: Wed Sep 29 2010 - 15:32:35 EST


On Tue, Sep 28, 2010 at 6:45 PM, Mike Galbraith <efault@xxxxxx> wrote:
> On Tue, 2010-09-28 at 14:15 -0700, Nikhil Rao wrote:
>
>> Thanks for running this. I've not been able to reproduce what you are
>> seeing on the few test machines that I have (different combinations of
>> MC, CPU and NODE domains). Can you please give me more info about
>> your setup?
>
> It's a plain-jane Q6600 box, so has only MC and CPU domains.
>
> It doesn't necessarily _instantly_ "stick", can take a couple tries, or
> a little time.

The closest I have is a quad-core dual-socket machine (MC, CPU
domains). And I'm having trouble reproducing it on that machine as
well :-( I ran 5 soaker threads (one of them niced to -15) for a few
hours and didn't see the problem. Can you please give me some trace
data & schedstats to work with?

Looking at the patch/code, I suspect active migration in the CPU
scheduling domain pushes the nice 0 task (running on the same socket
as the nice -15 task) over to the other socket. That leaves you with an
idle core on the nice -15 socket, and with soaker threads there is no
way to get back to a 100% utilized state. One possible explanation is
that the group capacity for a sched group in the CPU sched domain is
rounded to 1 (instead of 2). I have a patch below that throws a hammer
at the problem and uses group weight instead of group capacity (this
is experimental; I will refine it if it works). Can you please see if
that solves the problem?
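
To make the rounding concrete: group_capacity comes from
DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE), so if a two-CPU
group's cpu_power drops below 1.5 * SCHED_LOAD_SCALE for whatever reason,
the group's capacity rounds to 1. Below is a minimal userspace sketch of
just that arithmetic; the DIV_ROUND_CLOSEST macro is a simplified copy
for positive values, and the cpu_power numbers are made up for
illustration, not measured on your box:

#include <stdio.h>

/* Simplified DIV_ROUND_CLOSEST for positive values, as used for group_capacity. */
#define DIV_ROUND_CLOSEST(x, divisor)	(((x) + ((divisor) / 2)) / (divisor))
#define SCHED_LOAD_SCALE		1024UL

int main(void)
{
	/* Hypothetical cpu_power values for a 2-CPU sched group. */
	unsigned long powers[] = { 2048, 1600, 1400, 1024 };
	unsigned int i;

	for (i = 0; i < sizeof(powers) / sizeof(powers[0]); i++) {
		unsigned long capacity =
			DIV_ROUND_CLOSEST(powers[i], SCHED_LOAD_SCALE);
		printf("cpu_power=%lu -> group_capacity=%lu\n",
		       powers[i], capacity);
	}

	/*
	 * Anything below 1536 (1.5 * SCHED_LOAD_SCALE) rounds to 1, so a
	 * 2-CPU group can end up treated as having room for only one task.
	 */
	return 0;
}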

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 6d934e8..3fdd669 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2057,6 +2057,7 @@ struct sg_lb_stats {
 	unsigned long sum_nr_running; /* Nr tasks running in the group */
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long group_capacity;
+	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 };

@@ -2458,6 +2459,8 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
 		DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
 	if (!sgs->group_capacity)
 		sgs->group_capacity = fix_small_capacity(sd, group);
+
+	sgs->group_weight = cpumask_weight(sched_group_cpus(group));
 }

 /**
@@ -2480,6 +2483,9 @@ static bool update_sd_pick_busiest(struct sched_domain *sd,
 	if (sgs->avg_load <= sds->max_load)
 		return false;

+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
+
 	if (sgs->sum_nr_running > sgs->group_capacity)
 		return true;