Re: [PATCH v2] sched: make update_sd_pick_busiest return true on a busier sd

From: Peter Zijlstra
Date: Mon Jul 28 2014 - 10:30:25 EST

On Fri, Jul 25, 2014 at 03:32:10PM -0400, Rik van Riel wrote:
> Subject: sched: make update_sd_pick_busiest return true on a busier sd
> Currently update_sd_pick_busiest only identifies the busiest sd
> that is either overloaded, or has a group imbalance. When no
> sd is imbalanced or overloaded, the load balancer fails to find
> the busiest domain.
> This breaks load balancing between domains that are not overloaded,
> in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
> return true when the busiest sd yet is encountered.
> Behaviour for SD_ASYM_PACKING does not seem to match the comment,
> but I have no hardware to test that so I have left the behaviour
> of that code unchanged.
> It is unclear what to do with the group_imb condition.
> Should group_imb override a busier load? If so, should we fix
> calculate_imbalance to return a sensible number when the "busiest"
> node found has a below average load? We probably need to fix
> calculate_imbalance anyway, to deal with an overloaded group that
> happens to have a below average load...

I think we want overloaded > group_imb > other. So prefer overloaded
groups, imbalanced groups if no overloaded and anything else if no
overloaded and imbalanced thingies.

If you look at the comment near sg_imbalanced(), in that case we want to
move tasks from the first group to the second, even though the second
group would be the heaviest.

enum group_type {
group_other = 0,

static enum group_type group_classify(struct sg_lb_stats *sgs)
if (sgs->sum_nr_running > sgs->group_capacity_factor)
return group_overloaded;

if (sgs->group_imb)
return group_imbalanced;

return group_other;

> /**
> * update_sd_pick_busiest - return 1 on busiest group
> * @env: The load balancing environment.
> @@ -5957,7 +5962,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> * @sgs: sched_group statistics
> *
> * Determine if @sg is a busier group than the previously selected
> - * busiest group.
> + * busiest group.

We really need that extra trailing whitespace, yes? ;-)

> * Return: %true if @sg is a busier group than the previously selected
> * busiest group. %false otherwise.
> @@ -5967,13 +5972,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> struct sched_group *sg,
> struct sg_lb_stats *sgs)
> {

if (group_classify(sgs) < group_classify(&sds->busiest_stats))
return false;

> if (sgs->avg_load <= sds->busiest_stat.avg_load)
> return false;

> + /* This is the busiest node. */
> + if (!(env->sd->flags & SD_ASYM_PACKING))
> return true;
> /*

We could replace sg_lb_stats::group_imb with the above and avoid the
endless recomputation I suppose.

Also, we still need a little change to calculate_imbalance() where we
assume sum_nr_running > group_capacity_factor.

Attachment: pgpRgIv1sT4ff.pgp
Description: PGP signature