Re: [PATCH v9 08/10] sched: replace capacity_factor by usage

From: Morten Rasmussen
Date: Fri Nov 21 2014 - 07:36:35 EST


On Mon, Nov 03, 2014 at 04:54:45PM +0000, Vincent Guittot wrote:
> The scheduler tries to compute how many tasks a group of CPUs can handle by
> assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
> SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of the group
> by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
> compares this value with the sum of nr_running to decide whether the group is
> overloaded. But group_capacity_factor hardly works for SMT systems; it
> sometimes works for big cores but fails to do the right thing for little
> cores.
>
> Below are two examples to illustrate the problem that this patch solves:
>
> 1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
> (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2
> (div_round_closest(3*640/1024) = 2), which means that it will be seen as
> overloaded even if we have only one task per CPU.
>
> 2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
> (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
> (at max, thanks to the fix [0] for SMT systems that prevents ghost CPUs from
> appearing), but if one CPU is fully used by rt tasks (and its capacity is
> reduced to nearly nothing), the capacity factor of the group will still be 4
> (div_round_closest(3*1512/1024) = 4, since 4536/1024 is about 4.43).
>
> So, this patch tries to solve this issue by removing capacity_factor and
> replacing it with the following 2 metrics:
> - The available CPU capacity for CFS tasks, which is already used by
>   load_balance.
> - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
>   has been re-introduced to compute the usage of a CPU by CFS tasks.
>
> group_capacity_factor and group_has_free_capacity have been removed and replaced
> by group_no_capacity. We compare the number of tasks with the number of CPUs and
> we evaluate the level of utilization of the CPUs to decide whether a group is
> overloaded or has capacity to handle more tasks.
>
> For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
> so it will be selected in priority (among the overloaded groups). Since [1],
> SD_PREFER_SIBLING is no longer affected by the computation of
> load_above_capacity because the local group is not overloaded.
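
For reference, the arithmetic in the two examples above can be reproduced
with a small stand-alone sketch. This is not kernel code: capacity_factor()
is a hypothetical helper that only mimics the old computation (closest
rounding of group capacity over SCHED_CAPACITY_SCALE, capped at the number
of CPUs as the fix [0] is described to do), and the value 1024 for
SCHED_CAPACITY_SCALE is taken from the examples.

```c
#define SCHED_CAPACITY_SCALE 1024UL

/* Unsigned round-to-nearest division, the same idea as DIV_ROUND_CLOSEST. */
static unsigned long div_round_closest(unsigned long x, unsigned long d)
{
	return (x + d / 2) / d;
}

/*
 * Hypothetical model of the old capacity factor: how many "full" tasks
 * the group is assumed to hold, never more than the number of CPUs.
 */
static unsigned long capacity_factor(unsigned long group_capacity,
				     unsigned long nr_cpus)
{
	unsigned long factor;

	factor = div_round_closest(group_capacity, SCHED_CAPACITY_SCALE);

	return factor < nr_cpus ? factor : nr_cpus;
}
```

capacity_factor(3 * 640, 3) gives 2, so three little CPUs running one task
each already look overloaded. capacity_factor(4 * 1512, 4) gives 4 after
the cap, and capacity_factor(3 * 1512, 4) also gives 4 (4536/1024 is about
4.43), so losing a whole CPU to rt tasks does not change the factor.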

[...]

> @@ -6213,17 +6207,20 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> /*
> * In case the child domain prefers tasks go to siblings
> - * first, lower the sg capacity factor to one so that we'll try
> + * first, lower the sg capacity so that we'll try
> * and move all the excess tasks away. We lower the capacity
> * of a group only if the local group has the capacity to fit
> - * these excess tasks, i.e. nr_running < group_capacity_factor. The
> - * extra check prevents the case where you always pull from the
> - * heaviest group when it is already under-utilized (possible
> - * with a large weight task outweighs the tasks on the system).
> + * these excess tasks. The extra check prevents the case where
> + * you always pull from the heaviest group when it is already
> + * under-utilized (possible with a large weight task outweighs
> + * the tasks on the system).
> */
> if (prefer_sibling && sds->local &&
> - sds->local_stat.group_has_free_capacity)
> - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
> + group_has_capacity(env, &sds->local_stat) &&
> + (sgs->sum_nr_running > 1)) {
> + sgs->group_no_capacity = 1;
> + sgs->group_type = group_overloaded;
> + }

I'm still a bit confused about SD_PREFER_SIBLING. What is the flag
supposed to do and why?

It looks like a weak load balancing bias attempting to consolidate tasks
on domains with spare capacity. It does so by marking non-local groups
as overloaded regardless of their actual load if the local group has
spare capacity. Correct?
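
To check that I read the hunk correctly, here is a stand-alone model of the
new condition. It is only a sketch: struct sg_stats is a reduced,
hypothetical stand-in for the real per-group statistics, and the
local_has_capacity argument stands for the combined "sds->local &&
group_has_capacity(env, &sds->local_stat)" test, whose details are not
shown in the quoted hunk.

```c
#include <stdbool.h>

enum group_type { group_other = 0, group_imbalanced, group_overloaded };

/* Hypothetical, reduced stand-in for the per-group load-balance stats. */
struct sg_stats {
	unsigned int sum_nr_running;	/* tasks running in the group */
	bool group_no_capacity;		/* group treated as out of capacity */
	enum group_type group_type;
};

/*
 * Model of the prefer_sibling hunk: a non-local group with more than one
 * task is tagged overloaded whenever the local group still has room,
 * whatever the group's actual usage is.
 */
static void apply_prefer_sibling(bool prefer_sibling, bool local_has_capacity,
				 struct sg_stats *sgs)
{
	if (prefer_sibling && local_has_capacity &&
	    sgs->sum_nr_running > 1) {
		sgs->group_no_capacity = true;
		sgs->group_type = group_overloaded;
	}
}
```

In this model a group with two tasks becomes group_overloaded as soon as the
local group has capacity, while a group with a single task is left alone.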

In patch 9 this behaviour is enabled for SMT level domains, which
implies that tasks will be consolidated in MC groups, that is, we prefer
multiple tasks on sibling cpus (hw threads). I must be missing something
essential. I was convinced that we wanted to avoid using sibling cpus on
SMT systems as much as possible?
