Re: [git] CFS-devel, group scheduler, fixes

From: Siddha, Suresh B
Date: Wed Sep 19 2007 - 15:36:43 EST


On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
> This patch attempts to improve CFS's SMP global fairness based on the new
> virtual time design.
>
> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
>
> Modified small_imbalance logic in find_busiest_group(). If there's small
> imbalance, move tasks from busiest to local sched_group only if the local
> group contains a CPU whose min_vruntime is the maximum among all CPUs in
> the same sched_domain. This prevents any CPU from advancing too far ahead
> in virtual time and avoids tasks thrashing between two CPUs without
> utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs,
> since the load is not evenly divisible by the number of CPUs, we want the
> extra load to have a fair use of every CPU in the system.
>
> Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task

Just as an experiment, can you run 82 tasks on 8 CPUs. Current
imbalance_pct logic will not detect and fix the global fairness issue
even with this patch.

> if (*imbalance < busiest_load_per_task) {
> - unsigned long tmp, pwr_now, pwr_move;
> - unsigned int imbn;
> -
> small_imbalance:
> - pwr_move = pwr_now = 0;
> - imbn = 2;
> - if (this_nr_running) {
> - this_load_per_task /= this_nr_running;
> - if (busiest_load_per_task > this_load_per_task)
> - imbn = 1;
> - } else
> - this_load_per_task = SCHED_LOAD_SCALE;
> -
> - if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> - busiest_load_per_task * imbn) {
> - *imbalance = busiest_load_per_task;
> - return busiest;
> - }

This patch removes quite a few lines and all this is logic is not for
fairness :( Especially the above portion handles some of the HT/MC
optimizations.

> -
> - /*
> - * OK, we don't have enough imbalance to justify moving
> tasks,
> - * however we may be able to increase total CPU power used by
> - * moving them.
> + /*
> + * When there's small imbalance, move tasks only if this
> + * sched_group contains a CPU whose min_vruntime is the
> + * maximum among all CPUs in the same domain.
> */
> -
> - pwr_now += busiest->__cpu_power *
> - min(busiest_load_per_task, max_load);
> - pwr_now += this->__cpu_power *
> - min(this_load_per_task, this_load);
> - pwr_now /= SCHED_LOAD_SCALE;
> -
> - /* Amount of load we'd subtract */
> - tmp = sg_div_cpu_power(busiest,
> - busiest_load_per_task * SCHED_LOAD_SCALE);
> - if (max_load > tmp)
> - pwr_move += busiest->__cpu_power *
> - min(busiest_load_per_task, max_load - tmp);
> -
> - /* Amount of load we'd add */
> - if (max_load * busiest->__cpu_power <
> - busiest_load_per_task * SCHED_LOAD_SCALE)
> - tmp = sg_div_cpu_power(this,
> - max_load * busiest->__cpu_power);
> - else
> - tmp = sg_div_cpu_power(this,
> - busiest_load_per_task * SCHED_LOAD_SCALE);
> - pwr_move += this->__cpu_power *
> - min(this_load_per_task, this_load + tmp);
> - pwr_move /= SCHED_LOAD_SCALE;
> -
> - /* Move if we gain throughput */
> - if (pwr_move > pwr_now)
> + if (max_vruntime_group == this)
> *imbalance = busiest_load_per_task;
> + else
> + *imbalance = 0;

Not sure how this all interacts when some of the cpu's are idle. I have to
look more closely.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/