Re: [patch] sched: fix uneven per-cpu task_group share distribution

From: Peter Zijlstra
Date: Wed Dec 17 2008 - 03:38:30 EST


On Mon, 2008-12-15 at 23:37 -0800, Ken Chen wrote:
> While testing CFS scheduler on linux-2.6-tip tree
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
>
> We found that task which is pinned to a CPU could be starved relative to its
> allocated fair share.
>
> The per-cpu sched_enetity load share calculation in tg_shares_up /
> update_group_shares_cpu distributes total task_group's share among all CPUs
> for a given SD domain, this would dilute task_group's per-cpu share because
> it distributes share to CPU that even has no load. The trapped share is now
> un-consumable and it leads to fair share starvation on the runnable CPU.
> Peter was right that it is still required for the low level function to make
> distinction between a boosted share that don't have any load and actual tg
> share that should be distributed among CPUs in which the tg is running.
>
> Patch to add that boost and we think the scheduler should only boost one
> times of tg shares over all empty CPU that don't have any load for the
> specific task_group in order to bound maximum temporary boost that a given
> task_group can have.

Yeah I was worried about you removing that boost, but since you seemed
to be doing serious testing on the thing.. A well, this sounds good.

> Signed-off-by: Ken Chen <kenchen@xxxxxxxxxx>

OK, patch looks good too, thanks!

Acked-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>

> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1d673a9..7198a26 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1465,24 +1465,34 @@ static void __set_se_shares(
> * Calculate and set the cpu's group shares.
> */
> static void
> -update_group_shares_cpu(struct task_group *tg, int cpu,
> +update_group_shares_cpu(struct task_group *tg, int cpu, int empty,
> unsigned long sd_shares, unsigned long sd_rq_weight)
> {
> - unsigned long shares;
> + unsigned long shares, raw_shares;
> unsigned long rq_weight;
>
> if (!tg->se[cpu])
> return;
>
> rq_weight = tg->cfs_rq[cpu]->rq_weight;
> -
> - /*
> - * \Sum shares * rq_weight
> - * shares = -----------------------
> - * \Sum rq_weight
> - *
> - */
> - shares = (sd_shares * rq_weight) / sd_rq_weight;
> + if (rq_weight) {
> + /*
> + * \Sum shares * rq_weight
> + * shares = -----------------------
> + * \Sum rq_weight
> + *
> + */
> + raw_shares = (sd_shares * rq_weight) / sd_rq_weight;
> + shares = raw_shares;
> + } else {
> + /*
> + * If there are currently no tasks on the cpu pretend there
> + * is one of average load so that when a new task gets to
> + * run here it will not get delayed by group starvation.
> + */
> + raw_shares = 0;
> + shares = sd_shares / empty;
> + }
> shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
>
> if (abs(shares - tg->se[cpu]->load.weight) >
> @@ -1491,7 +1501,7 @@ update_group_shares_cpu(
> unsigned long flags;
>
> spin_lock_irqsave(&rq->lock, flags);
> - tg->cfs_rq[cpu]->shares = shares;
> + tg->cfs_rq[cpu]->shares = raw_shares;
>
> __set_se_shares(tg->se[cpu], shares);
> spin_unlock_irqrestore(&rq->lock, flags);
> @@ -1508,17 +1518,12 @@ static int tg_shares_up(
> unsigned long weight, rq_weight = 0;
> unsigned long shares = 0;
> struct sched_domain *sd = data;
> - int i;
> + int i, empty = 0;
>
> for_each_cpu(i, sched_domain_span(sd)) {
> - /*
> - * If there are currently no tasks on the cpu pretend there
> - * is one of average load so that when a new task gets to
> - * run here it will not get delayed by group starvation.
> - */
> weight = tg->cfs_rq[i]->load.weight;
> if (!weight)
> - weight = NICE_0_LOAD;
> + empty++;
>
> tg->cfs_rq[i]->rq_weight = weight;
> rq_weight += weight;
> @@ -1532,7 +1537,7 @@ static int tg_shares_up(
> shares = tg->shares;
>
> for_each_cpu(i, sched_domain_span(sd))
> - update_group_shares_cpu(tg, i, shares, rq_weight);
> + update_group_shares_cpu(tg, i, empty, shares, rq_weight);
>
> return 0;
> }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/