Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rqcpu usage

From: Peter Zijlstra
Date: Wed Feb 23 2011 - 08:32:44 EST


On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:

> @@ -609,6 +631,9 @@ static void update_curr(struct cfs_rq *c
> cpuacct_charge(curtask, delta_exec);
> account_group_exec_runtime(curtask, delta_exec);
> }
> +#ifdef CONFIG_CFS_BANDWIDTH
> + account_cfs_rq_quota(cfs_rq, delta_exec);
> +#endif
> }

Not too hard to make the #ifdef'ery go away I'd guess.

> static inline void
> @@ -1382,6 +1407,43 @@ static void dequeue_task_fair(struct rq
> }
>
> #ifdef CONFIG_CFS_BANDWIDTH
> +static u64 tg_request_cfs_quota(struct task_group *tg)
> +{
> + struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> + u64 delta = 0;
> +
> + if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
> + raw_spin_lock(&cfs_b->lock);
> + /*
> + * it's possible a bandwidth update has changed the global
> + * pool.
> + */
> + if (cfs_b->quota == RUNTIME_INF)
> + delta = sched_cfs_bandwidth_slice();

Why do we bother at all when there's infinite time? Shouldn't the action
that sets it to infinite also make cfs_rq->quota_assinged to to
RUNTIME_INF, in which case the below check will make it all go away?

> + else {
> + delta = min(cfs_b->runtime,
> + sched_cfs_bandwidth_slice());
> + cfs_b->runtime -= delta;
> + }
> + raw_spin_unlock(&cfs_b->lock);
> + }
> + return delta;
> +}

Also, shouldn't this all try and steal time from other cpus when the
global limit ran out? Suppose you have say 24 cpus, and you had a short
burst where all 24 cpus had some runtime, so you distribute 240ms. But
23 of those cpus only ran for 0.5ms, leaving 23.5ms of unused time on 23
cpus while your one active cpu will then throttle.

I would much rather see all the accounting tight first and optimize
later than start with leaky stuff and try and plug holes later.

> +
> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> + unsigned long delta_exec)
> +{
> + if (cfs_rq->quota_assigned == RUNTIME_INF)
> + return;
> +
> + cfs_rq->quota_used += delta_exec;
> +
> + if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> + return;
> +
> + cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> +}

So why isn't this hierarchical?, also all this positive quota stuff
looks weird, why not decrement and try to supplement when negative?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/