Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage

From: Paul Turner
Date: Thu Feb 24 2011 - 22:34:30 EST

On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>> @@ -609,6 +631,9 @@ static void update_curr(struct cfs_rq *c
>>               cpuacct_charge(curtask, delta_exec);
>>               account_group_exec_runtime(curtask, delta_exec);
>>       }
>> +     account_cfs_rq_quota(cfs_rq, delta_exec);
>> +#endif
>>  }
> Not too hard to make the #ifdef'ery go away I'd guess.


>>  static inline void
>> @@ -1382,6 +1407,43 @@ static void dequeue_task_fair(struct rq
>>  }
>> +static u64 tg_request_cfs_quota(struct task_group *tg)
>> +{
>> +     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>> +     u64 delta = 0;
>> +
>> +     if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
>> +             raw_spin_lock(&cfs_b->lock);
>> +             /*
>> +              * it's possible a bandwidth update has changed the global
>> +              * pool.
>> +              */
>> +             if (cfs_b->quota == RUNTIME_INF)
>> +                     delta = sched_cfs_bandwidth_slice();
> Why do we bother at all when there's infinite time? Shouldn't the action
> that sets it to infinite also make cfs_rq->quota_assinged to to
> RUNTIME_INF, in which case the below check will make it all go away?

The request for bandwidth might be racing with an entity's request for

e.g. someone updates cpu.cfs_quota_us to infinite while there's
bandwidth distribution in flight.

In this case we need to return some sane value so that the thread
requesting bandwidth can complete that operation (releasing the lock
which will then be taken to set quota_assigned to INF).

But more importantly we don't want to decrement the value doled out
FROM cfs_b->runtime since that would change it from the magic
RUNTIME_INF. That's why the check exists.

>> +             else {
>> +                     delta = min(cfs_b->runtime,
>> +                                     sched_cfs_bandwidth_slice());
>> +                     cfs_b->runtime -= delta;
>> +             }
>> +             raw_spin_unlock(&cfs_b->lock);
>> +     }
>> +     return delta;
>> +}
> Also, shouldn't this all try and steal time from other cpus when the
> global limit ran out? Suppose you have say 24 cpus, and you had a short
> burst where all 24 cpus had some runtime, so you distribute 240ms. But
> 23 of those cpus only ran for 0.5ms, leaving 23.5ms of unused time on 23
> cpus while your one active cpu will then throttle.

In practice this only affects the first period since that slightly
stale bandwidth is then available on those other 23 cpus the next time
a micro-burst occurs. In testing this has resulted in very stable
performance and "smooth" perturbations that resemble hardcapping by
affinity (for integer points).

The notion of stealing could certainly be introduced, the juncture of
reaching the zero point here would be a possible place to consider
that (although we would need to do a steal that avoids the asymptotic
convergence problem of RT).

I think returning (most) of the bandwidth to the global pool on
(voluntary) dequeue is a more scalable solution

> I would much rather see all the accounting tight first and optimize
> later than start with leaky stuff and try and plug holes later.

The complexity this introduces is non-trivial since the idea of
returning quota to the pool means you need to introduce the notion of
when that quota came to life (otherwise you get leaks in the reverse
direction!) -- as well as reversing some of the lock ordering.

While generations do this they don't greatly increase the efficacy and
I think there is value in performing the detailed review we are doing
now in isolation of that.

It's also still consistent regarding leakage since in that in any N
consecutive periods the maximum additional quota (by a user abusing
this) that can be received is N+1. Does the complexity trade-off
merit improving this bound at this point?

>> +
>> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> +             unsigned long delta_exec)
>> +{
>> +     if (cfs_rq->quota_assigned == RUNTIME_INF)
>> +             return;
>> +
>> +     cfs_rq->quota_used += delta_exec;
>> +
>> +     if (cfs_rq->quota_used < cfs_rq->quota_assigned)
>> +             return;
>> +
>> +     cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
>> +}
> So why isn't this hierarchical?,

It is naturally (since charging occurs within the existing hierarchal

> also all this positive quota stuff
> looks weird, why not decrement and try to supplement when negative?

I thought I had a good reason for this.. at least I remember at one
point I think I did.. but I cannot see any need for it in the current
form. I will revise it to a single decremented quota_remaining.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at