Re: [PATCH 2/3] sched: enforce per-cpu utilization limits on runtime balancing

From: Peter Zijlstra
Date: Thu Feb 25 2010 - 16:44:06 EST


On Tue, 2010-02-23 at 19:56 +0100, Fabio Checconi wrote:

> /*
> + * Reset the balancing machinery, restarting from a safe runtime assignment
> + * on all the cpus/rt_rqs in the system. There is room for improvements here,
> + * as this iterates through all the rt_rqs in the system; the main problem
> + * is that after the balancing has been running for some time we are not
> + * sure that the fragmentation of the free bandwidth it produced allows new
> + * groups to run where they need to run. The caller has to make sure that
> + * only one instance of this function is running at any time.
> */
> +static void __rt_reset_runtime(void)
> {
> + struct rq *rq;
> + struct rt_rq *rt_rq;
> + struct rt_bandwidth *rt_b;
> + unsigned long flags;
> + int i;
> +
> + for_each_possible_cpu(i) {
> + rq = cpu_rq(i);
> +
> + rq->rt_balancing_disabled = 1;
> + /*
> + * Make sure that all the new calls to do_balance_runtime()
> + * see the disable flag and do not migrate anything. We will
> + * implicitly wait for the old ones to terminate by taking each
> + * rt_b->rt_runtime_lock in turn. Note that maybe iterating over
> + * the task_groups first would be a good idea...
> + */
> + smp_wmb();
> +
> + for_each_leaf_rt_rq(rt_rq, rq) {
> + rt_b = sched_rt_bandwidth(rt_rq);
> +
> + raw_spin_lock_irqsave(&rt_b->rt_runtime_lock, flags);
> + raw_spin_lock(&rt_rq->rt_runtime_lock);
> + rt_rq->rt_runtime = rt_b->rt_runtime;
> + rt_rq->rt_period = rt_b->rt_period;
> + rt_rq->rt_time = 0;
> + raw_spin_unlock(&rt_rq->rt_runtime_lock);
> + raw_spin_unlock_irqrestore(&rt_b->rt_runtime_lock, flags);
> + }
> + }
> +}


> +/*
> + * Handle runtime rebalancing: try to push our bandwidth to
> + * runqueues that need it.
> + */
> +static void do_balance_runtime(struct rt_rq *rt_rq)
> +{
> + struct rq *rq = cpu_rq(smp_processor_id());
> + struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
> + struct root_domain *rd = rq->rd;
> + int i, weight, ret;
> + u64 rt_period, prev_runtime;
> + s64 diff;
> +
> weight = cpumask_weight(rd->span);
>
> raw_spin_lock(&rt_b->rt_runtime_lock);
> + /*
> + * The raw_spin_lock() acts as an acquire barrier, ensuring
> + * that rt_balancing_disabled is accessed after taking the lock;
> + * since rt_reset_runtime() takes rt_runtime_lock after
> + * setting the disable flag we are sure that no bandwidth
> + * is migrated while the reset is in progress.
> + */

Note that LOCK != {RMB,MB}; what you can do is order the WMB with the
UNLOCK+LOCK pair (== MB).

I'm thinking the WMB above is superfluous: either we are already in
do_balance() and __rt_reset_runtime() will wait for us, or
__rt_reset_runtime() will have done a LOCK+UNLOCK after setting
->rt_balancing_disabled, and we will have done a LOCK before the read.

So we always have at least store+UNLOCK+LOCK+load, which can never be
reordered.

IOW, look at it as if the store leaks into the rt_b->rt_runtime_lock
section; in that case the lock properly serializes the store and these
loads.
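
To make that concrete, here is a sketch of the two paths involved (only a
sketch, reusing the lock and flag names from the patch; the
irqsave/irqrestore flavour doesn't matter for the argument):

	/* __rt_reset_runtime(), CPU A */
	rq->rt_balancing_disabled = 1;				/* store  */
	raw_spin_lock_irqsave(&rt_b->rt_runtime_lock, flags);	/* LOCK   */
	/* ... per-rt_rq reset ... */
	raw_spin_unlock_irqrestore(&rt_b->rt_runtime_lock, flags); /* UNLOCK */

	/* do_balance_runtime(), CPU B */
	raw_spin_lock(&rt_b->rt_runtime_lock);			/* LOCK   */
	if (rq->rt_balancing_disabled)				/* load   */
		goto out;

Either CPU B took the lock first, in which case it may still see the flag
clear but __rt_reset_runtime() waits behind the same lock for its critical
section to finish, or CPU A's store+UNLOCK are ordered before CPU B's
LOCK+load and the flag is guaranteed to be visible.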

> + if (rq->rt_balancing_disabled)
> + goto out;

( maybe call that label unlock )
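
That is, something like the below (just the label rename; assuming the
label does nothing more than drop rt_b->rt_runtime_lock, which the excerpt
doesn't show):

	if (rq->rt_balancing_disabled)
		goto unlock;
	...
unlock:
	raw_spin_unlock(&rt_b->rt_runtime_lock);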

> +
> + prev_runtime = rt_rq->rt_runtime;
> rt_period = ktime_to_ns(rt_b->rt_period);
> +
> for_each_cpu(i, rd->span) {
> struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
> + struct rq *iter_rq = rq_of_rt_rq(iter);
>
> if (iter == rt_rq)
> continue;

The same ordering argument as above applies to this check.

> + if (iter_rq->rt_balancing_disabled)
> + continue;
> +
> raw_spin_lock(&iter->rt_runtime_lock);
> /*
> * Either all rqs have inf runtime and there's nothing to steal


