Re: [RFC] sched: Limit idle_balance() when it is being used toofrequently

From: Jason Low
Date: Wed Jul 17 2013 - 11:59:11 EST


Hi Peter,

On Wed, 2013-07-17 at 11:39 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote:
> > For the more complex model, are you suggesting that each completion time
> > is the time it takes to complete 1 iteration of the for_each_domain()
> > loop?
>
> Per sd, yes? So higher domains (or lower depending on how you model the thing
> in you head) have bigger CPU spans, and thus take longer to complete. Imagine
> the top domain of a 4096 cpu system, it would go look at all cpus to see if it
> could find a task.
>
> > Based on some of the data I collected, a single iteration of the
> > for_each_domain() loop is almost always significantly lower than the
> > approximate CPU idle time, even in workloads where idle_balance is
> > lowering performance. The bigger issue is that it takes so many of these
> > attempts before idle_balance actually "worked" and pulls a tasks.
>
> I'm confused, so:
>
> schedule()
> if (!rq->nr_running)
> idle_balance()
> for_each_domain(sd)
> load_balance(sd)
>
> is the entire thing, there's no other loop in there.

So if we have the following:

for_each_domain(sd)
before = sched_clock_cpu
load_balance(sd)
after = sched_clock_cpu
idle_balance_completion_time = after - before

At this point, the "idle_balance_completion_time" is usually a very
small value and is usually a lot smaller than the avg CPU idle time.
However, the vast majority of the time, load_balance returns 0.

> > I initially was thinking about each "completion time" of an idle balance
> > as the sum total of the times of all iterations to complete until a task
> > is successfully pulled within each domain.
>
> So you're saying that normally idle_balance() won't find a task to pull? And we
> need many times going newidle before we do get something?

Yes, a while ago, I collected some data on the rate in which
idle_balance() does not pull tasks, and it was a very high number.

> Wouldn't this mean that there simply weren't enough tasks to keep all cpus busy?

If I remember correctly, in a lot of those load_balance attempts when
the machine is under a high Java load, there were no "imbalance" between
the groups in each sched_domain.

> If there were tasks we could've pulled, we might need to look at why they
> weren't and maybe fix that. Now it could be that it things this cpu, even with
> the (little) idle time it has is sufficiently loaded and we'll get a 'local'
> wakeup soon enough. That's perfectly fine.
>
> What we should avoid is spending more time looking for tasks then we have idle,
> since that reduces the total time we can spend doing useful work. So that is I
> think the critical cut-off point.

Do you think its worth a try to consider each newidle balance attempt as
the total load_balance attempts until it is able to move a task, and
then skip balancing within the domain if a CPU's avg idle time is less
than that avg time doing newidle balance?

Thanks,
Jason

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/