Re: scheduler scalability - cgroups, cpusets and load-balancing

From: Peter Zijlstra
Date: Tue Jan 29 2008 - 06:32:53 EST



On Tue, 2008-01-29 at 05:13 -0600, Paul Jackson wrote:
> Peter wrote:
> > Thanks for the link. Yes I think your last suggestion of creating
> > rt-domains ( http://lkml.org/lkml/2007/10/23/419 ) is a good one.
>
> We now have a per-cpuset Boolean flag file called 'sched_load_balance'.

SD_LOAD_BALANCE, right?

> In the default case, this flag is set on, and the kernel does its
> usual load balancing across all CPUs in that cpuset. This means, under
> the covers, that there exists some sched domain such that all CPUs in
> that cpuset are in that same sched domain. That sched domain might
> contain additional CPUs from outside that cpuset as well. Indeed,
> in the default vanilla configuration, that sched domain contains all
> CPUs in the system.
>
> If we turn the sched_load_balance flag off for some cpuset, we are
> telling the kernel it's ok not to load balance on the CPUs in that
> cpuset (unless those CPUs are in some other cpuset that needed load
> balancing anyway.)
>
> This 'sched_load_balance' flag is, thus far, "the" cpuset hook
> supporting realtime. One can use it to configure a system so that
> the kernel does not do normal load balancing on select CPUs, such
> as those CPUs dedicated to realtime use.

Ah, here I disagree. It is possible to do (hard) realtime scheduling
over multiple cpus; the only drawback is that it requires a very strong
load-balancer, which makes it unsuitable for large numbers of cpus.

( of course, having a strong rt load balancer on a large cpuset does no
harm, as long as there are no rt tasks to balance )

So if we have a system like so:

        __A__
       /  |  \
     B1   B2  B3
     /\
    /  \
  C1    C2

A comprises cpus 0-127, !SD_LOAD_BALANCE

B1 comprises cpus 0-63, !SD_LOAD_BALANCE
B2 comprises cpus 64-119
B3 comprises cpus 120-127

C1 comprises cpus 0-3
C2 comprises cpus 5-63

We end up with 4 disjoint load-balanced sets.

I would then attach the rt balance information to: C1, C2, B2, B3.

If, for example, B1 were load-balanced, we'd only have 3 disjoint
sets left: B1, B2 and B3, and the rt balance data would be attached there.
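
To make the example concrete, here is a rough sketch of how those
disjoint sets could be found. It is plain Python for illustration only,
not kernel code. The function name disjoint_balanced_sets and the
name -> (cpus, balanced) layout are made up for this mail; the only
assumption about behaviour is that load-balanced cpusets which share a
cpu end up in one set.

```python
def disjoint_balanced_sets(cpusets):
    """Merge overlapping load-balanced cpusets into disjoint cpu sets.

    cpusets maps a name to (iterable of cpu ids, load-balance flag).
    This layout is hypothetical and only serves the example.
    Returns a list of (names, cpus) pairs sorted by lowest cpu id.
    """
    merged = []  # invariant: the cpu sets in here are pairwise disjoint
    for name, (cpus, balanced) in cpusets.items():
        group = set(cpus)
        if not balanced or not group:
            continue  # !SD_LOAD_BALANCE or empty: contributes nothing
        names = [name]
        rest = []
        for other_names, other_cpus in merged:
            if other_cpus & group:
                # Shares a cpu with the new cpuset: fold it in.
                group |= other_cpus
                names = other_names + names
            else:
                rest.append((other_names, other_cpus))
        rest.append((names, group))
        merged = rest
    return sorted(merged, key=lambda entry: min(entry[1]))


# The hierarchy from this mail; only the leaves and B2/B3 are balanced.
example = {
    "A": (range(0, 128), False),
    "B1": (range(0, 64), False),
    "B2": (range(64, 120), True),
    "B3": (range(120, 128), True),
    "C1": (range(0, 4), True),
    "C2": (range(5, 64), True),
}

if __name__ == "__main__":
    for names, cpus in disjoint_balanced_sets(example):
        print("+".join(names), min(cpus), "-", max(cpus))
```

With the hierarchy above this gives four sets (C1, C2, B2, B3). Marking
B1 as balanced folds C1 and C2 into it and leaves three, and each
resulting set is where the rt balance data would hang.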

> It sounds like Peter is reminding us that we really have three choices
> for handling a given CPU's load balancing:
> 1) normal kernel scheduler load balancing,
> 2) RT load balancing, or
> 3) no load balancing whatsoever.
>
> If that's the case (if we really need choice 3) then a single Boolean
> flag, such as sched_load_balance, is not sufficient to select from
> the three choices, and it might make sense to add a second per-cpuset
> Boolean flag, say "sched_rt_balance", default off, which if turned on,
> enabled choice 2.
>
> If that's not the case (we only need choices 1 and 2) then -logically-
> we could overload the meaning of the current sched_load_balance,
> to mean, if turned off, not only to stop doing normal balancing, but
> to further mean that we should commence RT balancing. However bits
> aren't -that- precious here, and this sounds unnecessarily confusing.
>
> So ... would a new per-cpuset Boolean flag such as sched_rt_balance be
> appropriate and sufficient to mark those cpusets whose set of CPUs
> required RT balancing?

So, I don't think we need that. I think we can make do with the single
flag; we just need to find these disjoint sets and stick our rt-domain there.

