Re: scheduler scalability - cgroups, cpusets and load-balancing

From: Gregory Haskins
Date: Tue Jan 29 2008 - 11:48:46 EST


>>> On Tue, Jan 29, 2008 at 11:28 AM, in message
<20080129102836.be614579.pj@xxxxxxx>, Paul Jackson <pj@xxxxxxx> wrote:
> Gregory wrote:
>> I am a bit confused as to why you disable load-balancing in the
>> RT cpuset? It shouldn't be strictly necessary in order for the
>> RT scheduler to do its job (unless I am misunderstanding what you
>> are trying to accomplish?). Do you do this because you *have*
>> to in order to make real-time deadlines, or because it's just a
>> further optimization?
>
> My primary motivation for cpusets originally, and for the
> sched_load_balance flag now, was not realtime, but "soft partitioning"
> of big NUMA systems, especially for batch schedulers. They sometimes
> have large cpusets which are only being used to hold smaller, per-job,
> cpusets. It is a waste of time (CPU cycles in the kernel sched code)
> to load balance those large cpusets. Load balancing doesn't scale
> easily to high CPU counts, and it's nice to avoid doing that where
> not needed.

Understood, and that makes tons of sense.

>
> See the following lkml message for a fuller explanation:
>
> http://lkml.org/lkml/2008/1/29/85
>
> As a secondary motivation, I thought that disabling load balancing on
> the RT cpuset was the right thing to do for RT needs, but I make no
> claim to knowing much about RT.

Well, I make no claim to understand the large batch systems you work on either ;) Everything you said made a ton of sense other than the RT/load-balance thing, but I think we are on the same page now.

>
> I just now realized that you added a 'root_domain' in a patch in
> late Nov and early Dec. I was on the road then, moving from
> California to Texas, and not paying much attention to Linux.

np (though I was wondering why you had no comment before ;)

>
> A couple of questions on that patch, both involving a comment it adds
> to kernel/sched.c:
>
> /*
> * We add the notion of a root-domain which will be used to define per-domain
> * variables. Each exclusive cpuset essentially defines an island domain by
> * fully partitioning the member cpus from any other cpuset. Whenever a new
> * exclusive cpuset is created, we also create and attach a new root-domain
> * object.
> */
>
> 1) What are 'per-domain' variables?

s/per-domain/per-root-domain

>
> 2) The mention of 'exclusive cpuset' is no longer correct.
>
> With the patch 'remove sched domain hooks from cpusets' cpusets
> no longer defines sched domains using the cpu_exclusive flag.
>
> With the subsequent sched_load_balance patch (see
> http://lkml.org/lkml/2007/10/6/19) cpusets uses a new per-cpuset
> flag 'sched_load_balance' to define sched domains.

Doh! Thanks for the heads up.

>
> The following revised comment might be more accurate:
>
> /*
> * We add the notion of a root-domain which will be used to define per-domain
> * variables. Each non-overlapping sched domain defines an island domain by
> * fully partitioning the member cpus from any other cpuset. Whenever a new
> * such sched domain is created, we also create and attach a new root-domain
> * object. These non-overlapping sched domains are determined by the cpuset
> * configuration, via a call to partition_sched_domains().
> */
>
> It sounds like you (Gregory, others) want your RT CPUs to be in a sched
> domain, unlike the current way things are, where my cpuset code
> carefully avoids setting up a sched domain for those CPUs. However I
> still have need, in the batch scheduler case explained above, to have
> some CPUs not in any sched domain.
>
> If you require these RT sched domains to be set up differently somehow,
> in some way that is visible to partition_sched_domains, then that
> apparently means we need a per-cpuset flag to mark those RT cpusets.

I think we only need a plain-vanilla partition, so no flags should be necessary.
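
To make "plain-vanilla partition" concrete, here is a minimal user-space sketch of the idea as I understand it from this thread. It is not the kernel's code: `toy_cpuset`, `toy_root_domain`, `toy_partition()` and `MAX_SETS` are hypothetical names invented for illustration, and a 64-bit mask stands in for a cpumask. Cpusets with sched_load_balance set seed the partitions, overlapping ones are merged until the spans are disjoint, and each resulting span gets its own root-domain object. CPUs that appear only in cpusets with the flag cleared end up in no domain at all, which is the batch scheduler case Paul describes.

```c
#include <stdint.h>

#define MAX_SETS 16  /* hypothetical limit for this sketch */

/* Hypothetical stand-in for a cpuset: a CPU bitmask plus the flag. */
struct toy_cpuset {
    uint64_t cpus;            /* bit n set means CPU n is a member */
    int sched_load_balance;   /* nonzero: balance load across these CPUs */
};

/* Hypothetical stand-in for a root-domain: state kept per partition. */
struct toy_root_domain {
    uint64_t span;            /* CPUs covered by this partition */
};

/*
 * Build non-overlapping partitions from the load-balanced cpusets.
 * Returns the number of partitions written to out[], which must have
 * room for MAX_SETS entries.
 */
static int toy_partition(const struct toy_cpuset *sets, int nsets,
                         struct toy_root_domain *out)
{
    int n = 0;
    int merged = 1;

    /* Seed one candidate partition per load-balanced, non-empty cpuset. */
    for (int i = 0; i < nsets && n < MAX_SETS; i++) {
        if (!sets[i].sched_load_balance || sets[i].cpus == 0)
            continue;
        out[n++].span = sets[i].cpus;
    }

    /* Merge any two candidates that share a CPU, until none overlap. */
    while (merged) {
        merged = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (out[i].span & out[j].span) {
                    out[i].span |= out[j].span;
                    out[j] = out[n - 1];  /* fill the hole with the last entry */
                    n--;
                    j--;                  /* re-examine the entry moved into slot j */
                    merged = 1;
                }
            }
        }
    }
    return n;
}
```

With an RT cpuset marked as load-balanced, its CPUs simply come out as one more span in out[], with no extra per-cpuset flag needed to tell it apart from any other partition.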

-Greg

>
> If you just want an ordinary sched domain setup (just so long as it
> contains only the intended RT CPUs, not others) then I guess we don't
> technically need any more per-cpuset flags, but I'm worried, because
> the API we're presenting to users for this has just gone from subtle to
> bizarre. I suspect I'll want to add a flag anyway, if by doing so, I
> can make the kernel-user API, via cpusets, easier to understand.


