Re: scheduler scalability - cgroups, cpusets and load-balancing

From: Paul Jackson
Date: Tue Jan 29 2008 - 11:29:27 EST


Gregory wrote:
> I am a bit confused as to why you disable load-balancing in the
> RT cpuset? It shouldn't be strictly necessary in order for the
> RT scheduler to do its job (unless I am misunderstanding what you
> are trying to accomplish?). Do you do this because you *have*
> to in order to make real-time deadlines, or because it's just a
> further optimization?

My primary motivation for cpusets originally, and for the
sched_load_balance flag now, was not realtime, but "soft partitioning"
of big NUMA systems, especially for batch schedulers. They sometimes
have large cpusets which are only being used to hold smaller, per-job,
cpusets. It is a waste of time (CPU cycles in the kernel sched code)
to load balance those large cpusets. Load balancing doesn't scale
easily to high CPU counts, and it's nice to avoid doing that where
not needed.
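
As a rough illustration of that batch-scheduler layout, here is a minimal
Python sketch. It assumes the legacy cpuset filesystem (conventionally
mounted at /dev/cpuset, with per-cpuset files 'cpus', 'mems' and
'sched_load_balance'). The function names, cpuset names and CPU/memory
ranges are hypothetical, chosen only for illustration. The root directory
is a parameter so the sketch can be tried against a scratch directory
instead of a live system.

```python
import os

def write_flag(cpuset_dir, name, value):
    """Write one control file inside a cpuset directory."""
    with open(os.path.join(cpuset_dir, name), "w") as f:
        f.write(str(value))

def setup_batch_layout(root, holder, holder_cpus, holder_mems, jobs):
    """Create a large holding cpuset with load balancing off, plus
    per-job child cpusets with load balancing on.

    root   -- mount point of the cpuset filesystem (assumed /dev/cpuset)
    holder -- name of the large cpuset that only holds per-job cpusets
    jobs   -- maps a job name to a (cpus, mems) pair of list-format
              strings, e.g. ("0-3", "0")
    """
    # The top cpuset must not request balancing either; otherwise a
    # single sched domain would span every CPU in the system.
    write_flag(root, "sched_load_balance", 0)

    holder_dir = os.path.join(root, holder)
    os.makedirs(holder_dir, exist_ok=True)
    write_flag(holder_dir, "cpus", holder_cpus)
    write_flag(holder_dir, "mems", holder_mems)
    # No tasks run directly in the holder, so balancing it is wasted work.
    write_flag(holder_dir, "sched_load_balance", 0)

    for job, (cpus, mems) in jobs.items():
        job_dir = os.path.join(holder_dir, job)
        os.makedirs(job_dir, exist_ok=True)
        write_flag(job_dir, "cpus", cpus)
        write_flag(job_dir, "mems", mems)
        # Each job still wants balancing across its own CPUs.
        write_flag(job_dir, "sched_load_balance", 1)
```

Against a real system this would be called with root="/dev/cpuset" and
would need root privileges; a batch scheduler would then attach each
job's tasks to the matching per-job cpuset.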

See the following lkml message for a fuller explanation:

http://lkml.org/lkml/2008/1/29/85

As a secondary motivation, I thought that disabling load balancing on
the RT cpuset was the right thing to do for RT needs, but I make no
claim to knowing much about RT.

I just now realized that you added a 'root_domain' in a patch in
late Nov and early Dec. I was on the road then, moving from
California to Texas, and not paying much attention to Linux.

A couple of questions on that patch, both involving a comment it adds
to kernel/sched.c:

/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
* fully partitioning the member cpus from any other cpuset. Whenever a new
* exclusive cpuset is created, we also create and attach a new root-domain
* object.
*/

1) What are 'per-domain' variables?

2) The mention of 'exclusive cpuset' is no longer correct.

With the patch 'remove sched domain hooks from cpusets', cpusets
no longer defines sched domains using the cpu_exclusive flag.

With the subsequent sched_load_balance patch (see
http://lkml.org/lkml/2007/10/6/19) cpusets uses a new per-cpuset
flag 'sched_load_balance' to define sched domains.

The following revised comment might be more accurate:

/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each non-overlapping sched domain defines an island domain by
* fully partitioning the member cpus from any other cpuset. Whenever such
* a sched domain is created, we also create and attach a new root-domain
* object. These non-overlapping sched domains are determined by the cpuset
* configuration, via a call to partition_sched_domains().
*/

It sounds like you (Gregory, others) want your RT CPUs to be in a sched
domain, unlike the current way things are, where my cpuset code
carefully avoids setting up a sched domain for those CPUs. However I
still have need, in the batch scheduler case explained above, to have
some CPUs not in any sched domain.
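
To make the "some CPUs not in any sched domain" case concrete, here is a
simplified userspace model in Python. It is not the kernel code. It
assumes that cpusets with sched_load_balance enabled each request a
sched domain covering their CPUs, that overlapping requests are merged
into one domain, and that CPUs covered by no such cpuset end up in no
sched domain. The function name and the data layout are hypothetical.

```python
def sched_domain_partition(cpusets):
    """Model the non-overlapping sched domains implied by a cpuset setup.

    cpusets -- iterable of (cpus, sched_load_balance) pairs, where cpus
               is a set of CPU numbers and the flag is a bool.
    Returns a list of disjoint frozensets, ordered by lowest CPU number.
    """
    domains = []  # invariant: the sets in this list are pairwise disjoint
    for cpus, balance in cpusets:
        if not balance or not cpus:
            continue  # this cpuset requests no sched domain
        merged = set(cpus)
        kept = []
        for dom in domains:
            if dom & merged:
                merged |= dom     # overlapping requests share one domain
            else:
                kept.append(dom)  # untouched, still disjoint from merged
        kept.append(merged)
        domains = kept
    return sorted((frozenset(d) for d in domains), key=min)


# Batch case: a large holding cpuset with balancing off, two jobs with it on.
layout = [
    (set(range(8)), False),  # holder, CPUs 0-7, not balanced
    ({0, 1, 2, 3}, True),    # hypothetical job A
    ({4, 5}, True),          # hypothetical job B
]
domains = sched_domain_partition(layout)
covered = set().union(*domains)
isolated = set(range(8)) - covered  # CPUs left in no sched domain
```

In this model CPUs 6 and 7 end up in no sched domain, which is the
situation the batch scheduler case relies on. An RT cpuset with its flag
enabled would simply show up as one more entry in the returned list.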

If you require these RT sched domains to be set up differently somehow,
in some way that is visible to partition_sched_domains, then that
apparently means we need a per-cpuset flag to mark those RT cpusets.

If you just want an ordinary sched domain setup (just so long as it
contains only the intended RT CPUs, not others) then I guess we don't
technically need any more per-cpuset flags, but I'm worried, because
the API we're presenting to users for this has just gone from subtle to
bizarre. I suspect I'll want to add a flag anyway, if by doing so, I
can make the kernel-user API, via cpusets, easier to understand.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.940.382.4214