Re: [Lse-tech] [PATCH] cpusets - big numa cpu and memory placement

From: Nick Piggin
Date: Mon Oct 11 2004 - 18:25:53 EST


Matthew Dobson wrote:
On Fri, 2004-10-08 at 17:18, Nick Piggin wrote:

Matthew Dobson wrote:

I think this example is easily achievable with the sched_domains
modifications I am proposing. You can still create your 128 CPU
exclusive domain, called big_domain (due to my lack of naming
creativity), and further divide big_domain into smaller, non-exclusive
sched_domains. We do this all the time, albeit statically at boot time,
with the current sched_domains code. For example, we create a 4-node
domain on IA64, and underneath it we create 4 1-node domains. We've now
partitioned the system into 4 sched_domains, each containing 4 cpus.
Balancing between these 4 node-level sched_domains is allowed, but can
be disallowed by not setting the SD_LOAD_BALANCE flag. Your example
does show that it can be more than just a convenient way to group tasks,
but your example can be done with what I'm proposing.
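
A minimal sketch of the partitioning described above, assuming 4 nodes of 4 cpus each and a plain 32-bit mask per domain. The toy_* names and the TOY_* constants are hypothetical and are not the kernel's sched_domains structures; the point is only that the node-level spans are disjoint and that their union is the span of the parent domain.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed topology for illustration: 4 nodes, 4 cpus per node. */
#define TOY_NODES         4u
#define TOY_CPUS_PER_NODE 4u

/* CPU mask spanned by one node-level domain (hypothetical helper). */
uint32_t toy_node_span(unsigned int node)
{
    assert(node < TOY_NODES);
    uint32_t per_node = (1u << TOY_CPUS_PER_NODE) - 1u;   /* 0xF */
    return per_node << (node * TOY_CPUS_PER_NODE);
}

/* CPU mask spanned by the parent domain: the union of its children. */
uint32_t toy_top_span(void)
{
    uint32_t span = 0;
    for (unsigned int n = 0; n < TOY_NODES; n++)
        span |= toy_node_span(n);
    return span;
}
```

Here toy_node_span(1) covers cpus 4-7 only, and toy_top_span() covers all 16 cpus.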

You wouldn't be able to do this just with sched domains, because
it doesn't know anything about individual tasks. As soon as you
have some overlap, all your tasks can escape out of your domain.

I don't think there is a really nice way to do overlapping sets.
Those that want them need to just use cpu affinity for now.


Well, the tasks can escape out of the domain iff you have the
SD_LOAD_BALANCE flag set on that domain. If SD_LOAD_BALANCE isn't set,
then when the scheduler tick goes off, and the code looks at the domain,
it will see the lack of the flag and will not attempt to balance the
domain, correct? This is what we currently do with the 'isolated'
domains, right?


Yeah that's right. Well you have to remove some of the other SD_
flags as well (eg. SD_BALANCE_EXEC, SD_WAKE_BALANCE).
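
A rough sketch of the flag checks being discussed. The flag names are the ones mentioned in this thread, but the numeric values, struct toy_domain, and both helper functions are assumptions made for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names as used in the thread; the bit values here are assumed. */
enum {
    SD_LOAD_BALANCE = 1 << 0,   /* periodic balancing on the tick */
    SD_BALANCE_EXEC = 1 << 1,   /* balancing at exec time */
    SD_WAKE_BALANCE = 1 << 2,   /* balancing at wakeup */
};

/* Hypothetical stand-in for a sched domain: only the flags matter here. */
struct toy_domain {
    unsigned int flags;
};

/* The tick-time balancer skips a domain that lacks SD_LOAD_BALANCE. */
bool toy_should_balance_on_tick(const struct toy_domain *sd)
{
    assert(sd != NULL);
    return (sd->flags & SD_LOAD_BALANCE) != 0;
}

/* Tasks stay put only if every balancing path is switched off. */
bool toy_domain_is_isolated(const struct toy_domain *sd)
{
    const unsigned int mask =
        SD_LOAD_BALANCE | SD_BALANCE_EXEC | SD_WAKE_BALANCE;

    assert(sd != NULL);
    return (sd->flags & mask) == 0;
}
```

A domain with only SD_BALANCE_EXEC left set is skipped on the tick but is not isolated in this sketch, since a task could still move at exec time.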

But I don't think there is much point in overlapping sets which
don't do any balancing. They might as well not exist at all.

You're right that you can get some of the more obscure semantics of the
various flavors of cpusets by leveraging sched_domains AND
cpus_allowed. I don't have any desire to remove that ability, just keep
it as the exception.


I think at this stage, overlapping cpu sets are the exception. It
is pretty logical that they're going to require some per-task info,
because the balancer can't otherwise differentiate between two tasks
on the same runqueue but in different cpu sets.
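
To illustrate why per-task info is needed, here is a small sketch with two tasks that could sit on the same runqueue but belong to overlapping cpu sets. The field name cpus_allowed follows the discussion above; struct toy_task, the 64-bit mask, and the helper functions are hypothetical simplifications, not the kernel's types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CPUS 64u   /* assumed limit so a mask fits in 64 bits */

/* Hypothetical task: only the per-task allowed-cpu mask is modelled. */
struct toy_task {
    uint64_t cpus_allowed;
};

/* Build a mask covering cpus first..last inclusive. */
uint64_t toy_cpu_range(unsigned int first, unsigned int last)
{
    uint64_t mask = 0;

    assert(first <= last && last < TOY_MAX_CPUS);
    for (unsigned int c = first; c <= last; c++)
        mask |= (uint64_t)1 << c;
    return mask;
}

/* The balancer has to ask each task, not the runqueue, where it may go. */
bool toy_can_migrate(const struct toy_task *p, unsigned int dst_cpu)
{
    assert(p != NULL);
    if (dst_cpu >= TOY_MAX_CPUS)
        return false;
    return ((p->cpus_allowed >> dst_cpu) & 1u) != 0;
}
```

With set A on cpus 0-3 and set B on cpus 2-5, both tasks may run on cpu 2, yet only the task in A may be pulled to cpu 0 and only the task in B to cpu 5. The runqueue alone cannot tell them apart.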

sched-domains gives you a nice clean way to do exclusive partitioning,
and I can't imagine it would be too common to want to do overlapping
partitioning.