Re: [RFC] cpuset: remove sched domain hooks from cpusets
From: Paul Jackson
Date: Sun Oct 22 2006 - 08:05:01 EST
Dinikar wrote:
> On Fri, Oct 20, 2006 at 02:41:53PM -0700, Paul Jackson wrote:
> >
> > > 2. The main change is that we dont allow tasks to be added to a cpuset
> > > if it has child cpusets that also have the sched_domain flag turned on
> > > (Maybe return a EINVAL if the user tries to do that)
> >
> > This I would not like. It's ok to have tasks in cpusets that are
> > cut by sched domain partitions (which is what I think you were getting
> > at), just so long as one doesn't mind that they don't load balance
> > across the partition boundaries.
> >
> > For example, we -always- have several tasks per-cpu in the top cpuset.
> > These are the per-cpu kernel threads. They have zero interest in
> > load balancing, because they are pinned on a cpu, for their life.
>
> I cannot think of any reason why this change would affect per-cpu tasks.
You are correct that cpu pinned tasks don't mind not being load balanced.
I was reading too much into what you had suggested.
You suggested not adding tasks to cpusets that have child cpusets
defining sched domains.
But the fate of tasks that were already there was an open issue.
That kind of ordering dependency struck me as so odd that I
leapt to the false conclusion that you couldn't possibly have
meant it, and really mean to disallow any tasks in such a cpuset,
whether there before we setup the sched_domain or not.
So ... nevermind that comment.
Popping the stack, yes, as a practical matter, when setting up
some nodes to be used by real time software, we offload everything
else we can from those nodes, to minimize the interference.
We don't need more error conditions from the kernel to handle that,
not even on big systems. We just need some way to isolate those
nodes (cpu and memory) from any scheduler load balancing efforts.
This is easy. It can be done now at boottime (isolcpus=), and at
runtime with the existing cpu_exclusive overloading. It essentially
just involves setting a single system wide cpumask of isolated cpus.
In addition large systems (apparently for scheduler load balancing
performance reasons) need a way to partition sched domains down to
more reasonable sizes.
Partitioning sched domains interacts with cpusets in ways that are
still confounding us, and depends critcally on information that really
only system services such as the batch scheduler can actually provide
(such as which jobs are active.) And this, even at a mathematical
minimum, involves a partition of the systems cpus, which is a set of
subsets, or multiple cpumasks instead of just one. It's harder.
My sense is that your proposal, to try to get rid of, or not allow more
of, the tasks in the overlapping parent cpusets that were complicating
this effort, just makes life harder for such system services, forcing
them to work around our efforts to make life 'safe' for them.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/